Practical Challenges in LLM Evaluation


Evaluation is critical to both the development and successful downstream deployment of LLMs. It lets you compare models for your use case, assess whether model quality is improving during development, and check how a model performs relative to others.

However, many common pitfalls can plague evaluation, causing results to be subtly incorrect, unfair across models, or otherwise misleading. As a developer and researcher studying LLMs who maintains an open-source tool for orchestrating and standardizing LLM evaluation best practices, I'll discuss many of the common pitfalls and challenges to be aware of when evaluating models: inconsistencies in prompting, tokenization failures, overfitting to benchmarks, data contamination, poor benchmark quality, and more. We'll also go over the basic building blocks of evaluation, from free-form text generation, to perplexity, to loglikelihood-based multiple-choice evaluation, as well as other evaluation methods including model-based evaluation (LLM-as-a-judge), preference-based evaluations, adversarial testing, and multimodal, agent-based, and environment-based evals.
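To make one of these building blocks concrete: loglikelihood-based multiple-choice evaluation scores each candidate answer by the log-probability the model assigns to it, then picks the argmax. A minimal sketch of that scoring step (the per-token logprobs below are made-up toy values; in a real run they would come from scoring each answer continuation with the model):

```python
def pick_choice(choice_token_logprobs):
    """Given per-token log-probabilities for each answer continuation,
    return the index of the choice with the highest total loglikelihood,
    plus a length-normalized variant (dividing by token count) that
    reduces the bias toward shorter answers."""
    totals = [sum(lps) for lps in choice_token_logprobs]
    normalized = [sum(lps) / len(lps) for lps in choice_token_logprobs]
    return totals.index(max(totals)), normalized.index(max(normalized))

# Toy logprobs for three answer choices (hypothetical values):
choices = [
    [-0.5, -0.6],          # short continuation, best raw total
    [-0.3, -0.4, -0.6],    # longer, but most likely per token
    [-2.0, -2.5],          # unlikely continuation
]
best, best_norm = pick_choice(choices)
```

Note that the raw-total and length-normalized criteria can disagree (here they pick different choices), which is exactly the kind of seemingly minor implementation decision that makes results hard to compare across evaluation setups.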

You should leave this talk thoroughly aware of just how many things in evaluation can go wrong, and how to set yourself up for success in avoiding these common mistakes when evaluating your LLMs during development or for production use cases.

Session Outline:

- Module 1: Types of LLM Outputs

This will recap the relevant facts about LLMs needed to understand the rest of the tutorial. It will be the most (pseudo)code-heavy module and will follow a stripped-down portion of an evaluation library's internals to bring all participants up to speed.

- Module 2: How can we perform Evaluation?

This will cover the ways we can apply the different outputs we can elicit from LLMs to measure their properties and performance, and some of the archetypes that various existing evaluations fall into.

- Module 3: How can benchmarking go wrong? How can you prevent this?

This will cover many of the challenges faced by practitioners looking to reliably evaluate their models: for example, data contamination, overfitting or overoptimization ("Goodhart's Law"), and other challenges, paired with case studies showing where these failings have popped up concretely in the LLM evaluation field.

- Module 4: Best Practices For Evaluation

This module will cover how practitioners can avoid many of the challenges or pitfalls to the greatest extent possible using good practices for developing and using evaluation metrics.
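As one concrete example of the checks Module 3 motivates, a rough contamination screen looks for verbatim n-gram overlap between a benchmark item and training text. A minimal sketch (the whitespace tokenization and the choice of n are illustrative assumptions, not a prescribed method):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc, eval_item, n=8):
    """Flag an eval item if any n-gram of it (lowercased,
    whitespace-tokenized) also appears verbatim in the training
    document. The value of n is a tunable assumption; published
    decontamination reports commonly use values around 8 to 13."""
    train_tokens = train_doc.lower().split()
    eval_tokens = eval_item.lower().split()
    return bool(ngrams(train_tokens, n) & ngrams(eval_tokens, n))
```

Real contamination analyses are more involved (text normalization, fuzzy matching, running at corpus scale), but this is the core idea behind n-gram-overlap decontamination.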


Hailey Schoelkopf (she/her) is a Research Scientist at EleutherAI, with interests in LLM pretraining, large-scale dataset design, and evaluation and benchmarking. Her interests include improving transparency around large-scale models such as LLMs, especially their datasets, and understanding how model trainers' choices, especially in data curation and filtering, have downstream impacts and can be done more intentionally. Aside from training and open-sourcing LLMs such as the Pythia models, her work also includes maintaining open-source libraries for the community: she is a core maintainer of EleutherAI's LM Evaluation Harness framework for standardizing and easing LLM evaluation, and also a contributor to the GPT-NeoX LLM pretraining framework.

Open Data Science



