Hailey Schoelkopf

Research Scientist at EleutherAI

    Hailey Schoelkopf (she/her) is a Research Scientist at EleutherAI, with interests in LLM pretraining, large-scale dataset design, and evaluation and benchmarking. She works to improve transparency around large-scale models such as LLMs, especially their datasets, and to understand how model trainers' choices, particularly in data curation and filtering, have downstream impacts and can be made more intentionally. Beyond training and open-sourcing LLMs such as the Pythia models, she maintains open-source libraries for the community: she is a core maintainer of EleutherAI's LM Evaluation Harness, a framework for standardizing and simplifying LLM evaluation, and a contributor to the GPT-NeoX LLM pretraining framework.

    All Sessions by Hailey Schoelkopf

    Day 2 extra event 04/24/2024
    5:45 pm - 9:30 pm

    Roulette Lightning Talks

    The East 2024 Roulette Lightning Talks offer a unique twist on presentations, injecting a dose of surprise and fun into the conference: speakers must be prepared to present any of the slides in the deck, regardless of whether a given slide is their own or not.

    Day 1 04/23/2024
    3:30 pm - 4:30 pm

    Practical Challenges in LLM Evaluation

    LLMs

    Tutorial: Evaluation is critical to both the development and the successful downstream deployment of LLMs. It lets you compare models for your use case, assess whether model quality is improving during development, and check how a model performs relative to others. However, many common pitfalls can plague evaluation, causing results to be subtly incorrect, unfair across models, or otherwise misleading or inaccurate. As a developer and researcher studying LLMs who maintains an open-source tool for orchestrating and standardizing LLM evaluation best practices, I'll discuss many of the common pitfalls and challenges to be aware of when evaluating models, ranging from inconsistencies in prompting, tokenization failures, overfitting to benchmarks, and data contamination to poor benchmark quality and more. We'll also cover the basic building blocks of evaluation, from free-form text generation, to perplexity, to loglikelihood-based multiple-choice evaluation, as well as other evaluation methods including model-based evaluation (LLM-as-a-judge), preference-based evaluations, adversarial testing, and multimodal and agent- or environment-based evals. You should leave this talk thoroughly aware of just how many things can go wrong in evaluation, and how to set yourself up for success in avoiding these common mistakes when evaluating your LLMs during development or for production use cases.
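Two of the building blocks mentioned above, perplexity and loglikelihood-based multiple choice, can be illustrated with a minimal sketch. This is not the LM Evaluation Harness's actual implementation; the function names and the per-token log-probabilities below are made up for illustration, standing in for scores a real model would assign to each candidate continuation given the prompt.

```python
import math

def score_choice(token_logprobs):
    """Sum per-token log-probabilities of one candidate continuation."""
    return sum(token_logprobs)

def pick_answer(choices):
    """Loglikelihood-based multiple choice: return the answer whose
    continuation the model assigns the highest total log-likelihood.

    `choices` maps answer text -> list of per-token log-probs
    (hypothetical values here; a real eval would query a model).
    """
    return max(choices, key=lambda c: score_choice(choices[c]))

def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Illustrative (made-up) log-probs for three answer choices:
choices = {
    "Paris": [-0.2, -0.1],
    "London": [-1.5, -0.9],
    "Berlin": [-2.0, -1.1],
}
print(pick_answer(choices))                    # Paris
print(round(perplexity(choices["Paris"]), 3))  # 1.162
```

One pitfall this sketch makes visible: raw summed log-likelihood systematically favors shorter answer choices, which is why evaluation frameworks commonly also report length-normalized variants of multiple-choice accuracy.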

    Open Data Science
    One Broadway
    Cambridge, MA 02142
