Abstract: Evaluation in machine learning (ML) product development is a rich topic with a long history. However, large language models (LLMs) represent a significant deviation from the known path and introduce many unknowns. Since the same LLM can be flexibly applied in a wide range of contexts, both with and without additional tuning, its evaluation must reflect this increased scope. Moreover, since LLMs output natural language instead of discrete classes, we must shift our evaluation focus from classic metrics like accuracy and F1 scores to complex concepts like usefulness, attribution, factuality, and safety.
Given this new paradigm, how can we build on long-standing best practices of evaluation, learn from academic research, and build solid evaluation pipelines for LLMs? Furthermore, we must consider the important role that humans play in model evaluations and determine what can be automated -- and whether it should be.
In this talk, I will discuss these questions alongside common pitfalls, opportunities, and best practices for including large language models as an additional ingredient in product development.
Bio: Sebastian Gehrmann is the Head of NLP in the Office of the CTO at Bloomberg, where he contributes to and guides the strategy for the development of language technology across the company. His research interests range from natural language generation to model evaluation. He has worked on large language models including BloombergGPT, BLOOM, PaLM, and PaLM 2.
Before joining Bloomberg, Sebastian was a senior researcher at Google. He holds a Ph.D. in computer science from Harvard University.