Large Model Quality and Evaluation


Large model development has required substantial changes to how ML quality is measured and evaluated, given the breadth of coverage, the scale, and the wide range of use cases that large models serve. This presentation discusses the challenges of evaluating large models, with case examples from large language models (LLMs) and large multimodal models in particular. Contrasting LLM development with NLP and Search quality evaluation from just a few years ago is instructive: human-in-the-loop development, benchmarking, and functional testing all played a significant role in earlier quality improvement efforts. These techniques are still relevant, but they are changing. Metrics for evaluating LLMs are evolving rapidly, and there are open questions about what constitutes "quality" for large models. The challenges include developing and using benchmarks that are robust to data contamination, creating human-based evaluation criteria, and assessing the responsibility of LLMs. The presentation also highlights some promising approaches to evaluating LLMs, including benchmarks, human input, and explicit consideration of factors such as complexity, multilinguality, and responsibility.
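
To make the data contamination concern concrete, the sketch below flags benchmark items that share a word n-gram with a training corpus. It is only an illustration and is not taken from the presentation: the function names, the whitespace tokenization, and the default n-gram length of 8 are all assumptions chosen for brevity.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenization, an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    benchmark_items: Iterable[str],
    training_docs: Iterable[str],
    n: int = 8,  # hypothetical default; real checks tune this value
) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training documents."""
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    items = list(benchmark_items)
    if not items:
        return 0.0

    # An item is flagged when any of its n-grams also occurs in the training text.
    flagged = sum(1 for item in items if ngrams(item, n) & train_grams)
    return flagged / len(items)
```

A high rate suggests that benchmark scores may reflect memorization rather than capability. Exact n-gram matching misses paraphrased or translated copies, so a check like this is best read as a lower bound on contamination.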

Session Outline:

How do you evaluate the quality of an LLM?
How do you evaluate the quality of a multimodal model?
What to measure and how?
What does human input look like?
What does a functional test look like?
What does responsibility evaluation look like?

The session draws on examples from open-source large models such as Gemma, Llama, and others.


Anoop Sinha is Research Director, AI & Future Technologies at Google. His current interests include data-centric AI and the frontier of user interfaces, with a focus on large model quality and applications that have the potential for beneficial impact on society. Prior to Google, he was at FAIR at Meta and led Machine Learning for Siri at Apple. Anoop has a PhD in Computer Science from UC Berkeley, with a focus on Human-Computer Interaction and AI, and a BS from Stanford University. After his PhD, he worked at McKinsey & Co., taught at UC Berkeley, and co-founded a venture-backed start-up.

Open Data Science



