Evaluating generalisation in Natural Language Processing models


Good generalisation is of utmost importance for any artificial intelligence model. Traditionally, the generalisation capabilities of machine learning models are evaluated using random train/test splits. However, numerous recent studies have exposed substantial generalisation failures in models that perform well on such random within-distribution splits. For instance, a model classifying toxic language might work well for posts written by white male users, yet fail drastically on comments from black female users. If random splitting is thus a poor measure of how robustly models generalise to different scenarios, how should we evaluate that instead? In this talk, I discuss different types of generalisation and the challenges in evaluating them, focussing specifically on natural language processing models.
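The contrast the abstract draws between random splitting and evaluating on unseen sub-populations can be sketched in a few lines of Python. This is an illustrative toy example, not code from the talk; the dataset and the `group` attribute are invented for demonstration:

```python
import random

# Hypothetical toy dataset: each example carries a demographic "group"
# attribute (purely illustrative; not from the talk).
data = [{"text": f"post {i}", "group": "A" if i % 3 else "B"} for i in range(30)]

def random_split(examples, test_fraction=0.2, seed=0):
    """Standard i.i.d. split: the test set is drawn from the same
    distribution as the training set, so both mix all groups."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def group_split(examples, held_out_group="B"):
    """Out-of-distribution split: an entire group is withheld from
    training, so the test set probes generalisation to a sub-population
    the model has never seen."""
    train = [ex for ex in examples if ex["group"] != held_out_group]
    test = [ex for ex in examples if ex["group"] == held_out_group]
    return train, test

train_iid, test_iid = random_split(data)
train_ood, test_ood = group_split(data)

# Under the group split, group B never appears in training.
assert {ex["group"] for ex in train_ood} == {"A"}
assert {ex["group"] for ex in test_ood} == {"B"}
```

A model scored on `test_iid` can look strong while still failing badly on `test_ood`, which is exactly the kind of gap that random within-distribution splits hide.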


Dieuwke Hupkes is a research scientist at Facebook AI Research (FAIR). Her work centers around the evaluation of models of natural language processing (NLP), with a specific focus on how such models can show more human-like behaviour, where they fail, and in which areas they still need improvement. In the recent past, she has focussed specifically on large language models (LLMs) and neural machine translation (NMT) models.

Open Data Science
One Broadway
Cambridge, MA 02142
