Abstract: According to industry surveys, the number one hassle for data scientists is cleaning data before analyzing it. We will focus on the specific problem of missing values in a prediction setting. We will show how machine-learning practice can be adapted to work with data tables that contain missing values. We will start from classic missing-value results from statistics and show how they transfer to supervised learning. We will then discuss specific machine-learning methods suited for prediction with missing values. From a statistical point of view, the supervised-learning setting leads to different tradeoffs than the classic statistical results.
Bio: Gaël Varoquaux is a research director working on data science and health at Inria (the French national research institute for computer science). His research focuses on using data and machine learning for scientific inference, with applications to health and social science, as well as developing tools that make it easier for non-specialists to use machine learning. He has long applied machine learning to brain-imaging data to understand cognition. Years before the NSA, he was hoping to make bleeding-edge data processing available across new fields, and he has been working on a mastermind plan to build easy-to-use open-source software in Python. He is a core developer of scikit-learn, joblib, Mayavi and nilearn, a nominated member of the PSF, and often teaches scientific computing with Python using the scipy lecture notes.