
Abstract: Missing values are ubiquitous in data analysis practice.
In this presentation, we will share our experience on the topic.
We will start with classical methods (single imputation, multiple imputation, likelihood-based methods) developed in the inferential framework, where the aim is to estimate the parameters and their variance as well as possible in the presence of missing data.
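To give a flavor of the multiple-imputation workflow described above (this is not the session material itself), here is a minimal Python sketch: it draws several imputed datasets with scikit-learn's IterativeImputer, fits the analysis model on each with statsmodels, and pools point estimates and variances with Rubin's rules. The data, model, and library choices are purely illustrative assumptions.

import numpy as np
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Illustrative data: y depends linearly on two correlated covariates.
n = 500
X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n)
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Remove 20% of the first covariate completely at random.
X_miss = X.copy()
X_miss[rng.random(n) < 0.2, 0] = np.nan

# Draw m imputed datasets, fit the analysis model on each, then pool.
m, estimates, within = 5, [], []
for s in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=s)
    X_imp = imputer.fit_transform(X_miss)
    fit = sm.OLS(y, sm.add_constant(X_imp)).fit()
    estimates.append(fit.params)
    within.append(fit.bse ** 2)

estimates, within = np.array(estimates), np.array(within)
q_bar = estimates.mean(axis=0)           # pooled coefficients
w_bar = within.mean(axis=0)              # within-imputation variance
b = estimates.var(axis=0, ddof=1)        # between-imputation variance
total_var = w_bar + (1 + 1 / m) * b      # Rubin's rules: total variance
print("pooled coefficients:", q_bar)
print("pooled standard errors:", np.sqrt(total_var))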
Then we will present recent results in a supervised-learning setting. A striking result is that naive imputation strategies (such as mean imputation) can be optimal, as the supervised-learning model does the hard work. That such a simple approach can be relevant may have important consequences in practice. We will also discuss how missing-value modeling can be readily incorporated into tree models, such as gradient-boosted trees, giving a learner that has been shown to perform very well, including in difficult missing-not-at-random settings.
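As a minimal illustration of these two strategies (again, not the session notebooks), the following Python sketch compares a mean-imputation pipeline with scikit-learn's HistGradientBoostingRegressor, which handles missing entries natively by routing them to a child node during splits; the synthetic data and choice of learners are illustrative assumptions only.

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
y = X[:, 0] ** 2 + X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)
X[rng.random((n, 3)) < 0.2] = np.nan  # 20% of entries missing at random

# Strategy 1: naive mean imputation, then let the learner do the hard work.
impute_then_learn = make_pipeline(SimpleImputer(strategy="mean"), Ridge())

# Strategy 2: gradient-boosted trees with built-in missing-value handling.
gbt = HistGradientBoostingRegressor(random_state=0)

for name, model in [("mean imputation + ridge", impute_then_learn),
                    ("gradient-boosted trees", gbt)]:
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {score:.2f}")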
Notebooks in R and Python will be presented.
Session Outline
1) Types of missing values
2) Single imputation
3) Multiple imputation
4) Supervised learning with missing values
Background Knowledge
Basic statistics and machine-learning knowledge (linear regression, PCA, random forests, etc.)
Bio: Julie Josse is a senior researcher in statistics and machine learning applied to health at Inria, a French research institute in digital sciences, and Professor at Ecole Polytechnique (Paris). She is an expert in the treatment of missing values (inference, multiple imputation, matrix completion, MNAR, supervised learning with missing values) and has created a website on the topic for users (https://rmisstastic.netlify.app/). Her research also focuses on causal-inference techniques (causal inference with missing values, combining RCTs and observational data) for personalized medicine. Julie Josse is dedicated to reproducible research with the R statistical software: she has developed packages, including FactoMineR and missMDA, to transfer her methodological work to practice.