Missing Data in Supervised Machine Learning


Most implementations of supervised machine learning algorithms are designed to work with complete datasets, but real datasets are rarely complete. This mismatch is usually addressed in one of two ways: deleting points with missing elements, which loses potentially valuable information, or imputing (trying to guess the values of the missing elements), which can increase bias and lead to false conclusions. I will quickly review the three types of missing data (missing completely at random, missing at random, and missing not at random) and a couple of simple but often misleading ways to impute. I will spend most of the time describing three advanced methods for handling missing data: multiple imputation, the reduced-feature (aka pattern submodel) approach, and XGBoost, which is one of the few machine learning algorithms that work with incomplete datasets. We will discuss the advantages and limitations of each of these methods.

By the end of the workshop, you will have a deeper understanding of the intricate nature of missing data, and you will have multiple state-of-the-art techniques in your arsenal to deal with it. I will use Python to implement and visualize the techniques, relying on packages like pandas, scikit-learn, xgboost, and matplotlib or plotly for visualizations. The workshop is at the intermediate level. I assume that participants have encountered missing data before and have worked through at least one regression or classification problem.
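As a rough illustration of the three missingness types named in the abstract, the sketch below simulates each one on synthetic data and compares the mean of the remaining observed values with the true mean. It is not taken from the workshop materials: the `age` and `income` columns, the logistic missingness probabilities, and the helper `mask_income` are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (an assumption for illustration): income depends on age.
df = pd.DataFrame({"age": rng.normal(40, 10, n)})
df["income"] = 1000 * df["age"] + rng.normal(0, 5000, n)


def mask_income(prob_missing):
    """Hypothetical helper: hide each income value with its own probability."""
    return df["income"].mask(rng.random(n) < prob_missing)


# MCAR: every value has the same 30% chance of being missing.
mcar = mask_income(np.full(n, 0.3))

# MAR: missingness depends only on an observed variable (age).
p_mar = 1 / (1 + np.exp(-(df["age"].to_numpy() - 40) / 5))
mar = mask_income(p_mar)

# MNAR: missingness depends on the unobserved value itself (income).
p_mnar = 1 / (1 + np.exp(-(df["income"].to_numpy() - 40_000) / 5000))
mnar = mask_income(p_mnar)

true_mean = df["income"].mean()
summary = pd.DataFrame(
    {
        "fraction_missing": [s.isna().mean() for s in (mcar, mar, mnar)],
        # Series.mean() skips NaN, so this is what deletion would report.
        "mean_of_observed": [s.mean() for s in (mcar, mar, mnar)],
    },
    index=["MCAR", "MAR", "MNAR"],
)
print(f"true mean: {true_mean:.0f}")
print(summary.round(2))
```

In this setup, dropping the missing rows leaves the mean roughly unchanged under MCAR, while under MAR and MNAR the observed mean is pulled well below the true mean because high values are more likely to be hidden.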

Session Outline
1. Describe the three main types of missingness patterns
2. Evaluate simple approaches for handling missing values
3. Apply three advanced techniques to handle missing values (XGBoost, multivariate imputation, and the reduced features model)
4. Decide which approach is best for your dataset
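Two of the advanced techniques in the outline can be sketched with scikit-learn alone. The example below is a minimal sketch under stated assumptions, not the workshop's own code: the synthetic data, the choice of `LinearRegression`, the number of imputations `m = 5`, and pooling by averaging predictions are all illustrative choices. Multiple imputation is approximated here with scikit-learn's experimental `IterativeImputer` using `sample_posterior=True`, and the reduced-feature approach fits one submodel per missingness pattern found in the test set. XGBoost is left out to keep the dependencies small.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic regression data (an assumption): column 1 is correlated with column 0.
X_full = rng.normal(size=(n, 3))
X_full[:, 1] = 0.8 * X_full[:, 0] + 0.6 * X_full[:, 1]
y = X_full @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Hide about 20% of columns 1 and 2 completely at random; column 0 stays complete.
X = X_full.copy()
X[rng.random(n) < 0.2, 1] = np.nan
X[rng.random(n) < 0.2, 2] = np.nan
X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

# Multiple imputation: draw m imputed datasets, fit one model per dataset,
# then pool by averaging the predictions.
m = 5
mi_preds = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    model = LinearRegression().fit(imputer.fit_transform(X_train), y_train)
    mi_preds.append(model.predict(imputer.transform(X_test)))
mi_pred = np.mean(mi_preds, axis=0)

# Reduced-feature (pattern submodel) approach: for each pattern of observed
# features in the test set, train a model on only those features.
observed = ~np.isnan(X_test)
rf_pred = np.full(len(X_test), np.nan)
for pattern in np.unique(observed, axis=0):
    test_rows = (observed == pattern).all(axis=1)
    # Use every training row in which all features of this pattern are present.
    train_rows = ~np.isnan(X_train[:, pattern]).any(axis=1)
    submodel = LinearRegression().fit(
        X_train[train_rows][:, pattern], y_train[train_rows]
    )
    rf_pred[test_rows] = submodel.predict(X_test[test_rows][:, pattern])


def rmse(pred):
    """Root mean squared error against the held-out targets."""
    return float(np.sqrt(np.mean((pred - y_test) ** 2)))


print(f"multiple imputation RMSE: {rmse(mi_pred):.3f}")
print(f"reduced-feature RMSE:     {rmse(rf_pred):.3f}")
```

The reduced-feature loop never imputes anything, at the cost of training one model per missingness pattern, which can become expensive when many features have missing values.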

Background Knowledge
Python, scikit-learn (sklearn), and a general understanding of classical ML concepts and methods


Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization group at Brown University, Providence, RI. He works with high-level academic administrators to tackle predictive modeling problems, collaborates with faculty members on data-intensive research projects, and was the instructor of a data science course offered to the data science master's students at Brown.

Open Data Science




