Editor’s note: Andras is a speaker for ODSC West 2019, this October 29 to November 1! Be sure to check out his talk, “Missing data in supervised machine learning,” there.
Datasets are almost never complete, and this can introduce various biases into your analysis. Because of these biases, your supervised machine learning model can produce incorrect predictions. The goal of this post is to give you an idea of why some of the most common approaches for dealing with missing values often introduce some type of bias. At ODSC West 2019, I will describe the methods and techniques that can help you arrive at an unbiased conclusion in the face of missing data.
Why is my data missing?
There are three mechanisms for missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
MCAR is the best scenario you can have. It means that the values are missing non-systematically and the data sample is still representative of the underlying distribution or population. For example, some survey respondents skip questions purely at random.
MAR means that the reason a value is missing is not related to the value itself but is conditional on other variables. For example, missing values in blood pressure data can be conditional on age. Older people are more likely than younger people to have their blood pressure measured during a regular check-up. This has nothing to do with the value of their blood pressure. The missingness pattern is conditional on another variable, namely the age of the patients.
Finally, MNAR is the most severe case. MNAR means that the reason a value is missing is related to the value of the variable itself. For example, depressed people are less likely to fill out a survey on depression because of their level of depression.
MCAR is the only type of pattern that can be verified statistically (Little, 1988); to the best of my knowledge, there are no tests to verify MAR and MNAR.
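To make the three mechanisms concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption made for illustration: the linear relationship between age and blood pressure, the logistic recording probabilities, and the function name `simulate_missingness` are hypothetical and not taken from any real dataset.

```python
import math
import random
import statistics


def sigmoid(x):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_missingness(n=20000, seed=0):
    """Return the true mean and the mean of the recorded values per mechanism."""
    rng = random.Random(seed)
    ages = [rng.uniform(20, 80) for _ in range(n)]
    # Hypothetical model: blood pressure rises with age, plus random noise.
    pressures = [100 + 0.5 * age + rng.gauss(0, 10) for age in ages]

    recorded = {"MCAR": [], "MAR": [], "MNAR": []}
    for age, bp in zip(ages, pressures):
        # MCAR: every value has the same 50% chance of being recorded.
        if rng.random() < 0.5:
            recorded["MCAR"].append(bp)
        # MAR: the chance of being recorded depends on age, not on the value.
        if rng.random() < sigmoid((age - 50) / 10):
            recorded["MAR"].append(bp)
        # MNAR: the chance of being recorded depends on the value itself
        # (high readings are less likely to be recorded).
        if rng.random() < sigmoid((125 - bp) / 10):
            recorded["MNAR"].append(bp)

    true_mean = statistics.fmean(pressures)
    recorded_means = {name: statistics.fmean(vals) for name, vals in recorded.items()}
    return true_mean, recorded_means


true_mean, recorded_means = simulate_missingness()
print(f"true mean: {true_mean:.1f}")
for name, value in recorded_means.items():
    print(f"{name}: mean of recorded values = {value:.1f}")
```

In this toy setup, the mean of the recorded values should stay close to the true mean only under MCAR. Under MAR the recorded sample over-represents older people, and under MNAR it over-represents low readings, so both recorded means drift away from the true mean.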
What can I do if I find myself dealing with an incomplete dataset?
One of the simplest approaches is to just delete the data points with missing values. If the data are MCAR, deletion does not add any bias but can lower the confidence of your machine learning model because the sample size is reduced. If the data are not MCAR, deletion will bias the outcome of the model; the predictions will be systematically off. The reason is that the remaining sample (the complete cases) is not representative of the underlying population.
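The MCAR case can be sketched in a few lines of standard-library Python. The records, the 60% missingness rate, and the helper names `drop_incomplete` and `standard_error` are hypothetical choices for illustration only.

```python
import math
import random
import statistics


def drop_incomplete(rows):
    """Listwise deletion: keep only rows that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]


def standard_error(values):
    """Standard error of the mean of a sample."""
    return statistics.stdev(values) / math.sqrt(len(values))


rng = random.Random(1)
# Hypothetical complete dataset of ages and blood pressure readings.
full = [{"age": rng.uniform(20, 80), "bp": rng.gauss(120, 15)} for _ in range(5000)]

# MCAR: blank out the blood pressure in roughly 60% of the rows, at random.
with_gaps = [dict(row, bp=None) if rng.random() < 0.6 else row for row in full]
complete_cases = drop_incomplete(with_gaps)

full_bp = [row["bp"] for row in full]
kept_bp = [row["bp"] for row in complete_cases]

print(f"rows before / after deletion: {len(full)} / {len(complete_cases)}")
print(f"mean before / after: {statistics.fmean(full_bp):.1f} / {statistics.fmean(kept_bp):.1f}")
print(f"standard error before / after: {standard_error(full_bp):.3f} / {standard_error(kept_bp):.3f}")
```

Under these assumptions the mean barely moves after deletion, but the standard error grows because fewer rows remain. That is the loss of confidence described above.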
Another option is to impute the missing values. Imputation means that predicted or representative values are filled in for the missing data. Most commonly, mean or median imputation is performed: the missing values in a variable are replaced with the mean or median of the observed values. If the data are MCAR, imputation will produce an overconfident model because the variance of the variable is artificially reduced by imputation. If the data are not MCAR, mean/median imputation can bias the results because the mean/median of the observed values can be significantly different from the mean/median of the underlying population.
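The variance shrinkage under MCAR is easy to see in a small sketch. The simulated readings, the 40% missingness rate, and the helper name `mean_impute` are assumptions made for illustration.

```python
import random
import statistics


def mean_impute(values):
    """Replace every missing (None) entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill_value = statistics.fmean(observed)
    return [fill_value if v is None else v for v in values]


rng = random.Random(2)
# Hypothetical blood pressure readings with a true standard deviation of 15.
true_values = [rng.gauss(120, 15) for _ in range(5000)]

# MCAR: remove roughly 40% of the values completely at random.
with_gaps = [None if rng.random() < 0.4 else v for v in true_values]
imputed = mean_impute(with_gaps)

true_sd = statistics.stdev(true_values)
imputed_sd = statistics.stdev(imputed)
print(f"standard deviation of the true values: {true_sd:.1f}")
print(f"standard deviation after mean imputation: {imputed_sd:.1f}")
```

Every imputed entry sits exactly at the mean, so the spread of the imputed variable is smaller than the spread of the true values even though the mean itself is preserved. A model trained on the imputed data sees less variability than actually exists.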
The above methods don’t sound great — can I do better?
Yes, join my session at ODSC West 2019! I will describe multivariate imputation using the MICE method and its limitations. I will showcase the small number of supervised machine learning tools that work with NaNs and can give an unbiased estimate even in the case of MNAR. And I will describe the reduced-features model (Saar-Tsechansky & Provost, 2007), also called the pattern submodel approach (Fletcher Mercaldo & Blume, 2018), which is robust to MNAR as well but may not fit well under certain conditions.
There is, unfortunately, no single best solution to missing data in supervised machine learning. However, if you are willing to put in some effort, you can remove biases from your model and improve the overall quality of your predictions.
Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198–1202.
Saar-Tsechansky, M., & Provost, F. (2007). Handling missing values when applying classification models. Journal of Machine Learning Research, 8(Jul), 1623–1657.
Fletcher Mercaldo, S., & Blume, J. D. (2018). Missing data and prediction: the pattern submodel. Biostatistics.