Abstract: As the amount and complexity of data rapidly increases, machine learning tools are being used for a wide array of analytical tasks. These tasks include supervised and unsupervised prediction and forecasting, as well as sophisticated normalization and integration of heterogeneous data sets. Although machine learning has shown great promise in almost every area it has been applied to, mistaken assumptions about the data used to train such models can lead to erroneous evaluations and to models that do not actually work as well (or at all) in practice. In this session, we will talk concretely about five interrelated pitfalls that one might encounter when using supervised machine learning and how to avoid them. Importantly, these pitfalls are not domain specific: they can, and do, occur in every industry, and failing to appreciate their significance can cause projects to fail that would otherwise have succeeded.
This session will cover five statistical pitfalls:
1. Distributional differences
2. Dependency structure
3. Confounding variables
4. Information leakage
5. Unbalanced data
Each pitfall will be illustrated with an example, although the first and fourth pitfalls will be discussed in the most depth. By the end, the audience should have a conceptual understanding of what each of these pitfalls is and how to avoid it.
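To make the fourth pitfall concrete, below is a minimal sketch of information leakage using only the Python standard library and made-up numbers (the variable names and data are illustrative, not from the talk). The "leaky" version computes normalization statistics on the entire data set, so information from the test set bleeds into preprocessing; the correct version computes statistics on the training portion only and applies them unchanged to the test portion.

```python
import statistics

# Toy 1-D feature values, split into train and test portions.
values = [float(v) for v in range(20)]
train, test = values[:15], values[15:]

# Leaky: the mean is computed on ALL data, including the test set,
# so the test set has influenced preprocessing before evaluation.
leaky_mean = statistics.mean(values)

# Correct: statistics come from the training portion only, then are
# applied unchanged when scaling the held-out test portion.
train_mean = statistics.mean(train)
train_sd = statistics.stdev(train)
test_scaled = [(v - train_mean) / train_sd for v in test]
```

The gap between `leaky_mean` and `train_mean` is exactly the kind of subtle difference that inflates test-set performance without the modeler noticing.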
The audience should understand how machine learning models are trained and evaluated, i.e., fit on a training set and assessed on a separate test set, but does not need to know the mathematics behind how any particular model works. One may get more out of the talk if they have trained a model themselves, but that is not a requirement.
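For attendees who have not seen it before, the prerequisite train/test workflow can be sketched in a few lines of plain Python (the toy data and names here are illustrative assumptions, not part of the talk):

```python
import random

# Hypothetical toy data: 100 (feature, label) pairs.
random.seed(0)
data = [(random.random(), random.random() > 0.5) for _ in range(100)]

# Shuffle, then hold out 20% as a test set that the model never sees
# during training; performance is reported only on this held-out set.
random.shuffle(data)
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]
```

That is the entire conceptual prerequisite: training happens on `train_set`, and evaluation on `test_set` estimates how the model will behave on unseen data.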
Bio: Jacob Schreiber is a post-doctoral researcher at the Stanford School of Medicine. As a researcher, he has developed machine learning approaches to integrate thousands of genomics data sets, to design biological sequences with desired characteristics, and to describe how statistical pitfalls can arise and be accounted for in genomics data sets. As an engineer, he has contributed to the community as a core contributor to scikit-learn and as the developer of several machine learning toolkits, including pomegranate for probabilistic modeling and apricot for submodular optimization.