Abstract: Class imbalance is extremely common in real-world datasets, yet many data scientists apply machine learning algorithms without accounting for it. Class imbalance violates key assumptions of many algorithms, which leads to flawed models. This talk will:
- Explain why such imbalances occur, drawing on research into the theory of class imbalance, class overlap, and noise.
- Describe evaluation metrics that can expose the effects of class imbalance (including ROC curves, precision, and recall).
- Discuss solutions, including pre-processing techniques (over- and under-sampling, weighting), modifications to existing algorithms, and post-processing techniques (threshold selection, cost-based classification).
You’ll leave this presentation with the ability to recognize class imbalance and the confidence to overcome it.
Bio: Samuel Taylor is a data scientist who has applied machine learning to problems such as ad targeting, music recommendation, and audience size prediction. With a background in data and software engineering, he has also provided technical leadership on data warehousing, ETL, and business automation efforts. Outside of work, he helps high school students learn to code and tries to teach computers sign language.