Abstract: When faced with a new dataset for a machine learning task, there are common questions that every data scientist asks about the data, and common preprocessing and cleaning operations to perform. Doing this with pandas alone can be laborious and time-consuming, and inspecting every column and its interactions can become infeasible for larger datasets. In this talk, we'll see how the dabl library in Python, which provides interactive suggestions for data cleaning and visualization, can greatly simplify the dataset on-boarding process for machine learning. Using selection techniques specifically tailored to supervised learning, dabl detects data types, applies the required cleaning and preprocessing, and selects relevant visualizations automatically. This greatly speeds up the on-boarding of new datasets and provides immediate feedback and insights.
Bio: Andreas Mueller is a Principal Research SDE at Microsoft (previously Columbia, NYU, Amazon) and author of the O'Reilly book "Introduction to Machine Learning with Python", which describes a practical approach to machine learning with Python and scikit-learn. He is one of the core developers of the scikit-learn machine learning library and has been co-maintaining it for several years. Andreas is also a Software Carpentry instructor.