
Abstract: Scikit-learn is a Python machine learning library used by many data science practitioners. In this training, we will learn about cross-validation, tuning machine learning algorithms, and pandas interoperability. We will start with cross-validation, which lets us evaluate machine learning models by repeatedly splitting our data into training and testing datasets. We will cover cross-validation schemes such as K-Fold cross-validation and the importance of stratifying your data. Next, we will learn about tuning algorithms in scikit-learn with grid search and random search. These hyper-parameter search techniques help find hyper-parameter combinations that are suited to your dataset. We will learn how to specify hyper-parameter spaces when working with scikit-learn's Pipelines. Then we will learn about categorical features and how to use scikit-learn's encoders to convert them into numerical features that a machine learning algorithm can consume. We will learn how to handle heterogeneous data with scikit-learn and pandas DataFrames. scikit-learn's ColumnTransformer enables us to specify which columns of a DataFrame a given transformer is applied to. Specifically, we will learn how to apply numerical transformations to the numerical columns and encoders to the categorical columns. After that, we will learn how to handle missing values with imputation, using univariate techniques and a k-Nearest Neighbors approach. Finally, we will apply the machine learning techniques we have learned to a house pricing dataset with scikit-learn's Histogram-based Gradient Boosted Trees. scikit-learn's implementation of boosted trees is inspired by LightGBM and has similar performance characteristics.
Bio: Thomas J. Fan is a Staff Associate at the Data Science Institute at Columbia University. He is one of the core developers of scikit-learn, an open-source machine learning library written in Python. Thomas holds a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University. He also maintains skorch, a scikit-learn compatible neural network library that wraps PyTorch. He believes that developing open-source software is one of the best ways to maximize one's impact.

Thomas Fan
Title
Staff Associate - Machine Learning | Columbia University in the City of New York
