
Abstract: Scikit-learn is a Python machine-learning library used by data science practitioners from many disciplines. In this workshop, we will learn about Pandas interoperability, categorical data, parameter tuning, and model evaluation. For Pandas interoperability, the ColumnTransformer applies data transformations to different columns of a Pandas DataFrame. As of version 1.2, all of scikit-learn's transformers can be configured to output Pandas DataFrames. Next, we will learn about categorical data and how to use scikit-learn's encoders to convert categorical features into numerical features for a machine learning algorithm to consume. We will then explore tuning algorithms in scikit-learn with grid search and random search. Model evaluation is an essential part of the machine learning workflow, so we will cover the metrics provided by scikit-learn and how to use the scoring API. Furthermore, we will use the plotting API to visualize a model's performance. Finally, we will use all the techniques we learned to train and evaluate a model on a house pricing dataset with Histogram-based Gradient Boosted Trees.
Session Outline:
Module 1: Pandas Interoperability
Pandas DataFrames are frequently used together with scikit-learn to build machine learning models. In this section, we learn how to use the ColumnTransformer to apply transformers to different columns of a Pandas DataFrame. We will also learn how to configure scikit-learn's transformers to output Pandas DataFrames.
Module 2: Categorical data
Many datasets are heterogeneous, consisting of both numerical and categorical features. In this section, we learn how to use scikit-learn's encoders to convert categorical features into numerical features for a machine-learning model to consume.
Module 3: Parameter Tuning
Scikit-learn's machine learning models expose a wide selection of hyper-parameters. This section explores the tuning algorithms available for selecting hyper-parameters that suit your data and machine learning pipelines.
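The sketch below shows one way grid search and random search might be applied to the same pipeline. The synthetic dataset, the logistic regression model, and the search ranges for `C` are illustrative assumptions, not the workshop's actual setup.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Grid search: tries every listed value. Pipeline parameters are
# addressed as "<step name>__<parameter name>".
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)

# Random search: samples n_iter candidates from a distribution.
rand = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},
    n_iter=8,
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```

Grid search is exhaustive over the listed values, whereas random search fixes the budget with `n_iter` and can cover a continuous range, which often helps when only a few hyper-parameters matter.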
Module 4: Model Evaluation
Model evaluation is an essential part of the machine learning workflow. In this section, we will cover a selection of metrics provided by scikit-learn and how to use the scoring and plotting API for model evaluation.
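As a hedged example of the metrics and scoring API, the sketch below computes metrics directly, through a named scorer, and across cross-validation folds. The dataset, the model, and the choice of accuracy and ROC AUC are assumptions made for illustration. The plotting API (for example `ConfusionMatrixDisplay.from_estimator`) additionally requires matplotlib and is left out of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, get_scorer, roc_auc_score
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Metric functions take true labels and predictions (or scores).
acc = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# The scoring API wraps a metric so it can be called as
# scorer(estimator, X, y); the same names work as `scoring=` strings.
scorer = get_scorer("roc_auc")
auc_from_scorer = scorer(model, X_test, y_test)

# Several metrics can be evaluated at once during cross-validation.
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=3,
    scoring=["accuracy", "roc_auc"],
)

print(acc, auc, auc_from_scorer)
print(cv_results["test_accuracy"], cv_results["test_roc_auc"])
```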
Background Knowledge:
We recommend a basic understanding of Python and scikit-learn for this workshop.
Bio: Thomas J. Fan is a Staff Software Engineer at Quansight Labs and a maintainer of scikit-learn, an open-source machine learning library for Python. Previously, Thomas worked at Columbia University to improve interoperability between scikit-learn and AutoML systems. He is also a maintainer of skorch, a neural network library that wraps PyTorch. Thomas has a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University.

Thomas J. Fan
Staff Software Engineer | Quansight Labs
