Abstract: Scikit-learn is a Python machine-learning library used by data science practitioners from many disciplines. We will start this training by learning about scikit-learn's API for supervised machine learning. scikit-learn's API mainly consists of three methods: fit to build models, predict to make predictions from models, and transform to modify data. This consistent and straightforward interface abstracts away the underlying algorithm, letting us focus on our particular problems. We will learn why splitting our data into train and test sets matters for model evaluation. Next, we will learn about combining preprocessing techniques with machine learning models using scikit-learn's Pipeline. The Pipeline allows us to connect transformers with a classifier or regressor to build a data flow where the output of one step is the input of the next. Finally, we will look at the pandas output API recently introduced in scikit-learn version 1.2. After this training, you will have the foundations to apply scikit-learn to your machine-learning problems.
Module 1: Introduction to Supervised Learning with Scikit-learn
We introduce the workflow for supervised machine learning and how this workflow fits into scikit-learn's API. We will learn the data representation used by scikit-learn and experiment with different machine learning models.
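As a rough illustration of this workflow, the sketch below splits a dataset into train and test sets, fits a model, and makes predictions. The choice of the iris dataset, LogisticRegression, and the max_iter and random_state values are assumptions made for this example, not necessarily what the workshop uses.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# scikit-learn represents data as a 2D array X of shape (n_samples, n_features)
# and a 1D target array y of shape (n_samples,).
X, y = load_iris(return_X_y=True)

# Hold out a test set so the model is evaluated on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# fit builds the model from the training data (illustrative model choice).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict returns one predicted label per test sample.
y_pred = model.predict(X_test)

# score reports mean accuracy on the held-out test set for classifiers.
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because every estimator shares the same fit and predict methods, swapping LogisticRegression for another classifier should only require changing the line that constructs the model.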
Module 2: Preprocessing
Scikit-learn's transformer API enables us to preprocess or transform the representation of our data. We will learn about the importance of preprocessing and how machine learning models behave with and without preprocessing.
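A minimal sketch of the transformer API follows, assuming StandardScaler as the preprocessing step, the wine dataset, and a k-nearest-neighbors classifier for the comparison. All three are illustrative choices for this example. The scaler learns its statistics from the training data only and then applies the same transformation to both sets.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit learns the per-feature mean and standard deviation from the training set.
scaler = StandardScaler()
scaler.fit(X_train)

# transform applies those training statistics to any data we pass in.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the same kind of model with and without preprocessing and compare.
knn_raw = KNeighborsClassifier().fit(X_train, y_train)
knn_scaled = KNeighborsClassifier().fit(X_train_scaled, y_train)

score_raw = knn_raw.score(X_test, y_test)
score_scaled = knn_scaled.score(X_test_scaled, y_test)
print(f"Without scaling: {score_raw:.3f}")
print(f"With scaling:    {score_scaled:.3f}")
```

Distance-based models such as k-nearest neighbors are often sensitive to feature scale, which is why they make a convenient comparison here. Other models may show little or no difference.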
Module 3: Pipelines
Scikit-learn's Pipeline greatly simplifies how we can express machine learning pipelines. In this section, we will learn how a Pipeline chains multiple preprocessing steps together and passes the transformed data into a machine-learning model.
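The sketch below shows one way such a chain might look, assuming two preprocessing steps (median imputation and standardization) followed by a classifier. The step names, the artificially injected missing values, and the model choice are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Blank out roughly 5% of the entries so the imputation step has work to do
# (purely for illustration; the original dataset has no missing values).
rng = np.random.default_rng(0)
X = np.where(rng.random(X.shape) < 0.05, np.nan, X)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) pair; the output of one step feeds the next.
pipe = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("classify", LogisticRegression(max_iter=1000)),
    ]
)

# fit runs fit_transform on each transformer in order, then fits the classifier.
pipe.fit(X_train, y_train)

# predict pushes the test data through the fitted transformers before predicting.
y_pred = pipe.predict(X_test)
pipeline_accuracy = pipe.score(X_test, y_test)
print(f"Pipeline test accuracy: {pipeline_accuracy:.3f}")
```

Keeping the preprocessing inside the Pipeline means the transformers are fitted on the training data only, which helps avoid leaking information from the test set.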
Module 4: Pandas Output
With version 1.2, scikit-learn transformers are configurable to output pandas DataFrames! This section will explore the Pandas output API and how this feature is helpful in a machine-learning pipeline.
We recommend a basic understanding of Python for this workshop.
Bio: Thomas J. Fan is a Staff Software Engineer at Quansight Labs and a maintainer of scikit-learn, an open-source machine learning library for Python. Previously, Thomas worked at Columbia University to improve interoperability between scikit-learn and AutoML systems. He is also a maintainer of skorch, a neural network library that wraps PyTorch. Thomas has a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University.
Thomas J. Fan
Staff Software Engineer | Quansight Labs