Intermediate Machine Learning with Scikit-learn: Cross-validation, Parameter Tuning, Pandas Interoperability, and Missing Values

Abstract: 

Scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. In this training, we will learn about cross-validation, tuning machine learning algorithms, handling missing values, and pandas interoperability. Cross-validation enables us to evaluate our machine learning models by splitting our data into multiple training and testing datasets. We will learn to handle missing values with imputation using univariate and multivariate techniques. Next, we will explore tuning algorithms in scikit-learn with grid search and random search. We will learn about categorical features and how to use scikit-learn's encoders to convert them into numerical features that a machine learning algorithm can consume. Finally, we will apply these techniques to a house pricing dataset with scikit-learn's Histogram-based Gradient Boosted Trees. This boosted tree implementation is inspired by LightGBM and has similar performance characteristics.

Session Outline
Module 0: Quick Review of scikit-learn
We start by reviewing scikit-learn's API for fitting and evaluating models. We revisit splitting data into training and test sets, preprocessing data, and pipelines.
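As a rough illustration of the review material, the sketch below splits a dataset, chains a preprocessing step with a model in a pipeline, and scores it on held-out data. The dataset (scikit-learn's bundled breast cancer data) and the choice of StandardScaler with LogisticRegression are assumptions made for this example, not necessarily what the session uses.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example dataset bundled with scikit-learn (an assumption for this sketch).
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set; stratify so class proportions match in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# The pipeline learns the scaling on the training data only,
# then applies the same scaling before predicting.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Wrapping the scaler and the model in one pipeline keeps the test data out of the preprocessing fit, which matters again once cross-validation is introduced.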

Module 1: Cross-Validation in scikit-learn
Cross-validation enables us to evaluate our machine learning models by splitting our data into multiple training and testing datasets. We cover cross-validation schemes such as K-Fold cross-validation and the importance of stratifying your data.
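A minimal sketch of stratified K-Fold cross-validation follows. The iris dataset and the LogisticRegression model are assumptions chosen to keep the example small. The loop over the splits shows what stratification does: every test fold keeps the same class balance as the full dataset.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Example dataset bundled with scikit-learn: 3 classes, 50 samples each.
X, y = load_iris(return_X_y=True)

# Five folds; shuffling with a fixed seed keeps the splits reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold; each fold serves as the test set once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")

# Class counts in each test fold, to check that stratification held.
fold_class_counts = [np.bincount(y[test_idx]) for _, test_idx in cv.split(X, y)]
print(fold_class_counts)
```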

Module 2: Parameter tuning
Next, we learn about tuning algorithms in scikit-learn with grid search, random search, and successive halving. These hyper-parameter search techniques help find configurations that are well suited to your data. We explore how to specify hyper-parameter search spaces when working with scikit-learn's Pipelines.
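The sketch below shows one way to run a grid search and a random search over a Pipeline. Parameters of a pipeline step are addressed as `<step name>__<parameter>`. The step names ("scaler", "svc"), the SVC model, the iris dataset, and the value ranges are all assumptions made for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step names are arbitrary labels; they prefix the parameter names below.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Grid search: tries every combination (3 values of C x 2 kernels = 6).
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    cv=5,
)
grid.fit(X, y)
print("grid search best:", grid.best_params_, round(grid.best_score_, 3))

# Random search: samples n_iter configurations from the distributions.
random_search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "svc__C": loguniform(1e-2, 1e2),
        "svc__gamma": loguniform(1e-3, 1e1),
    },
    n_iter=10,
    cv=5,
    random_state=0,
)
random_search.fit(X, y)
print("random search best:", random_search.best_params_)
```

Successive halving follows the same pattern through HalvingGridSearchCV and HalvingRandomSearchCV, which are experimental and require importing `enable_halving_search_cv` from `sklearn.experimental` first.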

Module 3: Missing values in scikit-learn
We learn to handle missing values with imputation using univariate techniques and multivariate techniques such as k-Nearest Neighbors. We cover missing value indicators and use parameter tuning to find the best imputer for your data.
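The following sketch illustrates these ideas under a few assumptions: the tiny array, the diabetes dataset with 10% of its entries masked at random, the Ridge model, and the step name "imputer" are all chosen for this example only. It shows univariate imputation, a missing value indicator, and a grid search that treats the imputer itself as a tunable pipeline step.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Univariate imputation: each NaN is replaced by its column mean.
X_small = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
mean_demo = SimpleImputer(strategy="mean").fit_transform(X_small)

# add_indicator=True appends one 0/1 column per feature that had missing values.
indicator_demo = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X_small)

# Larger example: hide about 10% of the entries of a bundled dataset.
X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline([("imputer", SimpleImputer()), ("model", Ridge())])

# Swap the whole "imputer" step to compare univariate and k-NN imputation.
search = GridSearchCV(
    pipe,
    param_grid={
        "imputer": [
            SimpleImputer(strategy="mean"),
            SimpleImputer(strategy="median", add_indicator=True),
            KNNImputer(n_neighbors=5),
        ]
    },
    cv=5,
)
search.fit(X_missing, y)
print("best imputer:", search.best_params_["imputer"])
```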

Module 4: Pandas Interoperability
The ColumnTransformer enables us to specify which columns of a pandas DataFrame each transformer is applied to. Specifically, we apply numerical transformations to numerical columns and encoders to categorical columns.

Background Knowledge
Python

Bio: 

Thomas J. Fan is a Senior Software Engineer at Quansight Labs, working to sustain and evolve the PyData open-source ecosystem. He is a maintainer of scikit-learn, an open-source machine learning library written for Python. Previously, he worked at Columbia University, improving the interoperability between scikit-learn and AutoML systems. Thomas holds a Master's in Physics from Stony Brook University and a Master's in Mathematics from New York University.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com
