Abstract: Scikit-learn is a machine learning library in Python that is used by many data science practitioners. Machine learning is a valuable tool across many domains, such as medicine, physics, and finance. We will start this training by learning about scikit-learn's API for supervised machine learning. The API centers on three methods: fit, which builds a model; predict, which makes predictions from a fitted model; and transform, which changes the representation of the input data. Supervised machine learning models in scikit-learn that make predictions are called classifiers or regressors. Models that transform data are called transformers. This simple and consistent interface abstracts away the underlying algorithm, allowing us to focus on our particular problems. We will use this interface to apply traditional machine learning algorithms such as linear models and tree-based models. We will learn why splitting your data into train and test sets matters for model evaluation. Next, we will learn about preprocessing numerical data and its importance when working with linear models. Regularized linear models such as logistic regression can converge to a solution faster when the training data is scaled. Finally, we will learn how to combine these preprocessing techniques with a machine learning model by using a Pipeline. The Pipeline enables us to combine transformers with a classifier or regressor to build a data flow, where the output of one step is the input of another. After this training, you will be able to use and apply scikit-learn to your own machine learning problems.
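The workflow described in the abstract can be sketched in a few lines. This is a minimal illustration, not material from the training itself: the breast cancer dataset, the choice of StandardScaler and LogisticRegression, and the make_pipeline helper are all assumptions made here to keep the example small.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data: a small binary classification dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set so the model is evaluated on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Transformer: fit learns the scaling statistics, transform applies them.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Classifier: fit builds the model, predict makes predictions from it.
clf = LogisticRegression()
clf.fit(X_train_scaled, y_train)
y_pred = clf.predict(scaler.transform(X_test))

# Pipeline: the same two steps chained, so the scaler's output feeds the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Wrapping both steps in a pipeline means the scaler is fitted only on the training data and then reused unchanged on the test data, which avoids having to repeat the transform call by hand.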
Bio: Thomas J. Fan is a Staff Associate at the Data Science Institute at Columbia University. He is one of the core developers of scikit-learn, an open-source machine learning library written in Python. Thomas holds a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University. He also maintains skorch, a scikit-learn compatible neural network library that wraps PyTorch. He believes that developing open-source software is one of the best ways to maximize one's impact.
Staff Associate - Machine Learning | Columbia University in the City of New York