Thomas J. Fan


Senior Machine Learning Engineer at

    Thomas J. Fan is a Senior Machine Learning Engineer at and a maintainer for scikit-learn, an open-source machine learning library for Python. He led the development of scikit-learn's set_output API, which allows transformers to return pandas DataFrames. Previously, Thomas worked at Columbia University to improve interoperability between scikit-learn and AutoML systems. He also maintains skorch, a neural network library that wraps PyTorch. Thomas has a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University.

    All Sessions by Thomas J. Fan

    Day 2 04/24/2024
    12:10 pm - 12:40 pm

    Flyte: A Production-Ready Open Source AI Platform

    Machine Learning

    Day 1 04/23/2024
    2:00 pm - 4:30 pm

    Introduction to scikit-learn: Machine Learning in Python

    Machine Learning

    Scikit-learn is a Python machine learning library used by data science practitioners across many disciplines. We start this training with scikit-learn's API for supervised machine learning, which mainly consists of three methods: fit to build models, predict to make predictions from models, and transform to modify data. This consistent and straightforward interface abstracts away the algorithm, allowing us to focus on our domain-specific problems. First, we learn why splitting data into train and test sets matters for model evaluation. Then, we explore preprocessing techniques for numerical, categorical, and missing data, and see how preprocessing affects different machine learning models. For example, linear and distance-based models generally need standardization, but tree-based models do not. We explore how to use the pandas output API, which allows scikit-learn's transformers to output pandas DataFrames. The pandas output API lets us connect feature names with the state of a machine learning model. Next, we learn about the Pipeline, which connects transformers with a classifier or regressor to build a data flow where the output of one step is the input of the next. Lastly, we look at scikit-learn's Histogram-based Gradient Boosting model, which can natively handle numerical and categorical data with missing values. After this training, you will have the foundations to apply scikit-learn to your machine learning problems.

    Open Data Science




    One Broadway
    Cambridge, MA 02142
