Abstract: Build your first Machine Learning models in Python using scikit-learn. Become well-versed in the entire scikit-learn suite to fit models, score models, make predictions from models, and fine-tune models. Algorithms sampled include Linear Regression, Logistic Regression, Decision Trees, Random Forests, and XGBoost. Use train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV, and feature_importances_ to improve your models. A brief introduction to pandas is included to load, visualize, and prepare your data.
All code is presented in Python via Jupyter Notebooks on GitHub. Datasets are provided, but attendees are encouraged to bring their own CSV files for real-world practice. The only prerequisite is proficiency in Python. If you have never built a Machine Learning model, this is an excellent place to start. By the end of this workshop you will confidently build, score, fine-tune, and make predictions from Machine Learning models in scikit-learn.
Module 1: Preparing data for Machine Learning with pandas
It's essential to load data properly to build successful Machine Learning models in scikit-learn. We cover loading data into pandas DataFrames, handling null values, transforming categorical columns into numerical columns, and choosing target and predictor columns.
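As a minimal sketch of these steps, here is one possible workflow using a small hypothetical DataFrame (standing in for a CSV you would load with pd.read_csv); the column names and fill strategy are illustrative assumptions, not a prescribed recipe:

```python
import pandas as pd

# Hypothetical data standing in for pd.read_csv("your_file.csv")
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Berkeley", "Oakland", "Berkeley", None],
    "target": [0, 1, 0, 1],
})

# Handle null values: fill numeric nulls with the median,
# drop rows missing a category
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["city"])

# Transform categorical columns into numerical columns (one-hot encoding)
df = pd.get_dummies(df, columns=["city"])

# Choose predictor columns (X) and the target column (y)
X = df.drop(columns="target")
y = df["target"]
print(X.columns.tolist())
```

The same pattern scales to real CSV files: inspect nulls with df.isna().sum(), then decide per column whether to fill or drop.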
Module 2: Supervised learning with scikit-learn
Scikit-learn's API for splitting data, training models, scoring models, and making predictions is very user-friendly. Scikit-learn includes a wide range of Machine Learning algorithms to choose from. We sample Linear Regression, Logistic Regression, Decision Trees, Random Forests, and XGBoost as you build your first Machine Learning models.
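The split–fit–score–predict pattern described above can be sketched in a few lines; this example assumes Logistic Regression on scikit-learn's built-in breast cancer dataset, but the same four calls work for any scikit-learn estimator:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a built-in dataset as predictors (X) and target (y)
X, y = load_breast_cancer(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Train the model, score it on unseen data, and make predictions
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
preds = model.predict(X_test)
print(score)
```

Swapping in DecisionTreeClassifier or RandomForestClassifier requires changing only the import and the constructor; XGBoost's XGBClassifier follows the same fit/score/predict API.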
Module 3: Cross-Validation with scikit-learn
Strong Machine Learning models often overfit the training data. Scikit-learn provides excellent cross-validation options to split your data into multiple training and test sets. We cover essential cross-validation practices such as K-Fold cross-validation and stratifying your data.
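A minimal sketch of K-Fold cross-validation with stratification, assuming a Decision Tree on the built-in iris dataset (the estimator and fold count are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# StratifiedKFold keeps class proportions balanced in every fold
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# cross_val_score trains and scores the model once per fold
scores = cross_val_score(DecisionTreeClassifier(random_state=2), X, y, cv=kfold)
print(scores.mean())
```

A large gap between the training score and the mean cross-validation score is a typical sign of overfitting.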
Module 4: Fine-tuning models with scikit-learn
Optimizing Machine Learning models requires understanding hyperparameter ranges and finding the best possible combinations. Scikit-learn includes powerful modules for full grid searches and random searches to find the combinations best suited to your data. We focus on fine-tuning tree ensembles with XGBoost.
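As a sketch of a full grid search, this example tunes a Random Forest with GridSearchCV (the parameter grid is an illustrative assumption; XGBoost's XGBClassifier can be dropped into the same code, and RandomizedSearchCV takes the same arguments plus n_iter for sampling random combinations):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

# A small hypothetical grid: every combination is cross-validated
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4]}

grid = GridSearchCV(RandomForestClassifier(random_state=2), param_grid, cv=3)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```

GridSearchCV refits the best combination on the full data, so grid.predict can be used directly after fitting.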
Module 5: Finding the Most Influential Columns with scikit-learn
Many businesses want to know the most influential attributes (columns) in predicting certain outcomes. Several scikit-learn algorithms including Random Forests and XGBoost include an attribute called feature_importances_ that ranks all columns by their numerical influence in making predictions. This is a great bonus that scikit-learn provides.
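A minimal sketch of ranking columns with feature_importances_, assuming a Random Forest on the built-in breast cancer dataset:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()

# Fit a tree ensemble; feature_importances_ is available after fitting
model = RandomForestClassifier(random_state=2).fit(data.data, data.target)

# Pair each importance with its column name and rank them
importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head())
```

The importances sum to 1, so each value can be read as a column's fractional share of the model's predictive influence.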
Bio: Corey Wade, MS Mathematics, MFA Writing & Consciousness, is the director and founder of Berkeley Coding Academy, an online program with live classes where teenagers learn Python Programming, Data Analytics, and Machine Learning. Author of Hands-on Gradient Boosting with XGBoost and scikit-learn, and lead author of The Python Workshop, Corey also teaches Math, Programming, and Data Science at Berkeley Independent Study. Corey has published iPhone apps with students, designed classes to build websites, and run after-school coding programs to support girls and underserved students. A Springboard Data Science graduate and multiple grant award-winner, Corey has also worked in industry developing Data Science curricula for Pathstream and Hello World while contributing articles for Towards Data Science. When not coding or teaching, Corey reads poetry and studies the stars.