Abstract: Modern statistics has become almost synonymous with machine learning, a collection of techniques that utilize today's incredible computing power. Combining supervised learning (regression-like models) and unsupervised learning (clustering), the field is supported by theory yet relies upon intelligent programming for implementation.
In this training session we will work through the entire process of training a machine learning model in R. We start with the scaffolding of cross-validation, then move on to exploratory data analysis, feature engineering, model specification, parameter tuning and model selection. Finally, we take the finished model and deploy it as an API in a Docker container for production use.
We will make extensive use of the tidymodels framework of R packages.
Lesson 1: Preparing Data for the Modeling Process
The first step in a modeling project is setting up the evaluation loop so we can properly measure a model's performance. To accomplish this we will learn the following tasks:
1. Load the data
1. Create train and test sets from the data using the rsample package
1. Create a cross-validation set from the train set using the rsample package
1. Define model evaluation metrics such as RMSE and log loss using the yardstick package
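The steps above might be sketched in R as follows. This is a minimal illustration, using the Sacramento housing data from the modeldata package purely as a stand-in dataset:

```r
library(rsample)
library(yardstick)

# stand-in dataset; any data frame with an outcome column works here
data(Sacramento, package = "modeldata")

# hold out a test set, stratified on the outcome
set.seed(1234)
data_split <- initial_split(Sacramento, prop = 0.8, strata = price)
train <- training(data_split)
test  <- testing(data_split)

# resample the training set into 10 cross-validation folds
cv_folds <- vfold_cv(train, v = 10, strata = price)

# metrics for a regression model; a classification model
# would use a set such as metric_set(mn_log_loss, roc_auc)
reg_metrics <- metric_set(rmse, mae)
```

Splitting before any other work ensures the test set stays untouched until final evaluation.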
Lesson 2: EDA and Feature Engineering
Before we can fit a model we must first understand the data by performing exploratory data analysis. After that we prepare the data through feature engineering, also called preprocessing or data munging. The primary steps we will learn include:
1. Perform summary EDA with dplyr
1. Visualize the data with ggplot2
1. Balance the data with the themis package
1. Impute or otherwise mark missing data with the recipes package
1. Perform data transformations with the recipes package
    1. Numeric centering and scaling
    1. Collapse noisy categorical data
    1. Handle new categorical values
    1. Convert categorical data into dummy (or indicator) variables
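The preprocessing steps above can be chained together in a single recipe. The sketch below assumes a training data frame named `train` with a factor outcome called `class` (the balancing step only applies to classification problems):

```r
library(recipes)
library(themis)  # recipe steps for class imbalance

rec <- recipe(class ~ ., data = train) |>
  step_impute_median(all_numeric_predictors()) |>           # fill in missing numeric values
  step_novel(all_nominal_predictors()) |>                   # handle categorical levels unseen in training
  step_other(all_nominal_predictors(), threshold = 0.05) |> # collapse rare, noisy categories
  step_normalize(all_numeric_predictors()) |>               # center and scale numerics
  step_dummy(all_nominal_predictors()) |>                   # indicator variables
  step_smote(class)                                         # balance classes with SMOTE
```

A recipe only declares the steps; they are estimated on the training data and applied to new data when the recipe is used inside a workflow.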
Lesson 3: Model Fitting and Parameter Tuning
Now we can begin fitting models. This involves defining the type of model, such as a penalized regression, random forest or boosted tree. This has been simplified thanks to the parsnip and workflows packages. Modern machine learning has essentially become an exercise in brute-forcing over tuning parameters, which we will do by combining the dials and tune packages with the previously created cross-validation set.
1. Define the model structure with the parsnip package
1. Set tuning parameter candidates with the dials package
1. Iterate over the tuning parameter candidates using the tune package to perform cross-validation
1. Identify the best model fit with the yardstick package
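As an illustration of these steps, the sketch below tunes a penalized regression. It assumes a recipe `rec`, cross-validation folds `cv_folds` and a metric set `reg_metrics` have already been created:

```r
library(parsnip)
library(workflows)
library(dials)
library(tune)

# penalized regression with both tuning parameters left open
mod_spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

# bundle preprocessing and model together
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(mod_spec)

# candidate tuning values from the dials default ranges
params <- grid_regular(penalty(), mixture(), levels = 5)

# fit every candidate on every cross-validation fold
tuned <- tune_grid(wf, resamples = cv_folds, grid = params, metrics = reg_metrics)

show_best(tuned, metric = "rmse")    # inspect the top candidates
select_best(tuned, metric = "rmse")  # pull out the winning parameters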
Lesson 4: Deploying the Model into Production
After we build various machine learning models we need to make them accessible to others. We use the plumber package to expose our model as a REST API that can be hosted in a Docker container.
1. Make predictions using the workflows package
1. Convert the model to an API using the plumber package
1. Bundle the model object and API code into a Docker container
1. Serve that container and use curl to make predictions
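A minimal plumber endpoint might look like the sketch below, assuming a fitted workflow was saved as `model.rds` next to the API file:

```r
# plumber.R -- minimal prediction API sketch
library(plumber)
library(workflows)

fitted_wf <- readRDS("model.rds")

#* Return predictions for JSON rows posted in the request body
#* @post /predict
function(req) {
  new_data <- jsonlite::fromJSON(req$postBody)
  predict(fitted_wf, new_data = new_data)
}
```

Locally this can be served with `plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)`; in production the same file and model object are copied into a Docker image and the container exposes that port.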
Bio: Jared Lander is the Chief Data Scientist of Lander Analytics, a data science consultancy based in New York City, the Organizer of the New York Open Statistical Programming Meetup and the New York R Conference, and an Adjunct Professor of Statistics at Columbia University. With a master's in statistics from Columbia University and a bachelor's in mathematics from Muhlenberg College, he has experience in both academic research and industry. His work for both large and small organizations ranges from music and fundraising to finance and humanitarian relief efforts. He specializes in data management, multilevel models, machine learning, generalized linear models and statistical computing. He is the author of R for Everyone: Advanced Analytics and Graphics, a book about R programming geared toward data scientists and non-statisticians alike, and is creating a course on glmnet with DataCamp.