Abstract: The values of a categorical variable frequently have a structure that is neither ordinal nor linear. For example, the months of the year have a circular structure, and the US states have a geographical structure. Standard approaches such as one-hot or numerical encoding cannot effectively exploit this structural information. In this tutorial, we will introduce the StructureBoost gradient boosting package, in which the structure of a categorical variable can be represented by a graph and exploited to improve predictive performance. Moreover, StructureBoost can make informed predictions on categorical values for which there is little or no data by leveraging knowledge of the structure. We will walk through examples of how to configure and train models using StructureBoost and demonstrate other features of the package.
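To make the idea of a graph-represented categorical variable concrete, here is a minimal sketch of the months example as a cycle graph. It uses plain Python dictionaries and sets only; the names `MONTHS`, `cycle_graph`, and `month_graph` are illustrative assumptions and are not part of the StructureBoost API, which has its own way of specifying graphs.

```python
# Illustrative only: plain-Python adjacency sets, NOT the StructureBoost API.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def cycle_graph(values):
    """Return an adjacency dict linking each value to its two circular neighbors."""
    n = len(values)
    return {
        v: {values[(i - 1) % n], values[(i + 1) % n]}
        for i, v in enumerate(values)
    }


month_graph = cycle_graph(MONTHS)

# A numerical encoding (Jan=0, ..., Dec=11) places December and January
# at opposite ends of the scale.
numeric_gap = abs(MONTHS.index("Dec") - MONTHS.index("Jan"))

# The graph instead records them as direct neighbors.
dec_jan_adjacent = "Jan" in month_graph["Dec"]

print(numeric_gap, dec_jan_adjacent)
```

The same adjacency idea would carry over to the US states example, with an edge between each pair of states that share a border, although that graph is no longer a simple cycle.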
Section 1: Structured Categorical Decision Trees. We will review how to extend the standard decision tree to accept structured categorical variables. This extension is the theoretical underpinning of StructureBoost.
Section 2: Configuring and using StructureBoost. Working through a Jupyter notebook, we will demonstrate how to fit models and make predictions with StructureBoost on real datasets involving structured categorical variables.
Section 3: Advanced features and capabilities of StructureBoost. We will dive deeper into the advantages of StructureBoost over other boosting packages and demonstrate some of its more advanced features.
Attendees should be familiar with the Python toolkit: numpy, pandas, scikit-learn, etc.
Attendees should be familiar with the fit -> predict -> evaluate workflow of model creation using train/test splits. Experience with gradient boosting or random forests in particular will be helpful.
Bio: Brian Lucena is a Principal at Lucena Consulting and a consulting Data Scientist at Agentero. An applied mathematician in every sense, he is passionate about applying modern machine learning techniques to understand the world and act upon it. In previous roles, he has served as SVP of Analytics at PCCI, Principal Data Scientist at Clover Health, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions, including UC Berkeley, Brown, USF, and the Metis Data Science Bootcamp.