Abstract: Gradient Boosting remains the most effective method for classification and regression problems on tabular data. This session is Part Two of two, covering advanced topics that are newer and may be less familiar. First, we will discuss how to calibrate the probabilities of classification models, reviewing the major techniques. Next, we will discuss Probabilistic Regression, wherein the goal is to predict the full probability distribution of the numerical target given the features, and demonstrate different approaches to this problem. Finally, we will present tools for Conformal Prediction, a hot topic that can provide prediction intervals with strong theoretical guarantees.
Section 1: Probability Calibration. The output of a classification model does not always "behave well" as a probability. This may be due to limitations of the modeling technique, overfitting, or model drift. We will compare and contrast various methods for post-hoc calibration, including Platt scaling, isotonic regression, beta calibration, and spline calibration.
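As a rough illustration of one of the methods named above, the following is a minimal, from-scratch sketch of isotonic regression via the pool-adjacent-violators algorithm. The function names (`pav_calibrate`, `apply_calibration`) and the scores and labels are hypothetical, chosen only for this example; it is not the code used in the session, and library implementations may differ (for example, by interpolating between blocks rather than using a step function).

```python
def pav_calibrate(scores, labels):
    """Fit isotonic regression of labels on scores (pool adjacent violators).

    Returns a list of (max_score_in_block, calibrated_probability) pairs,
    ordered by score, with non-decreasing probabilities.
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count, max_score]
    for s, y in pairs:
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def apply_calibration(blocks, score):
    """Map a raw score to the value of the first block whose range covers it."""
    for top, prob in blocks:
        if score <= top:
            return prob
    return blocks[-1][1]  # scores above the largest seen get the last value


# Hypothetical held-out model scores and true binary labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

blocks = pav_calibrate(scores, labels)
for top, prob in blocks:
    print(f"score <= {top:.2f} -> calibrated probability {prob:.3f}")
```

The fitted mapping is monotone by construction, which is the defining property of isotonic calibration. Platt scaling, beta calibration, and spline calibration instead fit smooth parametric or semi-parametric curves to the same kind of held-out data.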
Section 2: Probabilistic Regression. Standard regression models output a single point prediction. However, in many instances this fails to capture both the range and the shape of the uncertainty around the prediction. We will demonstrate several approaches to Probabilistic Regression and compare and contrast their strengths and weaknesses.
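To make the contrast with point prediction concrete, here is a small sketch of how a predicted distribution can be scored against observed targets. It assumes, purely for illustration, that each prediction is a Normal distribution with its own mean and standard deviation, and it uses the average negative log-likelihood as the score. The data and the names `gaussian_nll` and `mean_nll` are hypothetical; the session itself may use other distribution families and metrics.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of observing y under Normal(mu, sigma)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)


def mean_nll(ys, mus, sigmas):
    """Average negative log-likelihood over a set of predictions (lower is better)."""
    return sum(gaussian_nll(y, m, s) for y, m, s in zip(ys, mus, sigmas)) / len(ys)


# Hypothetical targets and predicted means; the third case is hard to predict.
ys = [10.2, 9.5, 30.0, 12.1]
mus = [10.0, 10.0, 20.0, 12.0]

# Two predictive distributions with identical means but different spreads.
constant_sigma = [0.5, 0.5, 0.5, 0.5]   # same uncertainty everywhere
adaptive_sigma = [0.5, 0.5, 8.0, 0.5]   # wider where the model is less sure

print("constant spread:", mean_nll(ys, mus, constant_sigma))
print("adaptive spread:", mean_nll(ys, mus, adaptive_sigma))
```

Both sets of predictions have the same point-prediction error, since the means are identical. Only the likelihood-based score distinguishes the forecast that admits its uncertainty on the hard case from the one that does not.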
Section 3: Conformal Prediction. The framework of conformal prediction (CP) provides strong theoretical guarantees for its prediction intervals and has become very popular in time series and other domains. We will discuss various "flavors" of CP, including split CP and Conformalized Quantile Regression (CQR), using the MAPIE package.
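The idea behind split CP can be sketched in a few lines of plain Python. This is a from-scratch illustration using absolute residuals as the conformity score; it deliberately does not use the MAPIE API, and the helper names, the stand-in model `predict`, and the calibration data are all hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration residual.

    If that rank exceeds n, the calibration set is too small for the requested
    coverage and the interval is unbounded.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(residuals)[k - 1]


def split_conformal_interval(predict, x_cal, y_cal, x_new, alpha=0.1):
    """Prediction interval for x_new from a fitted model and a calibration set.

    The calibration set must not have been used to fit `predict`.
    """
    residuals = [abs(y - predict(x)) for x, y in zip(x_cal, y_cal)]
    q = conformal_quantile(residuals, alpha)
    center = predict(x_new)
    return center - q, center + q


def predict(x):
    """Hypothetical already-fitted model, standing in for a boosted regressor."""
    return 2.0 * x


# Hypothetical calibration data with absolute errors of 0.1, 0.2, ..., 1.0.
x_cal = [float(i) for i in range(10)]
y_cal = [2.0 * x + 0.1 * (i + 1) for i, x in enumerate(x_cal)]

lo, hi = split_conformal_interval(predict, x_cal, y_cal, x_new=3.0, alpha=0.1)
print(f"90% prediction interval at x=3.0: [{lo:.2f}, {hi:.2f}]")
```

Intervals built this way have the same width for every input. CQR addresses that by conformalizing the output of a quantile regression model instead, so the interval width can vary with the features.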
All examples will be in Python using Jupyter notebooks. Students should have experience with using Gradient Boosting models in practice, but all are welcome to follow along.
Bio: Brian Lucena is Principal at Numeristical, where he advises companies of all sizes on how to apply modern machine learning techniques to solve real-world problems with data. He is the creator of three Python packages: StructureBoost, ML-Insights, and SplineCalib. In previous roles he has served as Principal Data Scientist at Clover Health, Senior VP of Analytics at PCCI, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions including UC-Berkeley, Brown, USF, and the Metis Data Science Bootcamp.