
Abstract: The real world is a constant source of ever-changing, non-stationary data, which ultimately means that even the best ML models will eventually go stale. Data distribution shifts, in all of their forms, are one of the major post-production concerns for any ML/data practitioner. As organizations increasingly rely on ML and need their models to perform as intended outside of the lab, the need for efficient debugging and troubleshooting tools in the ML operations world grows as well. That becomes especially challenging once common requirements of the production environment are taken into consideration, such as scalability, privacy, security, and real-time constraints.
Distribution shift issues, if left unaddressed, can mean significant performance degradation over time and can even render a model downright unusable. How can teams proactively assess these issues in their production environment before their models degrade significantly? To answer this question, traditional statistical methods and efficient data logging techniques must be combined into practical tools that enable distribution shift inspection and detection under the strict requirements a production environment entails.
In this talk, Data Scientist Felipe Adachi will cover different types of data distribution shifts in ML applications, such as covariate shift, label shift, and concept drift, and how these issues can affect your ML application. Furthermore, the speaker will discuss the challenges of enabling distribution shift detection in a lightweight and scalable manner by calculating approximate statistics for drift measurements. Finally, the speaker will walk through steps that data scientists and ML engineers can take to surface data distribution shift issues in a practical manner, rather than reacting to the impacts of performance degradation reported by their customers.
Session Outline:
For this 90-minute hands-on workshop, the following session outline is planned.
Session 1 - Data Distribution Shift
In this session, we’ll introduce the concept of data distribution shift and explain exactly why it is a problem for ML practitioners. We will cover the different types of distribution shift and how to measure them.
In this session, we will cover:
1. Data Distribution Shift
a. What is Data Distribution Shift?
b. Why is it a problem?
2. Types of Distribution Shift (With Definitions and Examples)
a. Covariate Shift / Concept Drift / Label Shift
3. How to Measure Drift
a. Visual Inspection / Validation / Statistical Tests
4. Notebook Hands-on: Detecting distribution shift with popular statistical packages (scipy/alibi-detect); see the sketch below
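To preview this hands-on portion, here is a minimal sketch of a two-sample Kolmogorov-Smirnov test with scipy; the synthetic arrays, the feature, and the significance level are illustrative assumptions, not workshop material.

```python
import numpy as np
from scipy import stats

# Reference data: the feature as seen during training (synthetic, for illustration)
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1_000)

# Production data: the same feature observed later, with a shifted mean
production = rng.normal(loc=0.5, scale=1.0, size=1_000)

# Two-sample Kolmogorov-Smirnov test compares the two empirical distributions
result = stats.ks_2samp(reference, production)
print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")

# A small p-value suggests the samples come from different distributions
alpha = 0.05  # illustrative significance level
if result.pvalue < alpha:
    print("Possible covariate shift detected for this feature.")
```

alibi-detect wraps comparable tests behind a detector interface (for example, a KSDrift detector fitted on a reference set whose predict call flags drift on new batches), which the notebook explores alongside scipy.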
Session 2 - Facing the Real World
In the real world, data might not always be readily available in the way we would like. In this session, we’ll cover several challenges presented by the real world and how we can leverage data logging to help us overcome them.
In this session, we will cover:
1. Challenges of the real world
a. Big Data / Privacy / Streaming & Distributed Systems
2. Data Logging
a. Principles of whylogs
i. Efficient / Customizable / Mergeable
3. Notebook Hands-on: Profiling data and inspecting results with whylogs (see the sketch below)
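As a rough preview of this profiling notebook, the sketch below logs a small, made-up DataFrame with whylogs and inspects the resulting profile; it assumes the whylogs v1 API, and the column names and values are purely illustrative.

```python
import pandas as pd
import whylogs as why

# Illustrative data standing in for a real production batch
df = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 6.7, 7.2],
    "alcohol": [9.4, 9.8, 10.1, 11.3],
})

# Logging builds a compact statistical profile of the data,
# not a copy of the raw rows
results = why.log(pandas=df)
profile_view = results.view()

# Inspect the profiled statistics (counts, types, distribution summaries)
print(profile_view.to_pandas())

# Profiles are mergeable: profile each batch or partition independently,
# then merge the small profiles instead of moving the raw data around
next_batch_view = why.log(pandas=df).view()
merged_view = profile_view.merge(next_batch_view)
```

Because only these lightweight profiles are shipped and merged, the approach scales to streaming and distributed settings without moving raw data around.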
Session 3 - Inspecting and Comparing Distributions with whylogs
In this session, we will explore the whylogs Visualizer Module and its capabilities, using the Wine Quality dataset as a use case to demonstrate distribution shifts. We will first generate statistical summaries with whylogs and then visualize the profiles with the visualization module.
This is a Hands-on Notebook Session.
In this session, we will cover:
1. Notebook Hands-on with the whylogs Visualizer Module (see the sketch after this list)
a. Introduction to the Visualizer module
b. Profiling data with whylogs
c. Generating Summary Drift Reports
d. Comparing distribution charts between the two profiles
e. Comparing histograms between the two profiles
f. Inspecting Feature Statistics
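A rough sketch of this flow appears below; it assumes the whylogs NotebookProfileVisualizer interface and uses two made-up slices as stand-ins for reference and target splits of the Wine Quality data, so the column names and values are placeholders only.

```python
import pandas as pd
import whylogs as why
from whylogs.viz import NotebookProfileVisualizer

# Two illustrative slices standing in for a reference (e.g., training)
# and a target (e.g., production) split of the Wine Quality dataset
reference_df = pd.DataFrame({"alcohol": [9.4, 9.8, 10.1, 11.3],
                             "quality": [5, 6, 6, 7]})
target_df = pd.DataFrame({"alcohol": [12.0, 12.4, 13.1, 12.7],
                          "quality": [4, 4, 5, 5]})

# Profile both datasets with whylogs
reference_view = why.log(pandas=reference_df).view()
target_view = why.log(pandas=target_df).view()

# Point the visualizer at the two profiles
viz = NotebookProfileVisualizer()
viz.set_profiles(target_profile_view=target_view,
                 reference_profile_view=reference_view)

# Summary drift report across all features (renders inline in a notebook)
viz.summary_drift_report()

# Per-feature comparisons between the two profiles
viz.double_histogram(feature_name="alcohol")      # numeric feature histograms
viz.distribution_chart(feature_name="quality")    # discrete feature distributions
viz.feature_statistics(feature_name="alcohol", profile="target")
```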
Session 4 - Data Validation
As discussed in previous sessions, data validation plays a critical role in detecting changes in your data. In this session, we will introduce the concept of constraints - ways to express your expectations of your data - and how to apply them to ensure the quality of your data.
This is a Hands-on Notebook Session.
In this session, we will cover:
1. Introduction to Constraints
2. Defining Data Constraints
3. Applying defined constraints to data
4. Generating Data Validation Reports (see the sketch below)
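To illustrate, here is a minimal sketch of defining and validating constraints against a whylogs profile; it assumes the whylogs v1 constraints API (a ConstraintsBuilder plus factory helpers such as greater_than_number and no_missing_values), and the column name and thresholds are illustrative.

```python
import pandas as pd
import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import greater_than_number, no_missing_values

# Illustrative batch of data to validate
df = pd.DataFrame({"alcohol": [9.4, 9.8, 10.1, 11.3]})
profile_view = why.log(pandas=df).view()

# Express expectations about the data as constraints on the profile
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(greater_than_number(column_name="alcohol", number=0))
builder.add_constraint(no_missing_values(column_name="alcohol"))
constraints = builder.build()

# Validate the profile against the constraints and inspect the results
print(constraints.validate())                      # True only if every constraint passes
print(constraints.generate_constraints_report())   # per-constraint pass/fail details
```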
Bio: Felipe is a Data Scientist at WhyLabs. He is a core contributor to whylogs, an open-source data logging library, and focuses on writing technical content and expanding the whylogs library in order to make AI more accessible, robust, and responsible. Previously, Felipe was an AI Researcher at WEG, where he researched and deployed Natural Language Processing approaches to extract knowledge from textual information about electric machinery. He holds a Master's degree in Electronic Systems Engineering from UFSC (Universidade Federal de Santa Catarina), with research focused on developing and deploying machine learning-based fault detection strategies for unmanned underwater vehicles. Felipe has published a series of blog articles about MLOps, Monitoring, and Natural Language Processing in publications such as Towards Data Science, Analytics Vidhya, and Google Cloud Community.