
Abstract: It’s practically dogma today that a model's best day in production will be its first day in production. Over time, model performance degrades, and many variables can cause that decay, from real-world behavior changes to data drift.
When models misbehave, we often turn to retraining to fix the problem, but is the most recent data really the best data for resolving performance issues and getting the model back on track? We all acknowledge the need for data-driven machine learning monitoring that pinpoints anomalies and uncovers their root cause, so issues can be resolved quickly before they impact the business. Yet when it comes to resolution through retraining, data selection and the choice of retraining strategy are anything but data-driven. Today, when faced with retraining, many data teams simply select the last month or two of data to retrain on and hope that fresh really is best.
In this talk, we'll use ML monitoring and notebooks to show how data scientists and ML engineers can find the best mix of data and retraining strategy to resolve machine learning performance issues. This data-driven, production-first approach enables more thoughtful retraining choices and shorter, leaner retraining cycles, and it can be integrated into MLOps CI/CD pipelines for continuous model retraining upon anomaly detection.
Session Outline:
What you will learn from this talk:
- Retraining groups and temporal similarity
- Drifted features and preprocessing
- Drifted segments and model split
- Pipeline anomaly exclusion
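To give a flavor of the data-driven selection described above, here is a minimal sketch of choosing a retraining window by drift similarity. It assumes the Population Stability Index (PSI) as the drift metric and a single numeric feature; the function names and thresholds are illustrative, not the talk's actual implementation.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature.
    Values near 0 mean the distributions match; > 0.2 is a common
    rule-of-thumb threshold for significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clip bin proportions away from zero to avoid log(0) on empty bins.
    e_pct = np.clip(np.histogram(expected, bins=edges)[0] / len(expected), 1e-6, None)
    a_pct = np.clip(np.histogram(actual, bins=edges)[0] / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def least_drifted_window(candidate_windows, production_sample, bins=10):
    """Return the index of the candidate retraining window whose feature
    distribution is closest (lowest PSI) to current production data,
    along with all the scores."""
    scores = [psi(w, production_sample, bins) for w in candidate_windows]
    return int(np.argmin(scores)), scores
```

For example, given three historical windows drawn from shifted distributions, the window most similar to today's production traffic wins, rather than simply the most recent one.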
A public repo and notebook will be provided to attendees so they can apply production-first retraining in their own machine learning monitoring.
Bio: Oryan is a Lead Software Engineer with a passion for Machine Learning and DevOps, and seven years of experience developing services for production and development environments and leading teams.