Abstract: Every software system experiences incidents. Service outages, data pipeline delays, sudden load, and hundreds of other risks threaten system uptime and damage the user experience. Mature development teams plan for these incidents and build software that adapts to unexpected changes in system behavior and availability. Teams working on safety-critical applications sometimes spend more time mitigating these risks than on all of their other work combined.
Unfortunately, this kind of risk mitigation is notoriously difficult in a machine learning system. A small change to model inputs can cause large and unexpected changes to model outputs. As a result, software incidents that touch ML systems tend to have a larger blast radius and longer recovery times. This critical vulnerability has slowed the adoption of machine learning technologies in safety-critical applications.
For machine learning to keep driving impact in new applications, we need to address this problem directly. We start with testing. ML is software, and good tests are an irreplaceable tool for building a resilient system. We will explore how to design end-to-end simulations that assess our models' resilience.
Next, we turn to feature encoding: some strategies are more resilient to sudden distribution shifts than others. For example, models trained with carefully chosen default values or hashed, bucketized features can be particularly resilient to localized feature outages. We will discuss the dynamics that drive this phenomenon.
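As a rough illustration of hashed, bucketized features with a default value, the sketch below maps each raw value to a stable bucket index and sends missing values to a reserved bucket. It is a minimal sketch and not the speaker's implementation: the names `NUM_BUCKETS`, `MISSING_BUCKET`, and `hash_bucket`, the bucket count, and the choice of MD5 as the hash are all assumptions made for the example.

```python
import hashlib

NUM_BUCKETS = 1024            # hypothetical size of the hash space
MISSING_BUCKET = NUM_BUCKETS  # reserved index used only for absent values


def hash_bucket(feature_name, value, num_buckets=NUM_BUCKETS):
    """Map a raw feature value to a stable bucket index.

    Missing values (None) go to a dedicated default bucket, so a feature
    outage produces an input the model has already seen in training
    rather than an arbitrary or out-of-range value.
    """
    if value is None:
        return MISSING_BUCKET
    key = f"{feature_name}={value}".encode("utf-8")
    # hashlib gives the same digest across processes and machines,
    # unlike Python's built-in hash(), which is salted per process.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


if __name__ == "__main__":
    print(hash_bucket("country", "US"))   # some index in [0, NUM_BUCKETS)
    print(hash_bucket("country", None))   # the reserved missing bucket
```

Under these assumptions, an embedding table would need `NUM_BUCKETS + 1` rows so that the reserved bucket has its own learned representation.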
Finally, an ML model is a reflection of the task we train it to solve. By deliberately introducing noise into the training process, we can build models that perform well even during software incidents. We will dig into the best strategies for engineering tasks that yield robust and resilient models.
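One simple form of training-time noise is to randomly blank out features so the model learns to cope with their absence. The sketch below shows that idea only; the abstract does not say which noise strategies the talk covers, and the function name `simulate_feature_outage`, the dictionary row format, and the `drop_prob` parameter are hypothetical.

```python
import random


def simulate_feature_outage(row, drop_prob, default=None, rng=None):
    """Return a copy of `row` with each feature independently replaced
    by `default` with probability `drop_prob`.

    `row` is assumed to be a dict of feature name -> value. The input
    dict is not modified.
    """
    rng = rng or random
    return {
        name: (default if rng.random() < drop_prob else value)
        for name, value in row.items()
    }


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    example = {"sender_domain": "example.com", "num_links": 3, "hour": 14}
    # Apply noise to each training example before encoding it.
    print(simulate_feature_outage(example, drop_prob=0.2, rng=rng))
```

The same function could also be applied to a held-out set at evaluation time, with a higher `drop_prob` or a fixed set of blanked features, as a crude simulation of an upstream outage.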
Bio: As the Head of Machine Learning at Abnormal Security, Dan builds cybercrime detection algorithms to keep people and businesses safe. Before joining Abnormal, Dan worked at Twitter, first as an ML researcher working on recommendation systems and then as the head of web ads machine learning. Before Twitter, Dan built smartphone sensor algorithms at TrueMotion and computer vision systems at the Serre Lab.