How to Design and Build Resilient Machine Learning Systems

May 15, 2024 30 min Free

Description

The real world is messy. Systems fail, pipelines break, services go down, engineers push bugs, and users behave erratically. Software is hard precisely because these problems are inevitable. Effective systems must handle these events gracefully and degrade smoothly rather than fail catastrophically. Unfortunately, ML systems are more likely to break than bend. Just as a boxer who only punches a bag will fail in the ring, an ML model that only learns from clean data may fail in production. Most ML models are trained on clean data, and when failures occur, feature distributions can shift in ways that the model never saw during training. This can cause strange and unexpected behavior. In this talk, we will explore how to build resilience into ML systems. We will discuss several types of production-specific risks and how they tend to manifest. These risks are common across many domains, but we will primarily use examples from our experience at Abnormal Security to demonstrate how to detect, mitigate, and overcome them.