How to Design and Build Resilient Machine Learning Systems

May 15, 2024 30 min Free

Description

The real world is messy. Systems fail, pipelines break, services go down, engineers push bugs, and users behave erratically. Software is hard precisely because these problems are inevitable. Effective systems must handle these events gracefully and degrade smoothly rather than fail catastrophically. Unfortunately, ML systems are more likely to break than bend. Just as a boxer who only ever punches a bag will struggle in the ring, an ML model that only learns from clean data may fail in production. Most ML models are trained on clean data, so when failures occur, feature distributions can shift in ways the model never saw during training, causing strange and unexpected behavior. In this talk, we will explore how to build resilience into ML systems. We will discuss several types of production-specific risks and how they tend to manifest. These risks are common across many domains, but we will primarily draw on examples from our experience at Abnormal Security to demonstrate how to detect, mitigate, and overcome them.
