Evolution of ML Training and Serving Infrastructure @ Pinterest Ads

May 15, 2024 | 27 min | Free

Description

This talk traces the evolution of ML training and serving infrastructure at Pinterest Ads over the past 5+ years, from logistic regression-based models to large transformer-based models served efficiently on GPUs, and highlights the challenges and lessons learned along the way. It covers the architectural changes made to scale the machine learning systems, including improvements to feature stores, model versioning with MLflow, unified training tables, and real-time monitoring. It also explores the shift toward PyTorch-based solutions, the development of templated ML workflows to improve developer velocity and experimentation, and a recent investment in a Ray-based architecture for faster dataset iteration.