Evolution of ML Training and Serving Infrastructure @ Pinterest Ads
May 15, 2024
27 min
Free
training-infrastructure
serving-infrastructure
pinterest
ads-ranking
gpu-serving
machine-learning
ml-infrastructure
transformer-models
feature-engineering
data-processing
model-monitoring
mlops
Description
This talk traces the evolution of the training and serving infrastructure at Pinterest Ads over the past 5+ years. It covers the transition from logistic regression models to large transformer-based models served efficiently on GPUs, along with the challenges encountered and lessons learned. The presentation discusses the architectural changes made to scale the machine learning systems, including improvements to feature stores, model versioning with MLflow, unified training tables, and real-time monitoring. It also explores the shift toward PyTorch-based solutions, the development of templated ML workflows to improve developer velocity and experimentation, and a recent investment in a Ray-based architecture for faster dataset iteration.