Supercharging Recommender Systems: Unleashing the Power of Distributed Model Training

May 15, 2024 | 22 min | Free

Description

In this talk, Susrutha Gongalla, Principal Machine Learning Engineer at Stitch Fix, discusses how the team evolved its model training architecture to use distributed model training. She starts with an overview of Stitch Fix and the motivation for this work, then dives into the technical details of data-parallel distributed model training, and finally covers how this approach was productionized. The talk explores the challenges of handling large datasets and the move from single-GPU training to distributed data parallel (DDP) training with PyTorch Lightning to improve scalability and efficiency.
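
For readers new to the idea, the core of data-parallel training can be sketched in a few lines. This is not the code from the talk: it is a pure-Python simulation with no PyTorch or PyTorch Lightning calls, and every name in it (`local_gradient`, `all_reduce_mean`, `ddp_step`, the toy dataset) is hypothetical. It assumes a one-parameter linear model with a mean-squared-error loss and equal-sized shards. Each simulated worker computes a gradient on its own shard, the gradients are averaged (the role an all-reduce plays in DDP), and every worker applies the same update.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (feature x, target y); toy data, not from the talk


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Gradient of the mean of (w*x - y)^2 with respect to w, over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for the all-reduce step: every worker receives the mean gradient."""
    return sum(values) / len(values)


def ddp_step(w: float, data: List[Sample], num_workers: int, lr: float) -> float:
    """One simulated data-parallel step: shard, compute local gradients, average, update."""
    if len(data) % num_workers != 0:
        raise ValueError("this sketch assumes equal-sized shards")
    # Worker `rank` takes every num_workers-th sample, starting at its rank.
    shards = [data[rank::num_workers] for rank in range(num_workers)]
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(grads)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_single = ddp_step(0.0, data, num_workers=1, lr=0.01)  # single-worker baseline
w_ddp = ddp_step(0.0, data, num_workers=2, lr=0.01)     # two simulated workers
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so `w_single` and `w_ddp` agree. In a real DDP setup the workers are separate processes, typically one per GPU, and the averaging happens through collective communication rather than a Python list.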