Supercharging Recommender Systems: Unleashing the Power of Distributed Model Training
May 15, 2024
22 min
Free
recommender-systems
distributed-training
data-parallelism
stitch-fix
machine-learning
pytorch
gpu-computing
model-training
deep-learning
scalability
python
Description
In this talk, Susrutha Gongalla, Principal Machine Learning Engineer at Stitch Fix, discusses how Stitch Fix evolved its model training architecture to use distributed model training. She starts with an overview of Stitch Fix and the motivation for this work, then dives into the technical details of data-parallel distributed training, and finally covers how the approach was productionized. The talk explores the challenges of handling large datasets and the move from single-GPU training to distributed data parallel (DDP) training with PyTorch Lightning for scalability and efficiency.
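For readers new to the idea, the core of data-parallel training can be sketched in a few lines of plain Python. This is a toy illustration, not the Stitch Fix implementation: the one-parameter linear model, the helper names (`local_gradient`, `shard_data`, `all_reduce_mean`, `ddp_step`), and the in-process "workers" are all assumptions made for the example. In real DDP each worker is a separate process, typically one per GPU, and the averaging is done by an all-reduce across processes.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (feature x, target y); hypothetical toy data


def local_gradient(w: float, shard: Sequence[Sample]) -> float:
    """Mean gradient of 0.5 * (w*x - y)**2 with respect to w over one shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def shard_data(data: Sequence[Sample], world_size: int) -> List[List[Sample]]:
    """Give each simulated worker an equally sized, disjoint slice of the data."""
    if len(data) % world_size != 0:
        raise ValueError("this sketch assumes the data divides evenly")
    return [list(data[rank::world_size]) for rank in range(world_size)]


def all_reduce_mean(values: Sequence[float]) -> float:
    """Stand-in for all-reduce: every worker ends up with the same mean."""
    return sum(values) / len(values)


def ddp_step(w: float, data: Sequence[Sample], world_size: int, lr: float) -> float:
    """One training step: local gradients, averaged, then one shared update."""
    shards = shard_data(data, world_size)
    grads = [local_gradient(w, shard) for shard in shards]  # one per worker
    return w - lr * all_reduce_mean(grads)  # identical update on every replica


if __name__ == "__main__":
    data = [(float(i), 2.0 * i) for i in range(1, 9)]  # targets follow y = 2x
    print(ddp_step(0.0, data, world_size=4, lr=0.01))
    print(ddp_step(0.0, data, world_size=1, lr=0.01))
```

With equally sized shards, averaging the per-worker gradients gives the same update as a single worker processing the full batch (up to floating-point rounding), which is why every model replica stays in sync while each worker only touches a fraction of the data.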