Finding training inefficiencies with CentML DeepView

May 16, 2024 41 min Free

Description

Performance bottlenecks and resource underutilization are common problems for deep learning researchers and developers: they slow down workflows and waste computational resources. Existing DL profilers often lack a developer-friendly way to understand training performance, and they offer few methods for reducing underutilization and improving performance.

This presentation showcases DeepView, an open-source visual profiler from CentML tailored for ML developers. DeepView provides intuitive performance visualizations and offers hints for making training jobs more efficient. It also uses performance predictions to help choose deployment targets that meet budget and time constraints, and it integrates with PyTorch and VS Code. The talk includes an interactive demo showing how to optimize the training of a real model with DeepView, achieving significant increases in training throughput.