Enabling HPC & ML Workloads with the Latest Kubernetes Job Features

This talk presents new features in the Kubernetes Job API and demonstrates how they are used to run distributed Batch/AI/HPC workloads at scale. It draws on real-world experience from DeepMind and from the Flux Operator developed at Lawrence Livermore National Laboratory. The presentation showcases the Indexed Jobs feature for parallel workloads that require pod-to-pod communication, including distributed machine learning examples used by DeepMind. It also covers orchestrating HPC workloads with the Flux Operator, which creates "Mini Clusters" inside Kubernetes, built on Indexed Jobs, for managing batch workloads and APIs. Finally, the talk addresses pod failures in long-running workloads: Pod Failure Policy lets a Job keep running through disruptions while avoiding the cost of unnecessary retries when the failure is caused by a software bug.
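As a rough illustration of how the two features mentioned above fit together, the sketch below builds a `batch/v1` Job manifest as a plain Python dict. It is not taken from the talk. The helper name `indexed_job_manifest`, the container name `worker`, the image, the exit code 42, and the retry budget are all hypothetical choices made for the example.

```python
import json


def indexed_job_manifest(name, image, workers, bug_exit_code=42):
    """Build a batch/v1 Job manifest (as a dict) for an Indexed Job.

    All argument values are illustrative assumptions, not settings from the talk.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Indexed mode: each pod gets a stable index (0..workers-1),
            # exposed to the container as JOB_COMPLETION_INDEX.
            "completionMode": "Indexed",
            "completions": workers,
            "parallelism": workers,
            "backoffLimit": 6,  # assumed retry budget
            "podFailurePolicy": {
                "rules": [
                    # Fail the whole Job at once on an exit code that is
                    # assumed here to signal a non-retriable software bug.
                    {
                        "action": "FailJob",
                        "onExitCodes": {
                            "containerName": "worker",
                            "operator": "In",
                            "values": [bug_exit_code],
                        },
                    },
                    # Do not count disruptions (for example preemption or
                    # node drain) against backoffLimit.
                    {
                        "action": "Ignore",
                        "onPodConditions": [{"type": "DisruptionTarget"}],
                    },
                ]
            },
            "template": {
                "spec": {
                    # podFailurePolicy requires restartPolicy: Never.
                    "restartPolicy": "Never",
                    # For pod-to-pod DNS, a headless Service with this name
                    # must also exist (not created by this sketch).
                    "subdomain": name,
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = indexed_job_manifest("demo-train", "example.com/trainer:latest", 4)
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML. The order of the rules matters: they are evaluated in sequence and the first matching rule decides how a pod failure is handled.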


Description