The Day We Delete(d) Production

May 01, 2023 36 min Free

Description

This session recounts a critical incident at CERN in which a maintenance tool accidentally deleted a third of the organization's production Kubernetes capacity. The speakers explain how CERN recovered with minimal downtime, covering the architecture that kept services highly available, strategies for minimizing blast radius, the 'clusters as cattle' philosophy, and the role GitOps played in the recovery. They also share lessons learned, including how to deal with cyclic dependencies and why stateful workloads and multi-cluster scheduling need careful handling.