The Day We Delete(d) Production
May 01, 2023
36 min
Free
kubernetes
gitops
production-incidents
disaster-recovery
cloud-native
cluster-management
ci-cd
argo-cd
argo-workflows
automation
observability
Description
This session recounts a critical incident at CERN in which a maintenance tool accidentally deleted a third of the organization's production Kubernetes capacity. The speakers detail how CERN recovered with minimal downtime, focusing on the architecture that kept services highly available, strategies to minimize blast radius, the 'clusters as cattle' philosophy, and the crucial role GitOps played in the recovery. They also share lessons learned, including how to deal with cyclic dependencies and the need to handle stateful workloads and multi-cluster scheduling carefully.