Tales from On-Call: Fun with Operating Etcd at Scale
May 01, 2023
32 min
Free
etcd
kubernetes
on-call
cluster-management
distributed-systems
cloud-native
eks
storage-quota
revision-divergence
memory-pressure
Description
This presentation covers lessons from operating etcd at scale, drawn from issues encountered by the Amazon EKS etcd team. The speakers walk through solutions to common problems: etcd out-of-memory conditions, storage quota management, detection of and recovery from revision divergence, and timeouts during operations such as defragmentation. They also discuss strategies for handling large requests and explain why understanding etcd's internal mechanisms matters for keeping clusters stable and performant.