Production incident drills
War-room drills for on-call practice — terminal alerts, live metrics, Slack-style updates, and timed decisions. Score MTTR pressure, blast radius, and communication. No login required.
-
Cascading 'Poison Pill' OOM-Kill Loop
Hard: A malformed data payload is serialized into a shared global cache. Every microservice pod that fetches it encounters an infinite memory-allocation bug during deserialization, triggering an instant kernel OOM-kill. As Kubernetes replaces the dead pods, they immediately pull the same key and die, paralyzing the cluster.
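
One defensive pattern relevant to this drill is to treat cached bytes as untrusted input, so a bad entry degrades to a cache miss instead of taking the process down. The sketch below is a minimal illustration, assuming JSON payloads; `safe_decode` and `MAX_PAYLOAD_BYTES` are hypothetical names, and a size cap alone would not stop every allocation bug in a real deserializer.

```python
import json
from typing import Any, Optional

# Hypothetical cap; a real limit depends on the largest legitimate payload.
MAX_PAYLOAD_BYTES = 1_000_000


def safe_decode(raw: bytes, max_bytes: int = MAX_PAYLOAD_BYTES) -> Optional[Any]:
    """Decode a cached JSON payload, returning None if it looks unsafe or malformed."""
    # Reject oversized payloads before handing them to the parser.
    if len(raw) > max_bytes:
        return None
    try:
        return json.loads(raw)
    except (ValueError, RecursionError):
        # ValueError covers invalid JSON and invalid UTF-8;
        # RecursionError covers pathologically deep nesting.
        return None
```

A caller would treat `None` as a miss and fall back to the source of truth, ideally also flagging the key so it can be deleted or overwritten in the shared cache.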
-
Cascading Cache Stampede
Hard: A high-traffic cache key expires, causing a massive wave of concurrent requests to slam the primary database. Manage database connection exhaustion and implement protective application patterns under load.
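
One protective pattern commonly used against a stampede is request coalescing (often called single-flight): when many callers miss on the same key, only one runs the expensive load and the rest wait for its result. The following is a minimal in-process sketch, assuming a thread-based service; `SingleFlightCache` and its methods are hypothetical names, not part of any particular library.

```python
import threading
from typing import Any, Callable, Dict

_MISSING = object()  # sentinel so a cached None is still a hit


class SingleFlightCache:
    """Coalesces concurrent misses on the same key into one loader call."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        # Fast path: serve a cached value without taking any lock.
        value = self._values.get(key, _MISSING)
        if value is not _MISSING:
            return value
        # Look up (or create) the lock dedicated to this key.
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded it while we waited.
            value = self._values.get(key, _MISSING)
            if value is _MISSING:
                value = loader()  # only one thread per key reaches the database
                self._values[key] = value
            return value

    def expire(self, key: str) -> None:
        """Drop a key, simulating a TTL expiry."""
        self._values.pop(key, None)
```

With this shape, twenty concurrent `cache.get("hot", load_from_db)` calls after an expiry would result in a single database query. Across multiple pods the same idea would need a distributed lock or staggered TTLs, which this sketch does not cover.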
-
DNS resolution failure
Easy: Services suddenly cannot resolve internal hostnames after a platform change. Trace the blast radius before rebooting everything.
-
Kubernetes CrashLoopBackOff
Medium: A deployment rolls out and new pods never become ready. Diagnose whether the cause is configuration, resources, or an upstream dependency.
-
Postgres replication lag
Medium: Read traffic is hitting a stale replica while writes pile up on the primary. Decide how you triage without making lag worse.
-
TLS certificate expired
Easy: Mobile clients and webhooks fail the TLS handshake after a certificate lapsed on the edge. Decide between renewal and rollback under pressure.
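
A lapse like this is usually caught earlier by monitoring the certificate's `notAfter` date. The sketch below uses only Python's standard library; the function names are hypothetical, and the alerting threshold is left to the caller.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter timestamp (negative if expired).

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400.0


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from a live endpoint.

    With the default verifying context, an already expired certificate
    makes the handshake raise ssl.SSLCertVerificationError instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after(host))` for each edge hostname and alert when the value drops below a chosen number of days.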