Troubleshooting scenarios
Universal debug loop
- Detect — alert or user report; confirm impact.
- Triage — severity, comms channel, incident commander if needed.
- Mitigate — restore service (rollback, scale, bypass).
- Diagnose — find root cause with evidence.
- Fix — permanent change + prevent recurrence.
- Review — blameless postmortem, tracked action items.
Scenario: “Site is slow”
- Check golden signals (latency, traffic, errors, saturation).
- What changed? Recent deploy, config change, traffic spike, or dependency outage.
- Isolate the slow layer: edge (CDN), app, DB, or third-party API.
- Tools: APM traces, slow query log, kubectl top, node disk.
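As a rough illustration of the golden-signals check, the sketch below summarizes latency, traffic, and errors from a window of request samples. The `Request` record, the `summarize` helper, and the choice of nearest-rank percentiles are assumptions made for this example, not part of any particular APM tool. Saturation is left out because it normally comes from resource metrics (CPU, memory, disk) rather than request logs.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request sample: latency in ms and HTTP status code."""
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(requests, window_s):
    """Summarize latency, traffic, and errors for one observation window."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
    }
```

Comparing the same summary for the window before and after a recent deploy is one quick way to tell whether the change is a plausible cause.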
Scenario: “Deploy failed halfway”
- CI logs vs runtime (image built but rollout stuck?).
- kubectl rollout status, Helm history, Terraform plan diff.
- Stuck PVC, admission webhook, resource quota.
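To make "rollout stuck" concrete, here is a minimal sketch that classifies a Deployment from its replica counts. It assumes the input is a dict shaped like the output of `kubectl get deployment NAME -o json`; the function name `rollout_state` and the returned strings are made up for this example. It approximates the kind of checks `kubectl rollout status` performs and skips details such as the generation check.

```python
def rollout_state(deployment):
    """Classify a Deployment dict as complete, in progress, or stuck."""
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)          # Kubernetes defaults to 1
    updated = status.get("updatedReplicas", 0)
    total = status.get("replicas", 0)
    available = status.get("availableReplicas", 0)

    # A Progressing condition with this reason means the rollout timed out.
    for cond in status.get("conditions", []):
        if (cond.get("type") == "Progressing"
                and cond.get("reason") == "ProgressDeadlineExceeded"):
            return "stuck: progress deadline exceeded"

    if updated < desired:
        return f"in progress: {updated}/{desired} replicas updated"
    if total > updated:
        return f"in progress: {total - updated} old replicas pending termination"
    if available < updated:
        return f"in progress: {available}/{updated} updated replicas available"
    return "complete"
```

If the state stays "in progress" with updated replicas that never become available, the pod events are the next place to look for a stuck PVC, a rejecting admission webhook, or an exhausted resource quota.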
Scenario: “Intermittent 502 from load balancer”
- Target health checks failing?
- Connection limits, TLS mismatch, or an overly aggressive timeout.
- Compare healthy vs unhealthy target logs.
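One way to compare targets is to compute the share of 502 responses per backend from load balancer access logs. The sketch below assumes the logs have already been parsed into `(target, status)` pairs; the function name and the record shape are hypothetical.

```python
from collections import Counter


def bad_gateway_rate_by_target(records):
    """Return {target: fraction of its responses that were 502}."""
    total = Counter()
    bad = Counter()
    for target, status in records:
        total[target] += 1
        if status == 502:
            bad[target] += 1
    return {target: bad[target] / total[target] for target in total}
```

If the 502s cluster on one or two targets, suspect those hosts (health checks, connection limits). If they are spread evenly, a shared setting such as a timeout or TLS configuration is a more likely suspect.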
Communication template
Impact: … | Started: … | Current status: investigating/mitigated | Next update: 15 min | Workaround: …
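The template can be turned into a small helper so every update has the same fields in the same order. This is a sketch only: the function name, the parameter names, and the default workaround text "none" are assumptions; the 15-minute cadence and the two status values come from the template above.

```python
def status_update(impact, started, status, next_update_min=15, workaround="none"):
    """Format an incident status line following the communication template."""
    allowed = {"investigating", "mitigated"}
    if status not in allowed:
        raise ValueError(f"status must be one of {sorted(allowed)}")
    return (
        f"Impact: {impact} | Started: {started} | Current status: {status} | "
        f"Next update: {next_update_min} min | Workaround: {workaround}"
    )
```

Rejecting unknown status values keeps updates consistent when several people post during the same incident.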
What interviewers score
- Structured thinking, not random guessing.
- You reach for data and a rollback before going down deep rabbit holes.
- You close with prevention (test, monitor, automate).