Core skills

Troubleshooting scenarios

Universal debug loop

Detect — alert or user report; confirm impact.
Triage — severity, comms channel, incident commander if needed.
Mitigate — restore service (rollback, scale, bypass).
Diagnose — find root cause with evidence.
Fix — permanent change + prevent recurrence.
Review — blameless postmortem, tracked action items.

Scenario: “Site is slow”

Check the golden signals first: latency, traffic, errors, saturation.
Ask what changed: a recent deploy, a config change, a traffic spike, or a dependency outage.
Split the request path to localize the delay: edge (CDN), app, DB, third-party API.
Tools: APM traces, the slow query log, kubectl top, node disk usage and I/O.
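
As a rough illustration of the golden-signals check, the sketch below summarizes one time window of request samples. The Request record, the golden_signals helper, and the sample numbers are hypothetical; in practice an APM or metrics backend would supply these values, and saturation is passed in here as a single CPU utilization figure for simplicity.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    # Hypothetical record shape; real data would come from access logs or APM.
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize latency, traffic, errors, and saturation for one window."""
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "p95_latency_ms": percentile(latencies, 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": cpu_utilization,  # assumed to be measured elsewhere
    }


# Made-up sample: one slow 504 among otherwise normal responses.
sample = [Request(120, 200), Request(95, 200), Request(2400, 504), Request(110, 200)]
summary = golden_signals(sample, window_seconds=2, cpu_utilization=0.91)
print(summary)
```

Comparing a summary like this for the window before and after a suspected change is one way to tie "slow" to a specific event rather than guessing.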

Scenario: “Deploy failed halfway”

Compare CI logs with runtime state: did the image build while the rollout got stuck?
Inspect kubectl rollout status, Helm history, and the Terraform plan diff.
Common blockers: a stuck PVC, a rejecting admission webhook, an exhausted resource quota.
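
One way to structure the "how far did it get?" question is to look at replica counts. The sketch below is a first-pass label only, not a substitute for kubectl rollout status: classify_rollout is a hypothetical helper, and its three inputs are assumed to have been read from a Deployment's spec.replicas, status.updatedReplicas, and status.availableReplicas. It is meant for a rollout that has already stopped progressing, since a healthy rolling update passes through the same intermediate counts.

```python
def classify_rollout(desired, updated, available):
    """Return a coarse state for a rollout that has stopped progressing.

    The counts mirror spec.replicas, status.updatedReplicas, and
    status.availableReplicas on a Kubernetes Deployment.
    """
    if updated == 0:
        return "not started: check CI output and whether the new image exists"
    if updated < desired:
        return "partially rolled out: check resource quota and admission webhooks"
    if available < desired:
        return "updated but not available: check probes and stuck PVCs"
    return "complete"


# Hypothetical stalled rollout: 2 of 5 replicas updated.
print(classify_rollout(desired=5, updated=2, available=4))
```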

Scenario: “Intermittent 502 from load balancer”

Are target health checks failing or flapping?
Check connection limits, TLS mismatch between the load balancer and its targets, and timeouts that are too aggressive.
Compare logs from healthy and unhealthy targets.
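
To make the healthy-versus-unhealthy comparison concrete, a small sketch can compute the share of 502 responses per target. It assumes the load balancer logs have already been parsed into (target, status) pairs; the rate_502_by_target helper and the addresses are made up for illustration.

```python
from collections import defaultdict


def rate_502_by_target(records):
    """Map each target to the fraction of its responses that were 502."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for target, status in records:
        totals[target] += 1
        if status == 502:
            bad[target] += 1
    return {target: bad[target] / totals[target] for target in totals}


# Hypothetical parsed log entries: (target address, HTTP status).
records = [
    ("10.0.1.5", 200),
    ("10.0.1.5", 200),
    ("10.0.1.9", 502),
    ("10.0.1.9", 200),
]
rates = rate_502_by_target(records)
print(rates)
```

If the 502s concentrate on one target, that points at the instance itself; if they are spread evenly, shared settings such as timeouts or connection limits become the more likely suspects.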

Communication template

Impact: … | Started: … | Current status: investigating/mitigated | Next update: 15 min | Workaround: …
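
The template can be filled in by a small helper so that every update carries the same fields in the same order. This is a sketch only: status_update and its parameter names are hypothetical, the example values are invented, and the 15-minute default simply mirrors the template above.

```python
def status_update(impact, started, status, workaround, next_update_min=15):
    """Render one incident update in the template's field order."""
    allowed = {"investigating", "mitigated"}  # the two states the template lists
    if status not in allowed:
        raise ValueError(f"status must be one of {sorted(allowed)}")
    return (
        f"Impact: {impact} | Started: {started} | "
        f"Current status: {status} | "
        f"Next update: {next_update_min} min | Workaround: {workaround}"
    )


# Invented example values for illustration.
message = status_update(
    impact="checkout errors for some users",
    started="14:05 UTC",
    status="investigating",
    workaround="none yet",
)
print(message)
```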

What interviewers score

Structured thinking, not random guessing.
You reach for data and a rollback option before going down deep rabbit holes.
You close with prevention (test, monitor, automate).
