Core skills

Troubleshooting scenarios

Universal debug loop

  1. Detect — alert or user report; confirm impact.
  2. Triage — severity, comms channel, incident commander if needed.
  3. Mitigate — restore service (rollback, scale, bypass).
  4. Diagnose — find root cause with evidence.
  5. Fix — permanent change + prevent recurrence.
  6. Review — blameless postmortem, tracked action items.

Scenario: “Site is slow”

  • Check golden signals (latency, traffic, errors, saturation).
  • Recent deploy, config, traffic spike, dependency outage.
  • Split: edge (CDN), app, DB, third-party API.
  • Tools: APM traces, slow query log, kubectl top, node disk.

Scenario: “Deploy failed halfway”

  • CI logs vs runtime (image built but rollout stuck?).
  • kubectl rollout status, Helm history, Terraform plan diff.
  • Stuck PVC, admission webhook, resource quota.

Scenario: “Intermittent 502 from load balancer”

  • Target health checks failing?
  • Connection limits, TLS mismatch, timeout too aggressive.
  • Compare healthy vs unhealthy target logs.

Communication template

Impact: … | Started: … | Current status: investigating/mitigated | Next update: 15 min | Workaround: …

What interviewers score

  • Structured thinking, not random guessing.
  • You mention data and rollback before deep rabbit holes.
  • You close with prevention (test, monitor, automate).

← All topics Browse jobs