Platform & cloud

Observability

Three pillars

Pillar   Tools (examples)                 Use for
Metrics  Prometheus, Grafana, CloudWatch  Rates, saturation, RED/USE
Logs     Loki, ELK, CloudWatch Logs       Forensics, audit
Traces   Jaeger, Tempo, X-Ray             Latency across services

SLO thinking (shows maturity)

  • SLI: what you measure (e.g. availability, p99 latency).
  • SLO: the target for an SLI (e.g. 99.9% over 30 days).
  • Error budget: the failure the SLO tolerates (1 minus the SLO); once it is spent, slow releases.
  • Alert on symptoms (user pain), not every CPU blip.
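
The error budget arithmetic is worth being able to do on a whiteboard. A minimal sketch, assuming a time-based budget for availability and a request-based budget for success ratio; the function names are hypothetical, not from any library:

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of total downtime the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget left (1.0 = untouched, below 0 = overspent)."""
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 500 failures out of 1,000,000 requests uses about half the budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))
```

A negative result from budget_remaining is the usual signal to pause feature releases and spend time on reliability.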

RED / USE cheatsheet

  • RED (services): Rate, Errors, Duration.
  • USE (resources): Utilization, Saturation, Errors.
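
To make RED concrete, here is a rough sketch that derives the three numbers from raw request records. The Request type, the red_summary name, and the choice to count only 5xx responses as errors are assumptions for illustration; in practice a metrics backend computes these from counters and histograms.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate (req/s), error ratio, and p99 duration for one time window."""
    n = len(requests)
    if n == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: rank = ceil(0.99 * n), in integer arithmetic.
    rank = -(-99 * n // 100)
    return {
        "rate": n / window_s,
        "error_ratio": errors / n,
        "p99_ms": durations[rank - 1],
    }


if __name__ == "__main__":
    sample = [Request(float(i), 500 if i <= 5 else 200) for i in range(1, 101)]
    print(red_summary(sample, window_s=10.0))
```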

Alert design

  • Page humans for actionable, urgent issues.
  • Link runbooks from alert annotations.
  • Avoid flapping: use a pending duration (for: 5m in Prometheus alerting rules), plus Alertmanager grouping and inhibition rules.
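
The idea behind a pending duration can be sketched in a few lines: the alert fires only once the condition has held continuously for the whole period, so a single bad sample never pages anyone. This is a simplified model for explanation, not how Prometheus evaluates rules internally; firing_times and its parameters are hypothetical names.

```python
def firing_times(
    samples: list[tuple[int, float]], threshold: float, for_s: int
) -> list[int]:
    """Return the timestamps at which the alert is firing.

    samples is a time-ordered list of (timestamp_seconds, value) pairs.
    """
    pending_since = None  # when the current unbroken breach started
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_s:
                firing.append(ts)
        else:
            # Any healthy sample resets the pending timer.
            pending_since = None
    return firing


if __name__ == "__main__":
    values = [0.1, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1]
    samples = [(i * 60, v) for i, v in enumerate(values)]
    # The one-minute blip at t=60 never fires; the sustained breach does.
    print(firing_times(samples, threshold=0.5, for_s=300))
```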

Interview scenario

“Error rate spiked after deploy”: compare canary vs stable, check traces for downstream timeouts, roll back if there is no quick fix, then run a postmortem with action items.
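
The canary vs stable comparison in this scenario can be reduced to a small decision rule. A minimal sketch, assuming you already have error and request counts for both versions; should_roll_back and every threshold below are illustrative assumptions, not recommended values.

```python
def should_roll_back(
    canary_errors: int,
    canary_total: int,
    stable_errors: int,
    stable_total: int,
    max_ratio: float = 2.0,        # assumed: canary may be at most 2x worse
    min_error_rate: float = 0.01,  # assumed: ignore error rates below 1%
    min_requests: int = 100,       # assumed: minimum canary traffic to judge
) -> bool:
    """Decide whether the canary looks clearly worse than stable."""
    if canary_total < min_requests:
        return False  # too little traffic to draw a conclusion
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / stable_total if stable_total else 0.0
    # Roll back only if the canary exceeds both the relative and absolute bar.
    return canary_rate > max(stable_rate * max_ratio, min_error_rate)


if __name__ == "__main__":
    print(should_roll_back(50, 1000, 100, 10_000))  # 5% vs 1%
    print(should_roll_back(12, 1000, 100, 10_000))  # 1.2% vs 1%
```

If both versions show the same elevated error rate, the deploy is probably not the cause, which is the cue to look at traces for a failing downstream dependency.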

Questions to ask them

  • On-call rotation and severity definitions?
  • Who owns dashboards and SLO reviews?
  • Log retention and PII handling?
