Observability
Three pillars
| Pillar | Tools (examples) | Use for |
|---|---|---|
| Metrics | Prometheus, Grafana, CloudWatch | Rates, saturation, RED/USE |
| Logs | Loki, ELK, CloudWatch Logs | Forensics, audit |
| Traces | Jaeger, Tempo, X-Ray | Latency across services |
SLO thinking (shows maturity)
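- SLI: what you measure (for example availability or p99 latency).
- SLO: the target for an SLI (for example 99.9% availability over a rolling 30 days).
- Error budget: the failure the SLO still permits (100% minus the SLO, about 43 minutes of downtime at 99.9% over 30 days); once it is spent, slow releases.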
- Alert on symptoms (user pain), not every CPU blip.
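The error-budget arithmetic is easy to get wrong under pressure, so here is a minimal sketch in Python. The function names `error_budget_minutes` and `budget_remaining` are hypothetical helpers written for this note, not part of any monitoring library, and the request counts are made-up illustration values.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of total unavailability the SLO permits over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# 99.9% over 30 days leaves roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999, 30), 1))  # 43.2

# Illustrative numbers: 250 failures out of 1,000,000 requests spends a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```

A negative result from `budget_remaining` means the budget is overspent, which is the usual trigger for slowing releases.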
RED / USE cheatsheet
- RED (services): Rate, Errors, Duration.
- USE (resources): Utilization, Saturation, Errors.
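To make RED concrete, the sketch below computes the three signals from a batch of request records. In practice a metrics backend does this for you. The `Request` type and `red_summary` function are hypothetical, the sketch assumes that HTTP 5xx counts as an error, and it uses a nearest-rank p99 as one of several reasonable duration summaries.

```python
from dataclasses import dataclass
from typing import List, Optional, TypedDict


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


class RedSummary(TypedDict):
    rate_per_s: float
    error_ratio: float
    p99_ms: Optional[float]


def red_summary(requests: List[Request], window_seconds: float) -> RedSummary:
    """Rate, Errors, Duration for the requests observed in one window."""
    n = len(requests)
    if n == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": None}
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: rank = ceil(0.99 * n), done in integer arithmetic.
    rank = (99 * n + 99) // 100
    return {
        "rate_per_s": n / window_seconds,
        "error_ratio": errors / n,
        "p99_ms": durations[rank - 1],
    }


# Illustrative data: 100 requests over 10 seconds, two of them failing.
sample = [Request(duration_ms=float(i), status=500 if i <= 2 else 200) for i in range(1, 101)]
print(red_summary(sample, window_seconds=10))
```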
Alert design
- Page humans for actionable, urgent issues.
- Runbooks linked from alert annotations.
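- Avoid flapping: require the condition to hold before firing (for example `for: 5m` in a Prometheus alerting rule), and use Alertmanager grouping and inhibition rules.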
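The effect of a hold period such as `for: 5m` can be sketched in plain Python: the alert fires only once the condition has been true continuously for the whole hold, so short blips never page anyone. This is a simplified model for illustration, not Prometheus's actual evaluation code. The function name `firing_times` and the sample data are hypothetical.

```python
from typing import List, Tuple


def firing_times(samples: List[Tuple[int, bool]], hold_seconds: int) -> List[int]:
    """Return the timestamps at which the alert is firing.

    samples: (timestamp_seconds, condition_breached) pairs, sorted by time.
    """
    pending_since = None  # when the current unbroken breach started
    firing = []
    for ts, breached in samples:
        if not breached:
            pending_since = None  # any recovery resets the hold timer
            continue
        if pending_since is None:
            pending_since = ts
        if ts - pending_since >= hold_seconds:
            firing.append(ts)
    return firing


# One evaluation per minute: a 2-minute blip, a recovery, then a sustained breach.
samples = [(0, True), (60, True), (120, False)] + [(t, True) for t in range(180, 600, 60)]
print(firing_times(samples, hold_seconds=300))  # [480, 540]
```

The blip at 0 to 60 seconds never fires, while the breach starting at 180 seconds fires once it has lasted five minutes.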
Interview scenario
“Error rate spiked after deploy”: compare canary vs stable error rates, check traces for downstream timeouts, roll back if there is no quick fix, then run a postmortem with action items.
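The canary-versus-stable comparison can be framed as a small decision function, sketched below. Everything here is an assumption for illustration: the name `should_roll_back`, the 2x ratio, the 1% absolute floor, and the 100-request minimum are hypothetical defaults, not values from any deployment tool.

```python
def should_roll_back(
    canary_errors: int,
    canary_total: int,
    stable_errors: int,
    stable_total: int,
    max_ratio: float = 2.0,      # hypothetical: canary may be at most 2x stable
    abs_floor: float = 0.01,     # hypothetical: ignore canary error rates under 1%
    min_requests: int = 100,     # hypothetical: below this, there is too little data to judge
) -> bool:
    """True when the canary's error rate is clearly worse than stable's."""
    if canary_total < min_requests or stable_total == 0:
        return False  # not enough traffic to decide; keep watching
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / stable_total
    # The floor keeps a near-zero stable rate from making any canary error look bad.
    threshold = max(stable_rate * max_ratio, abs_floor)
    return canary_rate > threshold


print(should_roll_back(50, 1000, 5, 10000))   # True: 5% on canary vs 0.05% on stable
print(should_roll_back(1, 1000, 10, 10000))   # False: both rates are about 0.1%
```

A `False` result only means the error rates do not justify a rollback on their own; traces for downstream timeouts still need checking.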
Questions to ask them
- On-call rotation and severity definitions?
- Who owns dashboards and SLO reviews?
- Log retention and PII handling?