Platform & cloud

Observability

Three pillars

Pillar	Tools (examples)	Use for
Metrics	Prometheus, Grafana, CloudWatch	Rates, saturation, RED/USE
Logs	Loki, ELK, CloudWatch Logs	Forensics, audit
Traces	Jaeger, Tempo, X-Ray	Latency across services

SLO thinking (shows maturity)

SLI: what you measure (e.g. availability, p99 latency).
SLO: the target for an SLI over a window (e.g. 99.9% over 30 days).
Error budget: the failure the SLO permits (100% minus the SLO); when it runs out, slow releases and prioritise reliability work.
Alert on symptoms (user pain), not every CPU blip.
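Worked example: the error-budget arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and the request counts in the usage lines are made-up illustration values, not real data.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO (or no traffic) leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


# 99.9% over 30 days leaves roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999, 30), 1))          # 43.2
# Illustrative numbers: 250 failures out of 1,000,000 requests.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```

Being able to quote "99.9% over 30 days is about 43 minutes" from memory tends to land well in interviews.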

RED / USE cheatsheet

RED (services): Rate, Errors, Duration.
USE (resources): Utilization, Saturation, Errors.
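A minimal sketch of the RED half, assuming you already have per-request records for a fixed window. The Request type, the "5xx counts as an error" rule, and the nearest-rank p99 are illustrative choices; in practice a metrics system such as Prometheus computes these from counters and histograms.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Rate, error ratio, and p99 duration for one window of requests."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    rank = math.ceil(len(durations) * 99 / 100)  # nearest-rank percentile
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p99_ms": durations[rank - 1],
    }


# Synthetic window: 100 requests in 10 s, durations 1..100 ms, two 500s.
sample = [Request(float(i), 500 if i in (50, 100) else 200) for i in range(1, 101)]
summary = red_summary(sample, window_seconds=10.0)
print(summary)
```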

Alert design

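Page humans only for actionable, urgent issues; everything else goes to a ticket or dashboard.
Link runbooks from alert annotations.
Avoid flapping and alert storms: require the condition to hold before firing (Prometheus for: 5m), group related alerts, and use inhibition rules to suppress downstream noise.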
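To show why a hold period stops flapping, here is a simplified model of a "for"-style clause. It is a sketch of the idea, not Prometheus's actual implementation; the class name HoldAlert is hypothetical, and the state names mirror the inactive / pending / firing states Prometheus reports.

```python
from typing import Optional


class HoldAlert:
    """Fires only after the condition has been continuously true for hold_seconds."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self.pending_since: Optional[float] = None

    def evaluate(self, condition: bool, now: float) -> str:
        if not condition:
            self.pending_since = None  # any healthy evaluation resets the timer
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.hold_seconds:
            return "firing"
        return "pending"


alert = HoldAlert(hold_seconds=300)  # analogous to for: 5m
timeline = [(0, True), (60, True), (120, False), (180, True), (480, True)]
states = [alert.evaluate(cond, t) for t, cond in timeline]
print(states)
```

The brief recovery at t=120 resets the timer, so a short blip never pages anyone; only the sustained breach from t=180 to t=480 fires.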

Interview scenario

“Error rate spiked after deploy”: compare canary and stable error rates, check traces for downstream timeouts, roll back if there is no quick fix, then write a postmortem with action items.
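The "compare canary vs stable" step could be sketched as a simple ratio check. Every threshold here (max_ratio, min_error_ratio, min_requests) is an assumed illustration value, as are the request counts in the usage lines; real canary analysis usually adds statistical tests and more signals than error rate alone.

```python
def should_roll_back(
    canary_errors: int,
    canary_total: int,
    stable_errors: int,
    stable_total: int,
    max_ratio: float = 2.0,         # assumption: canary may be at most 2x stable
    min_error_ratio: float = 0.01,  # assumption: ignore error ratios below 1%
    min_requests: int = 100,        # assumption: need this much canary traffic
) -> bool:
    """True if the canary's error ratio is clearly worse than stable's."""
    if canary_total < min_requests:
        return False  # not enough data to judge; keep watching
    canary_ratio = canary_errors / canary_total
    stable_ratio = stable_errors / stable_total if stable_total else 0.0
    return canary_ratio > max(stable_ratio * max_ratio, min_error_ratio)


print(should_roll_back(30, 1000, 50, 10_000))  # True: 3% vs 0.5%
print(should_roll_back(6, 1000, 50, 10_000))   # False: 0.6% is under the floor
print(should_roll_back(5, 20, 50, 10_000))     # False: too little canary traffic
```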

Questions to ask them

On-call rotation and severity definitions?
Who owns dashboards and SLO reviews?
Log retention and PII handling?
