Kubernetes Production Readiness Checklist for 2026
Running Kubernetes in dev is easy. Running it reliably in production requires deliberate engineering. This checklist covers what teams commonly miss before going live.
Workload Configuration
Resource Management
- CPU and memory requests set on every container
- Memory limits set (prevents a leaking container from exhausting node memory and getting neighboring pods OOM-killed)
- CPU limits set only if you understand throttling trade-offs
- Resource quotas per namespace
- LimitRanges to enforce defaults
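The request and limit items above can be audited mechanically. Below is a minimal sketch, assuming a pod spec has already been loaded into a Python dict (for example from `kubectl get pod -o json`); the function name `missing_resources` and the sample spec are hypothetical, and CPU limits are deliberately not required, in line with the trade-off noted above.

```python
def missing_resources(pod_spec: dict) -> list:
    """Return findings for containers lacking CPU/memory requests or a memory limit."""
    findings = []
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for key in ("cpu", "memory"):
            if key not in requests:
                findings.append(f"{container['name']}: no {key} request")
        if "memory" not in limits:
            findings.append(f"{container['name']}: no memory limit")
    return findings


# Hypothetical pod spec: "api" is fully specified, "sidecar" is not.
sample_pod_spec = {
    "containers": [
        {
            "name": "api",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"memory": "512Mi"},
            },
        },
        {"name": "sidecar", "resources": {"requests": {"cpu": "50m"}}},
    ]
}

for finding in missing_resources(sample_pod_spec):
    print(finding)
```

A check like this fits in CI against rendered manifests; inside the cluster, LimitRanges and resource quotas remain the enforcement mechanism.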
Health Checks
- Readiness probes on every service (controls traffic routing)
- Liveness probes with conservative thresholds (prevents restart storms)
- Startup probes for slow-starting applications
- Probes test actual functionality, not just an open TCP port
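As a rough illustration of the probe items, the sketch below flags containers that lack a readiness or liveness probe, or whose probe uses only `tcpSocket`. It assumes the container is a dict using the Kubernetes field names (`readinessProbe`, `livenessProbe`, `httpGet`, `tcpSocket`); the function name `probe_findings` is hypothetical.

```python
def probe_findings(container: dict) -> list:
    """Flag missing probes and probes that only verify an open TCP port."""
    findings = []
    name = container.get("name", "<unnamed>")
    for probe in ("readinessProbe", "livenessProbe"):
        spec = container.get(probe)
        if spec is None:
            findings.append(f"{name}: missing {probe}")
        elif "tcpSocket" in spec:
            findings.append(f"{name}: {probe} only checks that a TCP port is open")
    return findings


# Hypothetical container: HTTP readiness check, TCP-only liveness check.
sample_container = {
    "name": "web",
    "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
    "livenessProbe": {"tcpSocket": {"port": 8080}},
}

print(probe_findings(sample_container))
```

Startup probes are left out of the check because only slow-starting applications need them.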
Pod Configuration
- Run as non-root user
- Read-only root filesystem where possible
- No privilege escalation (allowPrivilegeEscalation: false)
- Explicit securityContext on every pod
- Anti-affinity rules for critical services (spread across nodes)
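The security-context items can be checked the same way. This is a sketch under the assumption that the pod spec is a dict using the Kubernetes field names `runAsNonRoot`, `readOnlyRootFilesystem` and `allowPrivilegeEscalation`; the function name `security_findings` is hypothetical. A container-level `runAsNonRoot` takes precedence over the pod-level value, which the lookup below mirrors.

```python
def security_findings(pod_spec: dict) -> list:
    """Flag containers that miss the hardening settings from the checklist."""
    pod_ctx = pod_spec.get("securityContext", {})
    findings = []
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext", {})
        name = container["name"]
        # Container-level value wins; fall back to the pod-level setting.
        run_as_non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        if run_as_non_root is not True:
            findings.append(f"{name}: runAsNonRoot is not true")
        if ctx.get("readOnlyRootFilesystem") is not True:
            findings.append(f"{name}: readOnlyRootFilesystem is not true")
        if ctx.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
    return findings


# Hypothetical pod: "api" is hardened, "worker" relies on defaults.
hardened_example = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [
        {
            "name": "api",
            "securityContext": {
                "readOnlyRootFilesystem": True,
                "allowPrivilegeEscalation": False,
            },
        },
        {"name": "worker"},
    ],
}

print(security_findings(hardened_example))
```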
Networking
Services & Ingress
- Internal services use ClusterIP (not NodePort)
- Ingress controller with TLS termination
- Certificate auto-renewal (cert-manager)
- Rate limiting at ingress level
- Connection timeouts configured
Network Policies
- Default deny ingress in sensitive namespaces
- Explicit allow rules for known traffic paths
- Egress restrictions for workloads that shouldn’t reach the internet
- Enforcement verified: policies only take effect with a CNI plugin that implements NetworkPolicy (e.g., Calico, Cilium)
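For the default-deny item, the manifest is small enough to generate. The sketch below builds a `networking.k8s.io/v1` NetworkPolicy as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The policy name `default-deny-ingress` and the namespace `payments` are placeholders chosen for illustration.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Listing Ingress with no ingress rules denies all inbound traffic.
            "policyTypes": ["Ingress"],
        },
    }


print(json.dumps(default_deny_ingress("payments"), indent=2))
```

Explicit allow rules are then added as separate policies, since NetworkPolicies are additive.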
Observability
Metrics
- Prometheus scraping all workloads
- RED metrics per service (Rate, Errors, Duration)
- Node-level metrics (node-exporter)
- kube-state-metrics deployed
- Grafana dashboards for critical services
Logging
- Structured logging (JSON) from applications
- Centralized log aggregation (Loki, EFK, CloudWatch)
- Log retention policy defined
- No sensitive data in logs
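Structured logging is easiest to see in code. This is a minimal sketch using only the Python standard library; the class name `JsonFormatter`, the field names (`ts`, `level`, `logger`, `message`) and the logger name `checkout` are illustrative choices, not a required schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log identifiers rather than payloads to keep sensitive data out of logs.
logger.info("order placed: %s", "A-1001")
```

One JSON object per line lets aggregators such as Loki or an EFK stack parse fields without custom regular expressions.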
Alerting
- SLO-based alerts (not just threshold alerts)
- On-call rotation configured
- Runbooks linked to every alert
- Alert fatigue reviewed monthly
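One common way to implement SLO-based alerts is to alert on error-budget burn rate rather than on a raw error threshold. The sketch below assumes that approach; the function names are hypothetical, and the default threshold of 14.4 is the value often cited for a one-hour window against a 30-day SLO (it corresponds to spending 2% of the budget in that hour), so treat it as a starting point rather than a rule.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    The short window stops the alert from firing long after the problem ended.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


# Hypothetical 99.9% SLO: 2% errors over the last hour and the last 5 minutes.
print(should_page(0.02, 0.02, slo_target=0.999))
# Same hour, but the last 5 minutes have recovered to 0.1% errors.
print(should_page(0.02, 0.001, slo_target=0.999))
```

The error ratios would come from the RED metrics above, for example failed requests divided by total requests over each window.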
Security
RBAC
- No cluster-admin for workloads
- Service accounts per workload (not default)
- Namespace-scoped roles preferred over ClusterRoles
- Regular audit of permissions
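A permissions audit can start with the first item: finding service accounts bound to `cluster-admin`. The sketch assumes the bindings were exported as JSON (for example the `items` list from `kubectl get clusterrolebindings,rolebindings -A -o json`); the function name and sample data are hypothetical.

```python
def cluster_admin_service_accounts(bindings: list) -> list:
    """List service accounts that are bound to the cluster-admin role."""
    offenders = []
    for binding in bindings:
        if binding.get("roleRef", {}).get("name") != "cluster-admin":
            continue
        # "subjects" may be absent or null on a binding with no subjects.
        for subject in binding.get("subjects") or []:
            if subject.get("kind") == "ServiceAccount":
                namespace = subject.get("namespace", "?")
                offenders.append(
                    f"{namespace}/{subject['name']} via {binding['metadata']['name']}"
                )
    return offenders


# Hypothetical export: one risky binding, one harmless one.
sample_bindings = [
    {
        "metadata": {"name": "ci-admin"},
        "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
        "subjects": [
            {"kind": "ServiceAccount", "name": "ci-runner", "namespace": "ci"},
            {"kind": "User", "name": "alice"},
        ],
    },
    {
        "metadata": {"name": "viewer"},
        "roleRef": {"kind": "ClusterRole", "name": "view"},
        "subjects": [{"kind": "ServiceAccount", "name": "app", "namespace": "prod"}],
    },
]

print(cluster_admin_service_accounts(sample_bindings))
```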
Secrets
- External secret management (Vault, AWS SM, GCP SM)
- No secret values written as literal environment variables in the pod spec (anyone who can read the manifest can read them)
- Secret rotation strategy documented
- Encryption at rest enabled for etcd
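One reading of the environment-variable item is that no secret value should appear inline in a manifest. The heuristic sketch below flags `env` entries that set a literal `value` and whose name looks secret-like. The name pattern and the function name `literal_secret_envs` are assumptions, and a name-based heuristic will miss secrets with innocuous names.

```python
import re

# Assumed naming convention for secret-like variables; adjust to your own.
SECRET_NAME = re.compile(r"PASSWORD|SECRET|TOKEN|API_?KEY", re.IGNORECASE)


def literal_secret_envs(pod_spec: dict) -> list:
    """Flag env vars with secret-like names that carry an inline value."""
    findings = []
    for container in pod_spec.get("containers", []):
        for env in container.get("env", []):
            # Entries using valueFrom (e.g. secretKeyRef) have no "value" key.
            if "value" in env and SECRET_NAME.search(env["name"]):
                findings.append(f"{container['name']}: {env['name']} is set inline")
    return findings


# Hypothetical pod spec with one inline secret and one referenced secret.
env_example = {
    "containers": [
        {
            "name": "api",
            "env": [
                {"name": "DB_PASSWORD", "value": "hunter2"},
                {"name": "API_TOKEN",
                 "valueFrom": {"secretKeyRef": {"name": "api", "key": "token"}}},
                {"name": "LOG_LEVEL", "value": "info"},
            ],
        }
    ]
}

print(literal_secret_envs(env_example))
```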
Image Security
- Images from trusted registries only
- Image scanning in CI pipeline (Trivy, Snyk)
- No latest tag in production
- Image pull policy: IfNotPresent or digest-based
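The tag rule can be enforced with a small image-reference check. This sketch treats a reference as acceptable when it is pinned by digest or carries an explicit tag other than `latest`; the function name `image_findings` is hypothetical, and the parsing is simplified (it looks for a tag only in the last path segment, so a registry port is not mistaken for a tag).

```python
def image_findings(image: str) -> list:
    """Flag image references that are untagged or use the latest tag."""
    if "@sha256:" in image:
        return []  # pinned by digest
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return [f"{image}: no tag, resolves to latest"]
    tag = last_segment.rsplit(":", 1)[1]
    if tag == "latest":
        return [f"{image}: uses the latest tag"]
    return []


# Hypothetical image references.
for ref in ("nginx", "nginx:latest", "registry.local:5000/app", "nginx:1.27.0"):
    print(ref, image_findings(ref))
```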
Reliability
High Availability
- Multiple replicas for stateless services (min 2, ideally 3)
- PodDisruptionBudgets on critical services
- Multi-AZ node pools
- Control plane HA (managed Kubernetes services typically handle this)
- etcd backup strategy (self-managed clusters)
Autoscaling
- HPA configured based on relevant metrics
- Cluster Autoscaler or Karpenter for node scaling
- Scale-down cooldown configured
- Load tested to verify scaling behavior
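When load testing HPA behavior, it helps to know what replica count to expect. The sketch below follows the scaling formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with the documented default tolerance of 0.1. It ignores stabilization windows, pod readiness and missing metrics; the function name and the default bounds are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Approximate the replica count the HPA would request."""
    ratio = current_metric / target_metric
    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 200m CPU each against a 100m target: scale out.
print(desired_replicas(4, current_metric=200, target_metric=100))
# 4 replicas at 105m against a 100m target: within tolerance, no change.
print(desired_replicas(4, current_metric=105, target_metric=100))
```

Comparing these predictions with what the cluster does under load is a quick way to spot a misconfigured metric or target.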
Disaster Recovery
- Cluster rebuild documented and tested
- Persistent volume backup strategy (Velero)
- GitOps repo is the source of truth
- Recovery time objective (RTO) defined
Deployment
Rollout Strategy
- Rolling updates with proper maxSurge/maxUnavailable
- Rollback procedure tested
- Canary or blue/green for critical services
- Deployment blocked if tests fail
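What counts as "proper" maxSurge/maxUnavailable depends on how the values resolve for a given replica count. The sketch below mirrors the documented rounding for Deployments: percentage maxSurge rounds up, percentage maxUnavailable rounds down, and both default to 25%. If both resolve to zero, the Deployment controller falls back to one unavailable pod. The function names are hypothetical.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute number or a percentage string to a pod count."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas
    # Integer arithmetic avoids floating-point rounding surprises.
    return -(-scaled // 100) if round_up else scaled // 100


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> dict:
    """Return the pod-count bounds a rolling update stays within."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        unavailable = 1  # otherwise the rollout could never make progress
    return {"max_pods": replicas + surge, "min_available": replicas - unavailable}


# Defaults on 10 replicas: up to 13 pods in total, at least 8 available.
print(rollout_bounds(10))
# Zero-downtime style settings on 4 replicas.
print(rollout_bounds(4, max_surge=1, max_unavailable=0))
```

For small replica counts the rounding matters: the default 25% maxUnavailable on 3 replicas resolves to 0, so the rollout proceeds by surge alone.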
GitOps
- All manifests in version control
- ArgoCD/Flux for continuous deployment
- Drift detection enabled
- Manual changes to cluster are flagged
Maintenance
Upgrades
- Kubernetes version within supported window
- Upgrade tested in staging first
- Node rotation strategy (cordon, drain, replace)
- Version skew policy understood
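The skew rule for kubelets can be expressed as a small check to run before an upgrade. This is a sketch under assumptions: the kubelet must not be newer than the API server, and the allowed lag defaults to three minor versions, which matches the upstream policy for recent releases as far as I know (older releases allowed two). Confirm the number against the version skew policy for your release. Function names are hypothetical.

```python
def minor(version: str) -> int:
    """Extract the minor version, e.g. 'v1.31.2' -> 31."""
    return int(version.lstrip("v").split(".")[1])


def kubelet_skew_ok(apiserver: str, kubelet: str, max_minor_behind: int = 3) -> bool:
    """True if the kubelet is no newer than, and not too far behind, the API server."""
    behind = minor(apiserver) - minor(kubelet)
    return 0 <= behind <= max_minor_behind


# Hypothetical versions as reported by `kubectl version` and `kubectl get nodes`.
print(kubelet_skew_ok("v1.31.0", "v1.29.4"))
print(kubelet_skew_ok("v1.31.0", "v1.32.0"))
```

Running this across all nodes before upgrading the control plane shows which node pools need rotating first.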
Cost
- Unused resources identified (Kubecost, cloud cost explorer)
- Right-sizing recommendations reviewed monthly
- Spot/preemptible nodes for fault-tolerant workloads
- Namespace cost attribution
Quick Self-Assessment
Count your checkmarks:
- 40+: Production-ready
- 30–39: Mostly ready, address gaps before scaling
- 20–29: Significant gaps — prioritize security and observability
- Under 20: Not production-ready — focus on fundamentals first
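The scoring bands above translate directly into code, which is handy if the checklist is tracked in a script or spreadsheet export. The function name `readiness_band` is hypothetical; the thresholds and labels are the ones listed above.

```python
def readiness_band(checked: int) -> str:
    """Map a count of completed checklist items to its readiness band."""
    if checked < 0:
        raise ValueError("checked must not be negative")
    if checked >= 40:
        return "Production-ready"
    if checked >= 30:
        return "Mostly ready, address gaps before scaling"
    if checked >= 20:
        return "Significant gaps: prioritize security and observability"
    return "Not production-ready: focus on fundamentals first"


print(readiness_band(34))
```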
This is a living checklist. Revisit quarterly as your platform matures and new failure modes emerge.