technical 20 May 2026 by FzlOps Team

Kubernetes Production Readiness Checklist for 2026

Running Kubernetes in dev is easy. Running it reliably in production requires deliberate engineering. This checklist covers what teams commonly miss before going live.

Workload Configuration

Resource Management

CPU and memory requests set on every container
Memory limits set (keeps one runaway container from exhausting node memory and getting neighboring pods OOM-killed)
CPU limits set only if you understand throttling trade-offs
Resource quotas per namespace
LimitRanges to enforce defaults
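
As a minimal sketch of the last two items, the manifests below pair a LimitRange with a ResourceQuota. The namespace name "payments" and every number are placeholders, not recommendations; size them from your own usage data.

```yaml
# Hypothetical namespace and values, shown only to illustrate the shape.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:             # applied when a container omits limits
        memory: 256Mi      # no default CPU limit, matching the throttling caveat above
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.memory: 16Gi
    pods: "50"
```

Once a quota covers a resource, pods that omit that request or limit are rejected, so the LimitRange defaults are what keep unannotated workloads deployable.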

Health Checks

Readiness probes on every service (controls traffic routing)
Liveness probes with conservative thresholds (prevents restart storms)
Startup probes for slow-starting applications
Probes test actual functionality, not just whether a TCP port is open
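
The three probe types above might look like this on a single container. The image, port, and the /healthz and /ready paths are assumptions; your application has to expose equivalent endpoints.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.4.2   # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:              # liveness and readiness wait until this succeeds
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 5
        failureThreshold: 30     # up to 30 x 5s = 150s to start
      readinessProbe:            # failing removes the pod from Service endpoints
        httpGet:
          path: /ready           # should exercise what request handling depends on
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
      livenessProbe:             # failing restarts the container, so keep it lenient
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 2
        failureThreshold: 6
```

Keeping the liveness endpoint free of downstream dependency checks is one way to avoid a database outage turning into a fleet-wide restart loop.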

Pod Configuration

Run as non-root user
Read-only root filesystem where possible
No privilege escalation (allowPrivilegeEscalation: false)
Explicit securityContext on every pod
Anti-affinity rules for critical services (spread across nodes)
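
A rough Deployment sketch that covers the pod configuration items above. The app label, image, and UID are illustrative, and the emptyDir mount assumes the application only needs to write to /tmp.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001                 # arbitrary non-zero UID
      affinity:
        podAntiAffinity:
          # "preferred" spreads replicas across nodes without blocking scheduling
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: api
      containers:
        - name: api
          image: registry.example.com/api:1.4.2   # hypothetical image
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
          volumeMounts:
            - name: tmp
              mountPath: /tmp            # writable scratch space
      volumes:
        - name: tmp
          emptyDir: {}
```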

Networking

Services & Ingress

Internal services use ClusterIP (not NodePort)
Ingress controller with TLS termination
Certificate auto-renewal (cert-manager)
Rate limiting at ingress level
Connection timeouts configured
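
One possible shape for these items, assuming the ingress-nginx controller and a cert-manager ClusterIssuer named letsencrypt-prod (both assumptions; other controllers use different annotations). Hostnames and limits are placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: ClusterIP            # reachable only inside the cluster
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod       # hypothetical issuer name
    nginx.ingress.kubernetes.io/limit-rps: "20"            # per-client-IP rate limit
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "30"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-example-com-tls   # cert-manager creates and renews this Secret
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```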

Network Policies

Default deny ingress in sensitive namespaces
Explicit allow rules for known traffic paths
Egress restrictions for workloads that shouldn’t reach the internet
Policies verified on a CNI plugin that actually enforces them (Calico, Cilium); without one, NetworkPolicy objects have no effect
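
The policies above could be sketched as follows. The namespace names, labels, and port are hypothetical; the third policy limits egress to in-cluster destinations, which is only one of several ways to keep a workload off the internet.

```yaml
# 1. Deny all ingress to every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
# 2. Explicitly allow the ingress controller namespace to reach the api pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-controller-to-api
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
---
# 3. Allow egress only to pods inside the cluster (this includes cluster DNS pods).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-egress-in-cluster-only
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}   # empty selector matches all namespaces
```

A quick check that enforcement works is to run a throwaway pod in another namespace and confirm that a request to the api pods times out.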

Observability

Metrics

Prometheus scraping all workloads
RED metrics per service (Rate, Errors, Duration)
Node-level metrics (node-exporter)
kube-state-metrics deployed
Grafana dashboards for critical services
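
If Prometheus runs under the Prometheus Operator (an assumption; plain Prometheus uses scrape_configs instead), scraping a workload is usually declared with a ServiceMonitor. The label values and port name below are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  labels:
    release: prometheus     # must match your Prometheus serviceMonitorSelector; value is an assumption
spec:
  selector:
    matchLabels:
      app: api              # selects the Service, not the pods
  endpoints:
    - port: metrics         # named port on the Service
      path: /metrics
      interval: 30s
```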

Logging

Structured logging (JSON) from applications
Centralized log aggregation (Loki, EFK, CloudWatch)
Log retention policy defined
No sensitive data in logs

Alerting

SLO-based alerts (not just threshold alerts)
On-call rotation configured
Runbooks linked to every alert
Alert fatigue reviewed monthly
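
To make the SLO-based item concrete, here is a sketch of a fast-burn alert in Prometheus rule-file format for a 99.9% availability target over 30 days. The metric name http_requests_total, its code label, and the runbook URL are assumptions about your instrumentation.

```yaml
groups:
  - name: api-slo
    rules:
      - alert: ApiErrorBudgetFastBurn
        # Error budget is 0.1% (0.001). A 14.4x burn rate over 1h consumes
        # about 2% of a 30-day budget; the 5m window confirms it is still happening.
        expr: |
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{job="api"}[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{job="api",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="api"}[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "api is burning its error budget at more than 14.4x the sustainable rate"
          runbook_url: https://runbooks.example.com/api/error-budget-burn   # hypothetical
```

Unlike a fixed threshold such as "5xx rate above 1%", this pages only when the failure rate threatens the objective, which tends to cut down on noisy alerts.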

Security

RBAC

No cluster-admin for workloads
Service accounts per workload (not default)
Namespace-scoped roles preferred over ClusterRoles
Regular audit of permissions
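
A least-privilege setup for one workload might look like the following. All names are hypothetical, and the single permission (reading one ConfigMap) stands in for whatever the workload genuinely needs.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: payments
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role                      # namespace-scoped, unlike ClusterRole
metadata:
  name: api-config-reader
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["api-config"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-config-reader
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: api
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-config-reader
```

The workload opts in by setting serviceAccountName: api in its pod spec; anything left on the default service account deserves a look during the permissions audit.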

Secrets

External secret management (Vault, AWS Secrets Manager, Google Cloud Secret Manager)
No secret values hard-coded as plain environment variables in the pod spec
Secret rotation strategy documented
Encryption at rest enabled for etcd
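
On self-managed clusters, encryption at rest is typically switched on by pointing the API server's --encryption-provider-config flag at a file like the sketch below. Managed offerings usually expose this as a KMS setting instead. The key name is arbitrary and the key value is deliberately left as a placeholder.

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:                 # first provider is used for writes
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>
      - identity: {}            # lets the API server still read not-yet-encrypted data
```

Existing Secrets stay unencrypted in etcd until they are rewritten, so plan a one-time rewrite after enabling this.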

Image Security

Images from trusted registries only
Image scanning in CI pipeline (Trivy, Snyk)
No latest tag in production
Images pinned by digest, or immutable version tags with imagePullPolicy: IfNotPresent
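
For illustration, a digest-pinned image reference looks like this. The registry and image name are hypothetical, and the digest is left as a placeholder to fill in from your registry or CI output.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
    - name: api
      # A digest identifies exact image content, so a re-pushed tag cannot change what runs.
      image: registry.example.com/api@sha256:<digest>
      imagePullPolicy: IfNotPresent
```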

Reliability

High Availability

Multiple replicas for stateless services (min 2, ideally 3)
PodDisruptionBudgets on critical services
Multi-AZ node pools
Control plane HA (managed Kubernetes services typically handle this)
etcd backup strategy (self-managed clusters)
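
A PodDisruptionBudget for a three-replica service could be as small as this; the name and label are placeholders.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 2          # with 3 replicas, allows one voluntary eviction at a time
  selector:
    matchLabels:
      app: api
```

PDBs only govern voluntary disruptions such as node drains. They do nothing for node crashes, which is why the replica count and multi-AZ items still matter.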

Autoscaling

HPA configured based on relevant metrics
Cluster Autoscaler or Karpenter for node scaling
Scale-down cooldown (stabilization window) configured
Load tested to verify scaling behavior
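
A sketch of an HPA with an explicit scale-down window. The CPU target and replica bounds are illustrative; CPU is only a "relevant metric" if it actually tracks load for your service.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # percentage of the CPU request
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # act on the highest recommendation of the last 5 min
      policies:
        - type: Percent
          value: 50                     # remove at most half the replicas per minute
          periodSeconds: 60
```

Utilization targets are computed against requests, so this depends on the CPU requests from the resource management section being set.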

Disaster Recovery

Cluster rebuild documented and tested
Persistent volume backup strategy (Velero)
GitOps repo is the source of truth
Recovery time objective (RTO) defined
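
With Velero installed (an assumption, along with the namespace names and timings), a recurring backup can be declared as a Schedule resource:

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-payments
  namespace: velero
spec:
  schedule: "0 2 * * *"        # cron syntax: every day at 02:00
  template:
    includedNamespaces:
      - payments
    snapshotVolumes: true      # include persistent volume snapshots
    ttl: 720h0m0s              # keep each backup for 30 days
```

A schedule only proves that backups are being taken. Restoring one into a scratch cluster is what shows whether the RTO is realistic.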

Deployment

Rollout Strategy

Rolling updates with proper maxSurge/maxUnavailable
Rollback procedure tested
Canary or blue/green for critical services
Deployment blocked if tests fail
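
The rolling update settings might be sketched like this. The values suit a small service that should never dip below full capacity and are not universal defaults.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  revisionHistoryLimit: 5          # old ReplicaSets kept for rollback
  minReadySeconds: 10              # pod must stay ready this long before it counts
  progressDeadlineSeconds: 300     # rollout is marked failed if it stalls
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # at most one extra pod during the rollout
      maxUnavailable: 0            # never go below the desired replica count
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4.2   # hypothetical image
```

With maxUnavailable: 0, a new pod that never becomes ready stalls the rollout instead of taking capacity away, which makes readiness probes the real gate here.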

GitOps

All manifests in version control
ArgoCD/Flux for continuous deployment
Drift detection enabled
Manual changes to cluster are flagged
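
Assuming Argo CD (Flux has its own equivalents), an Application along these lines ties a namespace to a path in Git. The repository URL, path, and names are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/manifests.git
    targetRevision: main
    path: apps/api/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true       # delete resources that were removed from Git
      selfHeal: true    # revert manual changes made directly in the cluster
```

If you would rather flag manual changes than revert them, leave out the automated block; drift then shows up as an OutOfSync status that you can alert on.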

Maintenance

Upgrades

Kubernetes version within supported window
Upgrade tested in staging first
Node rotation strategy (cordon, drain, replace)
Version skew policy understood

Cost

Unused resources identified (Kubecost, cloud cost explorer)
Right-sizing recommendations reviewed monthly
Spot/preemptible nodes for fault-tolerant workloads
Namespace cost attribution

Quick Self-Assessment

Count your checkmarks across the 77 items above:

40+: Production-ready
30–39: Mostly ready, address gaps before scaling
20–29: Significant gaps — prioritize security and observability
Under 20: Not production-ready — focus on fundamentals first

This is a living checklist. Revisit quarterly as your platform matures and new failure modes emerge.
