Kubernetes Operations Checklist — 34 Must-Check Items Before Production Deploy
A 34-item checklist for running Kubernetes clusters reliably in production. Organized by resources, availability, security, network, storage, monitoring, deploy process, and cost.
TestForge Team
Why You Need a Checklist
Kubernetes has so many features that the gap between “it works” and “production-ready” is enormous.
70% of incidents come from predictable configuration gaps.
1. Resource Management
- Set `resources.requests` and `resources.limits` on every Deployment
- Configure HorizontalPodAutoscaler (HPA) based on CPU/memory
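- Set PodDisruptionBudget (PDB) to keep a minimum number of Pods available during voluntary disruptions such as node drains and cluster upgrades (rolling updates are governed by the Deployment's `maxUnavailable`, not the PDB)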
- Use LimitRange to set namespace-level default resource limits
- Use ResourceQuota to cap total resource usage per namespace
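The first two items in the list above could look roughly like the following. This is a minimal sketch, not a recommended sizing: the image name, the CPU/memory figures, and the 70% target are illustrative assumptions, and the HPA needs metrics-server (or another metrics API provider) installed to work.

```yaml
# Sketch only: image, sizes, and thresholds are assumptions for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  # replicas is omitted on purpose: the HPA below owns the replica count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.4.2  # hypothetical image
          resources:
            requests:          # what the scheduler reserves for the container
              cpu: 250m
              memory: 256Mi
            limits:            # ceiling enforced at runtime
              cpu: "1"
              memory: 512Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, not the limit
```

Utilization targets are computed against `requests`, which is one more reason to set requests on every container before enabling the HPA.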
# PodDisruptionBudget example
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2 # Keep at least 2 Pods available (run 3+ replicas, or node drains will block)
  selector:
    matchLabels:
      app: my-app
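For the LimitRange and ResourceQuota items, a namespace-level pair might be sketched as below. The namespace `team-a` and every number here are hypothetical; real values depend on the workloads in that namespace.

```yaml
# Sketch only: namespace name and all quantities are assumptions.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:             # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
```

Once a quota on CPU or memory exists, Pods without requests/limits are rejected in that namespace, so the LimitRange defaults are what keep unannotated workloads schedulable.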
2. Availability
- Set `replicas` to at least 2 (a single Pod is a SPOF)
- Use `topologySpreadConstraints` to distribute across availability zones
- Configure RollingUpdate strategy (`maxSurge`, `maxUnavailable`)
- Set Liveness and Readiness Probes
- Configure `terminationGracePeriodSeconds` (default is 30s)
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
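The rolling update, probe, and grace period items from the list above can be combined in one Deployment. The sketch below assumes the application listens on port 8080 and exposes `/ready` and `/healthz` endpoints; those paths, the port, the image, and the timing values are all hypothetical.

```yaml
# Sketch only: port, probe paths, image, and timings are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra Pod during a rollout
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 45   # time allowed between SIGTERM and SIGKILL
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.4.2  # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:       # gates traffic; failing removes the Pod from Service endpoints
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:        # restarts the container when it fails
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 3
```

With `maxUnavailable: 0`, a rollout only proceeds as new Pods pass their readiness probe, so a broken release stalls instead of replacing healthy Pods.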
3. Security
- Set `runAsNonRoot: true`
- Set `readOnlyRootFilesystem: true`
- Set `allowPrivilegeEscalation: false`
- Apply principle of least privilege to ServiceAccounts (RBAC)
- Mount Secrets as files rather than environment variables
- Use NetworkPolicy to restrict inter-Pod communication
- Scan container images for vulnerabilities (Trivy, Snyk)
# Container-level securityContext (set under spec.containers[])
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
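For the RBAC and Secret items, one possible shape is a dedicated ServiceAccount bound to a narrowly scoped Role, plus a Secret mounted as a read-only volume. All names here (`prod`, `my-app-config`, `my-app-db`, the mount path, the image) are hypothetical, and the single `get` permission is only an example of scoping by `resourceNames`.

```yaml
# Sketch only: namespace, resource names, and paths are assumptions.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: prod
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-app-config-reader
  namespace: prod
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["my-app-config"]   # only this one object
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-config-reader
  namespace: prod
subjects:
  - kind: ServiceAccount
    name: my-app
    namespace: prod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: my-app-config-reader
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: prod
spec:
  serviceAccountName: my-app
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.4.2  # hypothetical image
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/secrets/db   # each Secret key becomes a file here
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: my-app-db
```

Files mounted this way are updated when the Secret changes (with some delay), whereas environment variables are fixed at container start and are easier to leak through process listings and crash dumps.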
4. Network
- Configure TLS on Ingress (cert-manager)
- Default Service type to `ClusterIP`; use `LoadBalancer` only when external exposure is required
- Use NetworkPolicy to control cross-namespace access
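- Check Pod DNS resolver settings (`ndots` configuration); the default `ndots:5` makes the resolver try the cluster search domains first, which adds extra lookups for external hostnames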
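The TLS and NetworkPolicy items might be sketched as follows. This assumes an NGINX ingress controller running in a namespace named `ingress-nginx`, a cert-manager ClusterIssuer called `letsencrypt-prod`, and an application Pod listening on 8080; all of those, along with the hostname, are hypothetical. NetworkPolicy is only enforced if the cluster's CNI plugin supports it.

```yaml
# Sketch only: namespaces, issuer, host, and ports are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}            # selects every Pod in the namespace
  policyTypes: ["Ingress"]   # no ingress rules listed, so all inbound traffic is denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-controller
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: prod
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # hypothetical ClusterIssuer
spec:
  ingressClassName: nginx
  tls:
    - hosts: ["app.example.com"]
      secretName: my-app-tls   # cert-manager writes the certificate into this Secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```

Starting from a default-deny policy and adding explicit allows keeps cross-namespace access visible in version control rather than implicit.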
5. Storage
- Verify PersistentVolume `reclaimPolicy` (`Retain` is recommended)
- Enable `allowVolumeExpansion: true` on StorageClass
- Configure `volumeClaimTemplates` when using StatefulSets
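The three storage items above fit together roughly like this. The provisioner shown is the AWS EBS CSI driver purely as an assumption; substitute whatever your cluster uses. The StatefulSet name, image, mount path, and size are also hypothetical, and `serviceName` expects a matching headless Service that is not shown here.

```yaml
# Sketch only: provisioner, names, and sizes are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-retain
provisioner: ebs.csi.aws.com        # assumption: replace with your cluster's CSI driver
reclaimPolicy: Retain               # keep the underlying volume when the PVC is deleted
allowVolumeExpansion: true          # allow growing PVCs by editing their size
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db
spec:
  serviceName: my-db                # headless Service, defined separately
  replicas: 3
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
    spec:
      containers:
        - name: my-db
          image: registry.example.com/my-db:1.0.0  # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /var/lib/data
  volumeClaimTemplates:             # one PVC per replica: data-my-db-0, data-my-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-retain
        resources:
          requests:
            storage: 20Gi
```

With `Retain`, released volumes are not cleaned up automatically, so deleting them becomes a deliberate manual step.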
6. Monitoring & Alerts
- Install Prometheus + Grafana
- Set up critical alerts: Pod restarts, Node NotReady, HPA at max replicas
- Collect logs: Fluent Bit → Elasticsearch/Loki
- Verify `kubectl top nodes` and `kubectl top pods` work (both require metrics-server)
# Example alert (Prometheus Alert Rule)
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical
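The other two critical alerts named in the list, Node NotReady and HPA at max replicas, could be written along these lines. This sketch assumes kube-state-metrics v2 metric names are being scraped; the group name, the `for` durations, and the severities are illustrative choices.

```yaml
# Sketch only: assumes kube-state-metrics v2; durations and severities are assumptions.
groups:
  - name: kubernetes-critical
    rules:
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
      - alert: HPAAtMaxReplicas
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} in {{ $labels.namespace }} is at max replicas"
```

An HPA pinned at its maximum for a sustained period usually means either `maxReplicas` is too low or the workload needs larger requests, which is why it is worth alerting on before latency degrades.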
7. Deploy Process
- Never use the `latest` tag; pin version tags
- Set `imagePullPolicy: Always`
- Use Helm Chart or Kustomize to separate config per environment
- Apply GitOps (ArgoCD/Flux)
- Document and test rollback procedures
# Quick rollback
kubectl rollout undo deployment/my-app
# Roll back to a specific revision
kubectl rollout undo deployment/my-app --to-revision=3
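For the per-environment config item, a Kustomize overlay is one way to pin the image tag and replica count for production without touching the shared base. The directory layout (`base/`, `overlays/prod/`), the image name, and the values below are hypothetical.

```yaml
# overlays/prod/kustomization.yaml (hypothetical path)
# Sketch only: layout, image, and counts are assumptions.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: prod
resources:
  - ../../base                             # shared Deployment, Service, etc.
images:
  - name: registry.example.com/my-app      # image name as written in the base manifests
    newTag: 1.4.2                          # pinned version for this environment
replicas:
  - name: my-app
    count: 3
```

Rendering with `kubectl kustomize overlays/prod` before applying with `kubectl apply -k overlays/prod` makes the exact manifests reviewable, and a GitOps tool can point at the same overlay directory.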
8. Cost Optimization
- Use Spot/Preemptible instances for non-critical workloads
- Clean up unused namespaces and resources
- Use VPA (VerticalPodAutoscaler) to right-size requests
- Configure Cluster Autoscaler
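For right-sizing requests, VPA can be run in recommendation-only mode so it suggests values without evicting Pods. The sketch below assumes the VPA components and CRDs are installed in the cluster (they are a separate add-on) and targets a hypothetical Deployment named `my-app`.

```yaml
# Sketch only: assumes the VPA add-on is installed; target name is an assumption.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"    # only publish recommendations; do not modify running Pods
```

Recommendations then appear in the object's status (`kubectl describe vpa my-app-vpa`). Keeping the mode at `"Off"` also avoids conflicts when an HPA already scales the same workload on CPU or memory.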
Quick Cluster Health Check
# Node status
kubectl get nodes -o wide
# Pods that are not Running/Completed
kubectl get pods -A | grep -v Running | grep -v Completed
# Resource utilization
kubectl top nodes
kubectl top pods -A --sort-by=memory
# Recent warning events
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'