TestForge Blog

Kubernetes Operations Checklist — 34 Must-Check Items Before Production Deploy

A 34-item checklist for running Kubernetes clusters reliably in production, organized by resources, availability, security, network, storage, monitoring, deploy process, and cost.

TestForge Team

Why You Need a Checklist

Kubernetes has so many features that the gap between “it works” and “production-ready” is enormous.
70% of incidents come from predictable configuration gaps.


1. Resource Management

  • Set resources.requests and resources.limits on every Deployment
  • Configure HorizontalPodAutoscaler (HPA) based on CPU/memory
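  • Set PodDisruptionBudget (PDB): keeps a minimum number of Pods available during voluntary disruptions such as node drains and cluster upgrades (Deployment rolling updates are governed by maxSurge/maxUnavailable instead)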
  • Use LimitRange to set namespace-level default resource limits
  • Use ResourceQuota to cap total resource usage per namespace
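
The two namespace-level items above might look like the following. This is a minimal sketch: the namespace name and every number are placeholder assumptions to be tuned per team, not recommended values.

```yaml
# LimitRange: defaults applied to containers that omit requests/limits
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-namespace        # hypothetical namespace
spec:
  limits:
  - type: Container
    defaultRequest:              # used when a container sets no request
      cpu: 100m
      memory: 128Mi
    default:                     # used when a container sets no limit
      cpu: 500m
      memory: 512Mi
---
# ResourceQuota: hard cap on the namespace's total consumption
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: my-namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "50"
```

Once a quota on requests or limits exists, Pods that specify neither are rejected unless a LimitRange supplies defaults, so the two objects are usually deployed together.
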
# PodDisruptionBudget example
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2  # Keep at least 2 Pods available
  selector:
    matchLabels:
      app: my-app
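
For the first two items in the list, a Deployment with explicit requests and limits plus a CPU-based HPA could be sketched as below. The image name, tag, replica bounds, and the 70% target are illustrative assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.4.2   # hypothetical image and tag
        resources:
          requests:            # what the scheduler reserves
            cpu: 250m
            memory: 256Mi
          limits:              # hard ceiling enforced at runtime
            cpu: 500m
            memory: 512Mi
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the CPU request, not the limit
```

Utilization targets are computed against requests, which is one reason the HPA item depends on the requests item being done first.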

2. Availability

  • Set replicas to at least 2 (a single Pod is a single point of failure)
  • Use topologySpreadConstraints to distribute across availability zones
  • Configure RollingUpdate strategy (maxSurge, maxUnavailable)
  • Set Liveness and Readiness Probes
  • Configure terminationGracePeriodSeconds (default is 30s)
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: my-app
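
The remaining availability items fit in a single Deployment spec. The sketch below assumes the application serves /ready and /healthz on port 8080; those paths, the port, the image, and the timing values are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra Pod during a rollout
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 45   # time allowed for in-flight requests to drain
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.4.2   # hypothetical image and tag
        ports:
        - containerPort: 8080
        readinessProbe:        # gates traffic: failing Pods leave Service endpoints
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 3
        livenessProbe:         # restarts the container when it stops responding
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
```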

3. Security

  • Set runAsNonRoot: true
  • Set readOnlyRootFilesystem: true
  • Set allowPrivilegeEscalation: false
  • Apply principle of least privilege to ServiceAccounts (RBAC)
  • Mount Secrets as files rather than environment variables
  • Use NetworkPolicy to restrict inter-Pod communication
  • Scan container images for vulnerabilities (Trivy, Snyk)
# Container-level securityContext (set under each entry in spec.containers)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
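
The ServiceAccount and Secret items could be sketched as follows: a Role that grants read-only access to ConfigMaps, bound to a dedicated ServiceAccount, and a Pod that reads its credentials from a mounted file. All object names, the namespace, and the mount path are assumptions for illustration.

```yaml
# Least privilege: read-only access to ConfigMaps in one namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-app-config-reader
  namespace: my-namespace          # hypothetical namespace
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-config-reader
  namespace: my-namespace
subjects:
- kind: ServiceAccount
  name: my-app                     # hypothetical dedicated ServiceAccount
  namespace: my-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: my-app-config-reader
---
# Secret mounted as read-only files instead of environment variables
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: my-namespace
spec:
  serviceAccountName: my-app
  containers:
  - name: my-app
    image: registry.example.com/my-app:1.4.2   # hypothetical image and tag
    volumeMounts:
    - name: db-credentials
      mountPath: /etc/secrets/db   # each Secret key becomes a file here
      readOnly: true
  volumes:
  - name: db-credentials
    secret:
      secretName: db-credentials   # hypothetical Secret
```

File mounts keep secret values out of the process environment, and mounted Secrets are refreshed when the Secret object changes, whereas environment variables are fixed at container start.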

4. Network

  • Configure TLS on Ingress (cert-manager)
  • Default Service type to ClusterIP; use LoadBalancer only when external exposure is required
  • Use NetworkPolicy to control cross-namespace access
  • Check DNS resolution settings (the ndots option in dnsConfig, and the CoreDNS cache TTL)
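
As a sketch of the TLS and cross-namespace items: an Ingress that asks cert-manager for a certificate, and a NetworkPolicy that admits traffic only from one other namespace. The ClusterIssuer name, ingress class, hostname, namespaces, and ports are all assumptions.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: backend
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # hypothetical ClusterIssuer
spec:
  ingressClassName: nginx          # assumes an NGINX ingress controller
  tls:
  - hosts:
    - app.example.com
    secretName: my-app-tls         # cert-manager stores the certificate here
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80
---
# Allow ingress to my-app Pods only from Pods in the "frontend" namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: frontend   # label set automatically on namespaces
    ports:
    - protocol: TCP
      port: 8080
```

NetworkPolicy objects only take effect when the cluster's CNI plugin enforces them, so it is worth confirming that before relying on the policy.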

5. Storage

  • Verify the PersistentVolume reclaim policy (Retain is recommended for data that must survive PVC deletion; dynamically provisioned volumes default to Delete)
  • Enable allowVolumeExpansion: true on StorageClass
  • Configure volumeClaimTemplates when using StatefulSets
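
The three storage items can be combined in one sketch: a StorageClass with Retain and expansion enabled, and a StatefulSet that claims a volume per replica from it. The provisioner shown assumes the AWS EBS CSI driver; substitute the one your cluster uses. Names and sizes are placeholders.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-retain
provisioner: ebs.csi.aws.com       # assumption: AWS EBS CSI driver
reclaimPolicy: Retain              # keep the volume after its PVC is deleted
allowVolumeExpansion: true         # allow growing PVCs in place
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db
spec:
  serviceName: my-db
  replicas: 3
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
    spec:
      containers:
      - name: my-db
        image: registry.example.com/my-db:1.0.0   # hypothetical image and tag
        volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumeClaimTemplates:            # one PVC per replica: data-my-db-0, data-my-db-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-retain
      resources:
        requests:
          storage: 20Gi
```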

6. Monitoring & Alerts

  • Install Prometheus + Grafana
  • Set up critical alerts: Pod restarts, Node NotReady, HPA at max replicas
  • Collect logs: Fluent Bit → Elasticsearch/Loki
  • Verify kubectl top nodes and kubectl top pods work (both require metrics-server or another Metrics API provider)
# Example alert (Prometheus Alert Rule)
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical
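
The other two alerts named in the list could be written along these lines. This sketch assumes kube-state-metrics v2 or later is being scraped (the metric names come from it); the group name, durations, and severities are illustrative choices.

```yaml
groups:
- name: kubernetes-critical
  rules:
  - alert: NodeNotReady
    # The Ready condition's "true" series is 0 while the node is NotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
  - alert: HPAAtMaxReplicas
    # Sustained time at the ceiling means the HPA can no longer absorb load
    expr: |
      kube_horizontalpodautoscaler_status_current_replicas
        >= kube_horizontalpodautoscaler_spec_max_replicas
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "HPA {{ $labels.horizontalpodautoscaler }} has run at max replicas for 15 minutes"
```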

7. Deploy Process

  • Never use the latest tag — pin version tags
  • Set imagePullPolicy: Always
  • Use Helm Chart or Kustomize to separate config per environment
  • Apply GitOps (ArgoCD/Flux)
  • Document and test rollback procedures
# Quick rollback
kubectl rollout undo deployment/my-app

# Roll back to a specific revision
kubectl rollout undo deployment/my-app --to-revision=3
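
For the per-environment config item, a Kustomize overlay might look like this. It is a sketch assuming a base directory at ../../base that defines the my-app Deployment; the file path, namespace, image name, and tag are hypothetical.

```yaml
# overlays/production/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
- ../../base                       # shared manifests for all environments
images:
- name: registry.example.com/my-app
  newTag: 1.4.2                    # pinned version tag, never "latest"
patches:
- target:
    kind: Deployment
    name: my-app
  patch: |-
    - op: replace
      path: /spec/replicas
      value: 3
```

With the pinned tag living in Git, a GitOps tool such as ArgoCD or Flux can apply the overlay, and rolling back becomes a revert of the commit that changed newTag.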

8. Cost Optimization

  • Use Spot/Preemptible instances for non-critical workloads
  • Clean up unused namespaces and resources
  • Use VPA (VerticalPodAutoscaler) to right-size requests (do not let it act on the same CPU/memory metrics an HPA scales on)
  • Configure Cluster Autoscaler
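
One low-risk way to start with VPA is recommendation-only mode, sketched below. It assumes the VPA components and CRDs are installed in the cluster (they are not part of core Kubernetes); the object names are placeholders.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"    # compute recommendations only; never evict or resize Pods
```

In this mode the recommendations appear in the object's status (for example via kubectl describe vpa my-app-vpa) and can be copied into the Deployment's requests by hand.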

Quick Cluster Health Check

# Node status
kubectl get nodes -o wide

# Pods that are not Running/Completed
kubectl get pods -A | grep -v Running | grep -v Completed

# Resource utilization
kubectl top nodes
kubectl top pods -A --sort-by=memory

# Recent warning events
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'