Kubernetes Operations Checklist — 34 Must-Check Items Before Production Deploy

Why You Need a Checklist

Kubernetes has so many features that the gap between “it works” and “production-ready” is enormous.
70% of incidents come from predictable configuration gaps.

1. Resource Management

Set resources.requests and resources.limits on every container in every Deployment
Configure HorizontalPodAutoscaler (HPA) based on CPU/memory
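Set a PodDisruptionBudget (PDB) to keep a minimum number of Pods available during voluntary disruptions such as node drains and cluster upgrades (rolling updates are governed by maxSurge and maxUnavailable instead)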
Use LimitRange to set namespace-level default resource limits
Use ResourceQuota to cap total resource usage per namespace
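
As a minimal sketch of the first two items, the manifests below set requests and limits on a container and scale the Deployment on average CPU utilization. The image reference and every numeric value are illustrative assumptions, not recommendations; the name my-app is reused from the examples in this checklist.

```yaml
# Deployment with explicit requests and limits (all values are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.4.2  # hypothetical image
          resources:
            requests:        # what the scheduler reserves for the container
              cpu: 250m
              memory: 256Mi
            limits:          # hard ceiling enforced at runtime
              cpu: "1"
              memory: 512Mi
---
# HPA that scales the Deployment on average CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request
```

Utilization targets are computed against resources.requests, so the HPA cannot act on CPU or memory for containers that have no requests set.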

# PodDisruptionBudget example
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2  # Keep at least 2 Pods available
  selector:
    matchLabels:
      app: my-app
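
The LimitRange and ResourceQuota items from the list above could look roughly like this. The namespace team-a, the object names, and all quantities are hypothetical and should be sized from real usage data.

```yaml
# Defaults applied to containers that do not declare their own requests/limits
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a          # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # used when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:               # used when a container omits limits
        cpu: 500m
        memory: 256Mi
---
# Upper bound on what the whole namespace may request
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
```

Once a ResourceQuota covers CPU or memory, Pods that specify neither requests nor limits for that resource are rejected, which is one reason to pair it with a LimitRange that supplies defaults.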

2. Availability

Set replicas to at least 2 (a single Pod is a SPOF)
Use topologySpreadConstraints to distribute across availability zones
Configure RollingUpdate strategy (maxSurge, maxUnavailable)
Set Liveness and Readiness Probes
Configure terminationGracePeriodSeconds (default is 30s)

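# Pod template fragment: spread my-app Pods evenly across zones
topologySpreadConstraints:
- maxSkew: 1                                # zones may differ by at most 1 Pod
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule          # leave the Pod Pending rather than break the spread
  labelSelector:
    matchLabels:
      app: my-app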
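
The remaining availability items (replica count, rolling update strategy, probes, and termination grace period) can be sketched in a single Deployment. The probe paths /ready and /healthz, port 8080, the image reference, and the timing values are assumptions for illustration; use whatever endpoints the application actually exposes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3                      # more than one Pod, so no single point of failure
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # at most 1 extra Pod during an update
      maxUnavailable: 0            # never drop below the desired replica count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 45   # time allowed for graceful shutdown after SIGTERM
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.4.2  # hypothetical image
          ports:
            - containerPort: 8080
          readinessProbe:          # gates traffic from Services
            httpGet:
              path: /ready         # assumed endpoint
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:           # restarts the container when it fails
            httpGet:
              path: /healthz       # assumed endpoint
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 3
```

Keeping the liveness probe more tolerant than the readiness probe is a common design choice: a slow Pod is taken out of rotation first and restarted only if it stays unhealthy.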

3. Security

Set runAsNonRoot: true
Set readOnlyRootFilesystem: true
Set allowPrivilegeEscalation: false
Apply principle of least privilege to ServiceAccounts (RBAC)
Mount Secrets as files rather than environment variables
Use NetworkPolicy to restrict inter-Pod communication
Scan container images for vulnerabilities (Trivy, Snyk)

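# Container-level securityContext (spec.containers[].securityContext)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000                  # any non-zero UID the image can run as
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]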
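
The NetworkPolicy and Secret items from the list above might be sketched as follows: a default-deny ingress policy, one explicit allow rule, and a Pod that reads its Secret from files. The namespace team-a, the frontend label, port 8080, the Secret name, and the mount path are all hypothetical.

```yaml
# Deny all ingress traffic to every Pod in the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}                  # empty selector matches all Pods
  policyTypes:
    - Ingress
---
# Then allow only frontend Pods to reach my-app on its service port
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-my-app
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend        # hypothetical client label
      ports:
        - protocol: TCP
          port: 8080
---
# Pod that mounts a Secret as read-only files instead of environment variables
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: team-a
  labels:
    app: my-app
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.4.2  # hypothetical image
      volumeMounts:
        - name: app-secrets
          mountPath: /etc/secrets  # each Secret key becomes a file here
          readOnly: true
  volumes:
    - name: app-secrets
      secret:
        secretName: my-app-secrets # hypothetical Secret
```

NetworkPolicy objects only take effect when the cluster's network plugin enforces them; on a plugin without support they are accepted by the API server but have no effect.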

4. Network

Configure TLS on Ingress (cert-manager)
Default Service type to ClusterIP; use LoadBalancer only when external exposure is required
Use NetworkPolicy to control cross-namespace access
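Check Pod DNS settings, in particular ndots (the default of 5 sends external names through every search domain before resolving them as-is, multiplying lookups)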
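
A rough sketch of three of the network items follows: TLS on an Ingress through cert-manager, a cross-namespace NetworkPolicy, and a Pod-level ndots override. It assumes cert-manager and an ingress-nginx controller are installed; the ClusterIssuer name letsencrypt-prod, the host app.example.com, the namespaces, and the ndots value are hypothetical.

```yaml
# Ingress with a certificate issued and renewed by cert-manager
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: team-a
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # hypothetical ClusterIssuer
spec:
  ingressClassName: nginx          # assumes the ingress-nginx controller
  tls:
    - hosts:
        - app.example.com
      secretName: my-app-tls       # cert-manager stores the certificate here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
---
# Allow ingress to my-app only from Pods in the "monitoring" namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-monitoring
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring  # label set automatically on namespaces
---
# Pod that lowers ndots to reduce search-domain lookups for external names
apiVersion: v1
kind: Pod
metadata:
  name: my-app-dns-tuned
  namespace: team-a
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"                 # illustrative value; the cluster default is 5
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.4.2  # hypothetical image
```

Lowering ndots changes how short names resolve, so it is worth confirming that in-cluster names used by the application still resolve as expected before rolling it out.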

5. Storage

Verify the PersistentVolume reclaim policy (Retain is recommended for data you cannot afford to lose; dynamically provisioned volumes default to Delete)
Enable allowVolumeExpansion: true on StorageClass
Configure volumeClaimTemplates when using StatefulSets
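
The three storage items can be combined in one sketch: a StorageClass that retains volumes and allows expansion, and a StatefulSet that claims one volume per replica from it. The provisioner shown is the AWS EBS CSI driver, chosen purely as an example; the names, image, mount path, and size are hypothetical.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retained-expandable
provisioner: ebs.csi.aws.com       # assumption: replace with your cluster's CSI driver
reclaimPolicy: Retain              # keep the volume after the claim is deleted
allowVolumeExpansion: true         # allow growing PVCs later
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db
spec:
  serviceName: my-db               # headless Service, defined separately
  replicas: 3
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
    spec:
      containers:
        - name: my-db
          image: registry.example.com/my-db:1.0.0  # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /var/lib/data
  volumeClaimTemplates:            # one PVC per replica: data-my-db-0, data-my-db-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: retained-expandable
        resources:
          requests:
            storage: 10Gi
```

With Retain, released volumes are not cleaned up automatically, so the cleanup step belongs in the runbook to avoid paying for orphaned disks.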

6. Monitoring & Alerts

Install Prometheus + Grafana
Set up critical alerts: Pod restarts, Node NotReady, HPA at max replicas
Collect logs: Fluent Bit → Elasticsearch/Loki
Verify kubectl top nodes and kubectl top pods work (both require metrics-server to be installed)

# Example alert (Prometheus Alert Rule)
- alert: PodCrashLooping
  expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
  for: 5m
  labels:
    severity: critical
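
The other two alerts named in the list, Node NotReady and HPA at max replicas, could be written along these lines as a Prometheus rule file. The sketch assumes kube-state-metrics is being scraped; the group name, alert names, durations, and severities are illustrative choices.

```yaml
groups:
  - name: kubernetes-critical      # hypothetical group name
    rules:
      - alert: NodeNotReady
        # The Ready condition has not been "true" for the whole window
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"
      - alert: HPAAtMaxReplicas
        # Current replicas have reached the configured ceiling
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas
            >= kube_horizontalpodautoscaler_spec_max_replicas
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }} has been at max replicas for 15 minutes"
```

An HPA pinned at its maximum usually means the workload has no headroom left, so the alert is a prompt to raise maxReplicas or investigate the load rather than a sign of an outage in itself.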

7. Deploy Process

Never use the latest tag — pin version tags
Set imagePullPolicy: Always
Use Helm charts or Kustomize to separate configuration per environment
Apply GitOps (ArgoCD/Flux)
Document and test rollback procedures

# Quick rollback
kubectl rollout undo deployment/my-app

# Roll back to a specific revision
kubectl rollout undo deployment/my-app --to-revision=3
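
For the version-pinning and per-environment items, one possible shape is a Kustomize overlay per environment that pins the image tag and overrides the replica count. The directory layout (a shared base two levels up), the namespace, and the tag are hypothetical.

```yaml
# overlays/production/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base                     # shared manifests used by every environment
images:
  - name: my-app                   # image name as written in the base manifests
    newTag: "1.4.2"                # pinned version instead of latest
replicas:
  - name: my-app                   # Deployment to override
    count: 3
```

Because the tag lives in version control, a rollback under GitOps becomes a revert of the commit that changed newTag, which keeps the cluster and the repository in agreement.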

8. Cost Optimization

Use Spot/Preemptible instances for non-critical workloads
Clean up unused namespaces and resources
Use VPA (VerticalPodAutoscaler) to right-size requests
Configure Cluster Autoscaler
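
A cautious way to start with VPA is recommendation-only mode, sketched below. It assumes the VerticalPodAutoscaler components and CRDs are installed in the cluster, since they are not part of a default Kubernetes install; the object name is hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"              # only publish recommendations, never evict Pods
```

In this mode the recommendations appear in the object's status (for example via kubectl describe vpa my-app-vpa) and can be copied into resources.requests by hand. Letting VPA apply changes automatically on the same CPU or memory metric that an HPA scales on tends to make the two controllers work against each other.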

Quick Cluster Health Check

# Node status
kubectl get nodes -o wide

# Pods that are not Running/Completed
kubectl get pods -A | grep -v Running | grep -v Completed

# Resource utilization
kubectl top nodes
kubectl top pods -A --sort-by=memory

# Recent warning events
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'
