TestForge Blog

Kubernetes Node Failure Response Guide — From NotReady to Recovery

Step-by-step response when a Kubernetes Node enters NotReady state. Root cause diagnosis, workload evacuation, and recovery procedures — a real-world operations guide.

TestForge Team

Detecting Node NotReady

kubectl get nodes
# NAME          STATUS     ROLES    AGE
# node-1        Ready      worker   30d
# node-2        NotReady   worker   30d  ← failure
# node-3        Ready      worker   30d
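In a large cluster, it helps to filter the listing down to only the failed nodes. A minimal sketch — run here against a captured sample of the output above so the pipeline itself is verifiable; in a live cluster, pipe `kubectl get nodes --no-headers` into the same awk filter:

```shell
# Extract the names of NotReady nodes from `kubectl get nodes` output.
# Live equivalent: kubectl get nodes --no-headers | awk '$2 == "NotReady" {print $1}'
sample='node-1 Ready worker 30d
node-2 NotReady worker 30d
node-3 Ready worker 30d'
printf '%s\n' "$sample" | awk '$2 == "NotReady" {print $1}'
# → node-2
```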

Immediate Response: Drain Workloads

The first thing to do during a Node failure is to move its workloads to healthy Nodes.

# 1. Mark as unschedulable (stop placing new Pods)
kubectl cordon node-2

# 2. Safely evacuate existing Pods (respects PDB)
kubectl drain node-2 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60

# If blocked by PDB — force (use with caution)
kubectl drain node-2 --force --ignore-daemonsets --delete-emptydir-data

Warning: --force ignores PodDisruptionBudgets. Service impact may occur.
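Before reaching for `--force`, it is worth finding out which budget is actually blocking the drain. A hedged sketch: pull each PDB's `status.disruptionsAllowed` (e.g. via `kubectl get pdb -A` with custom columns) and flag anything at zero — the sample input lines below are illustrative:

```shell
# Flag PodDisruptionBudgets that currently allow zero disruptions (drain blockers).
# Input format assumed: "<namespace> <pdb-name> <disruptionsAllowed>" per line, e.g. from:
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed
printf 'prod web-pdb 0\nprod api-pdb 1\n' |
  awk '$3 == 0 {print $1 "/" $2 " is blocking drain"}'
# → prod/web-pdb is blocking drain
```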

Root Cause Diagnosis

# Check Node events in detail
kubectl describe node node-2

# Key fields to review:
# Conditions: MemoryPressure, DiskPressure, PIDPressure, Ready
# Events: most recent failure events

Cause 1: kubelet Stopped

# SSH into the Node
ssh node-2

# Check kubelet status
systemctl status kubelet

# Try restarting
sudo systemctl restart kubelet
sudo journalctl -u kubelet -n 100 --no-pager  # Review error logs

Cause 2: Disk Exhaustion

# Check disk utilization
df -h

# Find large files/directories
du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -10
du -sh /var/log/* | sort -rh | head -10

# Clean up unused Docker resources
docker system prune -f
docker volume prune -f

# Clean containerd cache
crictl rmi --prune

Cause 3: Memory Exhaustion (OOMKilled)

# Check dmesg for OOM records
dmesg | grep -i oom | tail -20

# Top memory-consuming processes
free -h
top -o %MEM

Cause 4: Network Disconnection

# Check CNI plugin status
kubectl get pods -n kube-system | grep -E "calico|flannel|cilium|weave"

# Test Node network connectivity
ping <control-plane-ip>
curl -k https://<control-plane-ip>:6443/healthz

Cause 5: Certificate Expiry

# Check certificate expiration dates
kubeadm certs check-expiration

# Renew certificates
kubeadm certs renew all
systemctl restart kubelet

Recovery Procedure

# 1. Confirm Node is recovered
kubectl get node node-2
# Verify STATUS has returned to Ready

# 2. Re-enable scheduling
kubectl uncordon node-2

# 3. Confirm Pod distribution
kubectl get pods -o wide | grep node-2
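The three steps above can be folded into a small wait loop that uncordons the Node only once it reports Ready. This is a sketch with a stubbed node_status function so the loop logic is self-contained; the comments show the live kubectl equivalents:

```shell
# Wait until a node reports Ready, then re-enable scheduling.
# Stubbed status check for illustration; live version:
#   node_status() { kubectl get node "$1" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'; }
node_status() { echo "True"; }   # stub: pretend the node is already Ready

wait_and_uncordon() {
  node=$1
  until [ "$(node_status "$node")" = "True" ]; do
    echo "waiting for $node to become Ready..."
    sleep 10
  done
  echo "kubectl uncordon $node"   # live: kubectl uncordon "$node"
}

wait_and_uncordon node-2
# → kubectl uncordon node-2
```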

Automated Node Recovery

Node Problem Detector + Cluster Autoscaler

# Install Node Problem Detector (Helm)
helm repo add deliveryhero https://charts.deliveryhero.io/
helm install node-problem-detector \
  deliveryhero/node-problem-detector \
  --namespace kube-system

# Cluster Autoscaler can replace NotReady Nodes automatically.
# Relevant flags (set on the cluster-autoscaler Deployment):
--balance-similar-node-groups=true
--skip-nodes-with-system-pods=false
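The two flags above are passed as container arguments on the Cluster Autoscaler Deployment; a sketch of the relevant fragment (image tag illustrative):

```yaml
# cluster-autoscaler Deployment (fragment) — flag placement only
spec:
  containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0  # tag illustrative
      command:
        - ./cluster-autoscaler
        - --balance-similar-node-groups=true
        - --skip-nodes-with-system-pods=false
```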

Readiness-Based Auto-Removal (AWS EKS)

# Node Termination Handler (chart from the AWS eks-charts repo)
helm repo add aws https://aws.github.io/eks-charts
helm install aws-node-termination-handler \
  aws/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableRebalanceDraining=true

Prevention: Node Monitoring Alerts

# Prometheus Alert Rules
- alert: KubernetesNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is NotReady"

- alert: KubernetesNodeDiskPressure
  expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} is under DiskPressure"

Response Checklist

Step                         Command
1. Evacuate workloads        kubectl drain node-2 --ignore-daemonsets
2. Identify root cause       kubectl describe node node-2
3. Check kubelet             systemctl status kubelet
4. Check disk/memory         df -h, free -h
5. Restore after recovery    kubectl uncordon node-2