Kubernetes Node Failure Response Guide — From NotReady to Recovery
Step-by-step response when a Kubernetes Node enters the NotReady state. Root-cause diagnosis, workload evacuation, and recovery procedures — a real-world operations guide.
TestForge Team
Detecting Node NotReady
kubectl get nodes
# NAME     STATUS     ROLES    AGE
# node-1   Ready      worker   30d
# node-2   NotReady   worker   30d   ← failure
# node-3   Ready      worker   30d
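To catch this without eyeballing the table, a small script can print only the Nodes whose Ready condition is not True. A minimal sketch using kubectl's JSONPath output:
# Print Nodes whose Ready condition is not "True" (sketch)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.conditions[?(@.type=="Ready")]}{.status}{end}{"\n"}{end}' \
| awk '$2 != "True" {print $1}'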
Immediate Response: Drain Workloads
The first thing to do during a Node failure is to move its workloads to healthy Nodes.
# 1. Mark as unschedulable (stop placing new Pods)
kubectl cordon node-2
# 2. Safely evacuate existing Pods (respects PDB)
kubectl drain node-2 \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60
# If blocked by a PDB: bypass eviction and delete Pods directly (use with caution)
kubectl drain node-2 --disable-eviction --force --ignore-daemonsets --delete-emptydir-data
Warning: --disable-eviction bypasses PodDisruptionBudgets by deleting Pods instead of evicting them, and --force also removes Pods not managed by a controller. Service impact may occur.
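In practice it helps to script the escalation: attempt a graceful drain with a timeout first, and fall back to the PDB-bypassing form only if that fails. A minimal sketch, using the node name from the example above:
# Try a graceful drain first; escalate only if it times out (sketch)
NODE=node-2
if ! kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=120s; then
  echo "Graceful drain failed; bypassing PDBs on $NODE" >&2
  kubectl drain "$NODE" --disable-eviction --force --ignore-daemonsets --delete-emptydir-data
fi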
Root Cause Diagnosis
# Check Node events in detail
kubectl describe node node-2
# Key fields to review:
# Conditions: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable, Ready
# Events: most recent failure events
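For a compact view of just the Conditions block, JSONPath works here too. A sketch:
# Dump each condition's type, status, and reason for the failed Node (sketch)
kubectl get node node-2 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'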
Cause 1: kubelet Stopped
# SSH into the Node
ssh node-2
# Check kubelet status
systemctl status kubelet
# Try restarting
sudo systemctl restart kubelet
sudo journalctl -u kubelet -n 100 --no-pager # Review error logs
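If the restart does not bring the kubelet back, its logs usually name the offending file or flag. A sketch of the usual follow-ups, assuming standard kubeadm paths:
# Filter recent kubelet errors (sketch)
sudo journalctl -u kubelet --since "10 min ago" --no-pager | grep -iE "error|fail" | tail -20
# Sanity-check the config the kubelet was started with (kubeadm default path assumed)
sudo head -30 /var/lib/kubelet/config.yaml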
Cause 2: Disk Exhaustion
# Check disk utilization
df -h
# Find large files/directories
du -sh /var/lib/docker/* 2>/dev/null | sort -rh | head -10
du -sh /var/log/* | sort -rh | head -10
# Clean up unused Docker resources
docker system prune -f
docker volume prune -f
# Clean containerd cache
crictl rmi --prune
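DiskPressure fires when usage crosses the kubelet's eviction thresholds, so it is worth confirming what those thresholds are and how the runtime sees its image filesystem. A sketch, assuming the standard kubeadm config path:
# What the kubelet considers "too full" (kubeadm default path assumed)
grep -A5 evictionHard /var/lib/kubelet/config.yaml
# Image filesystem usage as reported by the container runtime
crictl imagefsinfo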
Cause 3: Memory Exhaustion (OOMKilled)
# Check dmesg for OOM records
dmesg | grep -i oom | tail -20
# Top memory-consuming processes
free -h
top -o %MEM
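If the kubelet itself was the OOM victim, the Node will flap until memory is freed. A quick sketch for confirming what the kernel killed and what the Node's own daemons are using:
# Which processes the kernel actually killed (sketch)
dmesg | grep -i "killed process" | tail -5
# Resident memory of the Node's core daemons
ps -o pid,rss,comm -C kubelet,containerd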
Cause 4: Network Disconnection
# Check CNI plugin status
kubectl get pods -n kube-system | grep -E "calico|flannel|cilium|weave"
# Test Node network connectivity
ping <control-plane-ip>
curl -k https://<control-plane-ip>:6443/healthz
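If the control plane is reachable but Pods cannot talk to each other, the CNI layer on the Node is the next suspect. A sketch, assuming the standard CNI config path:
# CNI config the container runtime loads (standard path assumed)
ls -l /etc/cni/net.d/
# Find the CNI agent Pod scheduled on the failed Node, then read its logs
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=node-2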
Cause 5: Certificate Expiry
# Check certificate expiration dates
kubeadm certs check-expiration
# Renew certificates (run on a control-plane node)
kubeadm certs renew all
# kubeadm then reminds you to restart kube-apiserver, kube-controller-manager,
# kube-scheduler, and etcd; restart the kubelet as well
systemctl restart kubelet
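On clusters not built with kubeadm, the same check can be done with openssl directly. A sketch assuming the kubelet's default certificate path:
# Inspect the kubelet client certificate's expiry directly (default path assumed)
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate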
Recovery Procedure
# 1. Confirm Node is recovered
kubectl get node node-2
# Verify STATUS has returned to Ready
# 2. Re-enable scheduling
kubectl uncordon node-2
# 3. Confirm Pod distribution
kubectl get pods -o wide | grep node-2
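Note that Pods evicted during the drain do not move back on their own; the scheduler only places new Pods there. A sketch for verifying the Node holds steady and takes load again:
# Watch the Node for a few minutes before trusting the recovery
kubectl get node node-2 -w
# A rolling restart lets the scheduler spread Pods back (deployment name is a placeholder)
kubectl rollout restart deployment/<your-deployment>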
Automated Node Recovery
Node Problem Detector + Cluster Autoscaler
# Install Node Problem Detector (Helm)
helm repo add deliveryhero https://charts.deliveryhero.io
helm install node-problem-detector \
deliveryhero/node-problem-detector \
--namespace kube-system
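Once NPD is running, the problems it detects surface as extra Node conditions and events, so the same describe/events workflow applies. A sketch:
# Problems detected by NPD appear as Node conditions and events (sketch)
kubectl describe node node-2 | grep -A10 Conditions
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=node-2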
# Cluster Autoscaler can scale down Nodes that stay unready
# (see --scale-down-unready-time, default 20m); the node group then brings up a replacement.
# Useful flags:
--balance-similar-node-groups=true
--skip-nodes-with-system-pods=false
Automatic Draining on Termination Events (AWS EKS)
# Node Termination Handler
helm repo add eks https://aws.github.io/eks-charts
helm install aws-node-termination-handler \
eks/aws-node-termination-handler \
--namespace kube-system \
--set enableSpotInterruptionDraining=true \
--set enableRebalanceDraining=true
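A quick way to confirm the handler is active on every Node, assuming it was installed in its default DaemonSet (IMDS) mode:
# Verify the handler DaemonSet covers all Nodes (sketch)
kubectl get daemonset -n kube-system aws-node-termination-handler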
Prevention: Node Monitoring Alerts
# Prometheus alert rules
- alert: KubernetesNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is NotReady"
- alert: KubernetesNodeDiskPressure
  expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
  for: 1m
  labels:
    severity: warning
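Before wiring these into Alertmanager, the expressions can be spot-checked against a live Prometheus. A sketch with a placeholder URL:
# Spot-check the alert expression (Prometheus URL is a placeholder)
curl -s 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=kube_node_status_condition{condition="Ready",status="true"} == 0'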
Response Checklist
| Step | Command |
|---|---|
| 1. Evacuate workloads | kubectl drain node-2 --ignore-daemonsets |
| 2. Identify root cause | kubectl describe node node-2 |
| 3. Check kubelet | systemctl status kubelet |
| 4. Check disk/memory | df -h, free -h |
| 5. Restore after recovery | kubectl uncordon node-2 |