Kubernetes CrashLoopBackOff: Complete Root Cause Analysis Guide
Step-by-step diagnosis of CrashLoopBackOff — from OOMKilled and missing config to liveness probe misconfigurations. Includes kubectl commands and real-world patterns.
CrashLoopBackOff is one of the most common pod states in Kubernetes — and one of the most frustrating to diagnose. The pod crashes, Kubernetes restarts it, it crashes again, and the backoff timer grows exponentially. This guide walks through every root cause systematically.
Understanding CrashLoopBackOff
When a container exits, the kubelet restarts it according to the pod's restartPolicy (Always restarts even on a clean exit; OnFailure only on non-zero codes). If it keeps failing, the restart interval doubles: 10s → 20s → 40s → 80s → 160s → 300s (capped at five minutes), and resets once the container runs cleanly for 10 minutes. That escalating delay is the "BackOff."
kubectl get pods -n your-namespace
# NAME READY STATUS RESTARTS AGE
# api-server-7d9f8b6c4-xk2p9 0/1 CrashLoopBackOff 8 12m
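The escalating delay can be sketched in plain shell — a rough model of the schedule, not the kubelet's actual implementation:

```shell
# Rough model of the kubelet restart backoff: the delay doubles after
# each crash and is capped at 300s (5 minutes).
delay=10
for crash in 1 2 3 4 5 6 7; do
  echo "crash $crash: next restart in ${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
```

By crash 6 you are waiting five minutes between attempts, which is why a pod can show a high restart count yet sit "idle" for long stretches.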
Step 1: Read the Exit Code
Exit codes tell you the category of failure before you read a single log line.
kubectl describe pod <pod-name> -n <namespace>
Look for Last State → Exit Code:
| Exit Code | Meaning |
|---|---|
| 1 | Application error (unhandled exception, missing config) |
| 2 | Misuse of a shell builtin |
| 126 | Command found but not executable |
| 127 | Command not found (wrong entrypoint/image) |
| 128 | Invalid argument to exit (code out of range) |
| 137 | SIGKILL — almost always OOMKilled |
| 139 | Segfault (SIGSEGV) |
| 143 | SIGTERM — graceful shutdown (liveness probe or preStop issue) |
| 255 | Application called exit(-1) |
Exit code 137 means the container was killed by the kernel — jump straight to OOMKilled.
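You can see the 128 + signal convention locally, no cluster needed — a POSIX shell reports a signal-killed child as 128 plus the signal number, which is exactly what the kubelet records as the container exit code:

```shell
# A process killed by a signal exits with status 128 + signal number.
code=$(sh -c 'kill -KILL $$' 2>/dev/null; echo $?)
echo "SIGKILL -> exit code $code"   # 128 + 9  = 137
code=$(sh -c 'kill -TERM $$' 2>/dev/null; echo $?)
echo "SIGTERM -> exit code $code"   # 128 + 15 = 143
```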
Step 2: Read Logs — Current and Previous
# Current crash logs
kubectl logs <pod-name> -n <namespace>
# Previous container's logs (before the restart — often the most useful)
kubectl logs <pod-name> -n <namespace> --previous
# If the pod has multiple containers
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous
The --previous flag is critical. By the time you run kubectl logs, the pod has often already restarted and the current logs are nearly empty.
Root Cause 1: OOMKilled (Exit Code 137) {#oomkilled}
The Linux kernel killed the process because it exceeded the container’s memory limit.
kubectl describe pod <pod-name> | grep -A5 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Diagnosis:
# Check memory limits vs actual usage
kubectl top pod <pod-name> -n <namespace>
# Check node memory pressure
kubectl describe node <node-name> | grep -A10 "Conditions"
Fix options:
# Option 1: Increase memory limit
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # was 256Mi
# Option 2: Fix the memory leak in your application
# Option 3: Enable JVM heap limits (for Java apps)
env:
  - name: JAVA_OPTS
    value: "-XX:MaxRAMPercentage=75.0 -XX:+UseContainerSupport"
JVM trap: Without -XX:+UseContainerSupport (introduced in Java 10 and backported to Java 8u191, where it is enabled by default), the JVM sizes its heap from host memory instead of the container limit and allocates a heap that overflows the container.
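The headroom arithmetic is worth making explicit; a quick sketch assuming a 512Mi limit:

```shell
# With a 512Mi container limit, -XX:MaxRAMPercentage=75.0 caps the heap
# at roughly 75% of the limit; the remaining ~25% is left for metaspace,
# thread stacks, and native buffers.
limit_mib=512
heap_mib=$((limit_mib * 75 / 100))
echo "max heap: ${heap_mib} MiB"   # max heap: 384 MiB
```

If that ~25% slice is too small for your thread count and native allocations, the container can still be OOMKilled even though the heap itself never fills.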
Root Cause 2: Application Startup Failure (Exit Code 1)
The most common cause. Your application starts, throws an exception, and exits.
Common triggers:
Missing environment variables
kubectl logs <pod-name> --previous | grep -iE "error|fatal|exception|not found"
# Common patterns:
Error: DATABASE_URL environment variable is required
IllegalArgumentException: Required config property 'spring.datasource.url' not found
panic: runtime error: invalid memory address (nil pointer dereference)
Fix:
# Verify env vars are injected
kubectl exec <pod-name> -- env | grep DATABASE
# Check if secret/configmap exists
kubectl get secret db-credentials -n <namespace>
kubectl get configmap app-config -n <namespace>
# Verify your Pod spec references are correct
envFrom:
  - secretRef:
      name: db-credentials  # must match kubectl get secret output
  - configMapRef:
      name: app-config
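To make a missing variable fail loudly at startup instead of surfacing as a stack trace, a shell guard in the entrypoint helps. A minimal sketch — the DATABASE_URL name and value here are illustrative:

```shell
#!/bin/sh
# Hypothetical entrypoint guard: ${VAR:?message} aborts the script with
# a clear error when VAR is unset or empty.
unset DATABASE_URL
if sh -c ': "${DATABASE_URL:?DATABASE_URL is required}"' 2>/dev/null; then
  echo "guard passed"
else
  echo "missing var -> guard exits non-zero"
fi

export DATABASE_URL="postgres://db:5432/app"   # illustrative value
if sh -c ': "${DATABASE_URL:?DATABASE_URL is required}"'; then
  echo "var present -> startup continues"
fi
```

The error message lands in the container log and in kubectl logs --previous, so the next person debugging sees the cause immediately.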
Database connection failure at startup
Many frameworks (Spring Boot, Django, Rails) try to connect to the database during startup and crash if it fails.
HikariPool: Exception during pool initialization
org.postgresql.util.PSQLException: Connection to postgres:5432 refused
Fix — add startup resilience:
# For Spring Boot: retry connection
env:
  # Spring relaxed binding: connection-timeout becomes CONNECTIONTIMEOUT
  # (env var names use uppercase and underscores only, no hyphens)
  - name: SPRING_DATASOURCE_HIKARI_CONNECTIONTIMEOUT
    value: "30000"
  - name: SPRING_DATASOURCE_HIKARI_INITIALIZATIONFAILTIMEOUT
    value: "60000"
Or use an init container to wait for the database:
initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z postgres-svc 5432; do echo waiting; sleep 2; done']
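One caveat: that loop waits forever, so the pod just sits in Init if the database never appears. A bounded variant (a sketch — the service name and attempt count are illustrative) fails the init container instead, which shows up clearly in kubectl describe:

```yaml
initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        i=0
        until nc -z postgres-svc 5432; do
          i=$((i + 1))
          if [ "$i" -ge 60 ]; then echo "database never came up"; exit 1; fi
          echo "waiting ($i/60)"; sleep 2
        done
```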
Root Cause 3: Wrong Entrypoint / Command Not Found (Exit Code 127)
kubectl describe pod <pod-name> | grep -A3 "Command\|Args"
# Pod logs show:
exec: "/app/start.sh": stat /app/start.sh: no such file or directory
Common causes:
- Wrong image tag (using latest and getting a different build)
- Script not copied into the image
- Wrong working directory assumption
# Debug by overriding the entrypoint
kubectl run debug --image=your-image:tag --restart=Never -it --rm \
-- /bin/sh
# Inside the container:
ls /app/
which your-binary
Root Cause 4: Liveness Probe Killing the Pod (Exit Code 143)
If your liveness probe fails, Kubernetes sends SIGTERM (and eventually SIGKILL) to the container. The pod appears to crash, but the application itself is fine — the probe threshold is wrong.
kubectl describe pod <pod-name> | grep -A10 "Liveness"
# Liveness: http-get http://:8080/health delay=5s timeout=1s period=10s #success=1 #failure=3
# Events section will show:
# Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
# Normal Killing Container api killed on liveness probe failure
Fix — tune probe timing:
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60  # give app time to start (was 5)
  periodSeconds: 10
  timeoutSeconds: 5        # increase if health check is slow (was 1)
  failureThreshold: 3
startupProbe:              # use this for slow-starting apps
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30     # 30 * 10s = 5 minutes to start
  periodSeconds: 10
Key insight: startupProbe disables the liveness and readiness probes until it succeeds. Use it for any app that takes more than 30 seconds to start (JVM, Python with large imports, etc.).
Root Cause 5: Resource Limits Too Low (Throttling → Timeout)
CPU throttling doesn’t kill the pod directly but can cause liveness probe timeouts if the container can’t respond within the timeout window.
# Check CPU throttling
kubectl top pod <pod-name>
# If a HorizontalPodAutoscaler watches this workload, check its utilization view
kubectl describe hpa -n <namespace>
# Node-level throttling check
kubectl describe node <node-name> | grep -A5 "Allocated resources"
Fix:
resources:
  requests:
    cpu: "250m"      # what the scheduler reserves
    memory: "256Mi"
  limits:
    cpu: "1000m"     # allow bursting (no CPU limit = no throttling, but risky)
    memory: "512Mi"
Root Cause 6: Image Pull Errors (Not Exactly CrashLoop, but Common)
kubectl describe pod <pod-name> | grep -A5 "Events"
# Warning Failed Failed to pull image "registry.example.com/app:v1.2.3":
# rpc error: code = Unknown desc = failed to pull and unpack image:
# unexpected status code 401
Fix:
# Create registry secret
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=<user> \
--docker-password=<token> \
-n <namespace>
# Reference in pod spec
imagePullSecrets:
  - name: regcred
Diagnostic Runbook (Quick Reference)
# 1. Get pod status and restart count
kubectl get pod <pod> -n <ns> -o wide
# 2. Get exit code and events
kubectl describe pod <pod> -n <ns>
# 3. Read previous container logs
kubectl logs <pod> -n <ns> --previous
# 4. Check resource usage
kubectl top pod <pod> -n <ns>
# 5. Interactive debug (override entrypoint)
kubectl debug -it <pod> --image=busybox --target=<container> -n <ns>
# 6. Create a debuggable copy of the pod with extra tooling and a shared process namespace
kubectl debug -it <pod> --image=nicolaka/netshoot --share-processes --copy-to=debug-pod
TestForge Integration
Once your pods are stable, load testing reveals the next layer of issues: memory pressure under concurrent load, connection pool exhaustion, or GC pauses that trigger liveness probe failures at scale.
TestForge scans your API endpoints, generates realistic load scenarios, and surfaces these issues before your users do — with no manual test script writing required.
Related: JVM Memory Tuning for Containers · Spring WebFlux Performance