Kubernetes CrashLoopBackOff: Complete Root Cause Analysis Guide
Step-by-step diagnosis of CrashLoopBackOff — from OOMKilled and missing config to liveness probe misconfigurations. Includes kubectl commands and real-world patterns.
CrashLoopBackOff is one of the most common pod states in Kubernetes — and one of the most frustrating to diagnose. The pod crashes, Kubernetes restarts it, it crashes again, and the backoff timer grows exponentially. This guide walks through every root cause systematically.
Understanding CrashLoopBackOff
When a container exits, the kubelet restarts it according to the pod's restartPolicy (Always restarts even on a clean exit; OnFailure only on non-zero codes). If it keeps failing, the restart interval doubles: 10s → 20s → 40s → 80s → 160s → 300s (capped at five minutes), and resets once the container runs cleanly for 10 minutes. That escalating delay is the "BackOff."
kubectl get pods -n your-namespace
# NAME READY STATUS RESTARTS AGE
# api-server-7d9f8b6c4-xk2p9 0/1 CrashLoopBackOff 8 12m
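The escalating delay can be sketched in plain shell — a rough model of the schedule, not the kubelet's actual implementation:

```shell
# Rough model of the kubelet restart backoff: the delay doubles after
# each crash and is capped at 300s (5 minutes).
delay=10
for crash in 1 2 3 4 5 6 7; do
  echo "crash $crash: next restart in ${delay}s"
  delay=$((delay * 2))
  [ "$delay" -gt 300 ] && delay=300
done
```

By crash 6 you are waiting five minutes between attempts, which is why a pod can show a high restart count yet sit "idle" for long stretches.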
Step 1: Read the Exit Code
Exit codes tell you the category of failure before you read a single log line.
kubectl describe pod <pod-name> -n <namespace>
Look for Last State → Exit Code:
| Exit Code | Meaning |
|---|---|
| 1 | Application error (unhandled exception, missing config) |
| 2 | Misuse of a shell builtin |
| 126 | Command found but not executable |
| 127 | Command not found (wrong entrypoint/image) |
| 128 | Invalid argument to exit (code out of range) |
| 137 | SIGKILL — almost always OOMKilled |
| 139 | Segfault (SIGSEGV) |
| 143 | SIGTERM — graceful shutdown (liveness probe or preStop issue) |
| 255 | Application called exit(-1) |
Exit code 137 means the container was killed by the kernel — jump straight to OOMKilled.
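You can see the 128 + signal convention locally, no cluster needed — a POSIX shell reports a signal-killed child as 128 plus the signal number, which is exactly what the kubelet records as the container exit code:

```shell
# A process killed by a signal exits with status 128 + signal number.
code=$(sh -c 'kill -KILL $$' 2>/dev/null; echo $?)
echo "SIGKILL -> exit code $code"   # 128 + 9  = 137
code=$(sh -c 'kill -TERM $$' 2>/dev/null; echo $?)
echo "SIGTERM -> exit code $code"   # 128 + 15 = 143
```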
Step 2: Read Logs — Current and Previous
# Current crash logs
kubectl logs <pod-name> -n <namespace>
# Previous container's logs (before the restart — often the most useful)
kubectl logs <pod-name> -n <namespace> --previous
# If the pod has multiple containers
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous
The --previous flag is critical. By the time you run kubectl logs, the pod has often already restarted and the current logs are nearly empty.
Root Cause 1: OOMKilled (Exit Code 137) {#oomkilled}
The Linux kernel killed the process because it exceeded the container’s memory limit.
kubectl describe pod <pod-name> | grep -A5 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Diagnosis:
# Check memory limits vs actual usage
kubectl top pod <pod-name> -n <namespace>
# Check node memory pressure
kubectl describe node <node-name> | grep -A10 "Conditions"
Fix options:
# Option 1: Increase memory limit
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # was 256Mi
# Option 2: Fix the memory leak in your application
# Option 3: Enable JVM heap limits (for Java apps)
env:
  - name: JAVA_OPTS
    value: "-XX:MaxRAMPercentage=75.0 -XX:+UseContainerSupport"
JVM trap: Without -XX:+UseContainerSupport (introduced in Java 10 and backported to Java 8u191, where it is enabled by default), the JVM sizes its heap from host memory instead of the container limit and allocates a heap that overflows the container.
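The headroom arithmetic is worth making explicit; a quick sketch assuming a 512Mi limit:

```shell
# With a 512Mi container limit, -XX:MaxRAMPercentage=75.0 caps the heap
# at roughly 75% of the limit; the remaining ~25% is left for metaspace,
# thread stacks, and native buffers.
limit_mib=512
heap_mib=$((limit_mib * 75 / 100))
echo "max heap: ${heap_mib} MiB"   # max heap: 384 MiB
```

If that ~25% slice is too small for your thread count and native allocations, the container can still be OOMKilled even though the heap itself never fills.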
Root Cause 2: Application Startup Failure (Exit Code 1)
The most common cause. Your application starts, throws an exception, and exits.
Common triggers:
Missing environment variables
kubectl logs <pod-name> --previous | grep -iE "error|fatal|exception|not found"
# Common patterns:
Error: DATABASE_URL environment variable is required
IllegalArgumentException: Required config property 'spring.datasource.url' not found
panic: runtime error: invalid memory address (nil pointer dereference)
Fix:
# Verify env vars are injected
kubectl exec <pod-name> -- env | grep DATABASE
# Check if secret/configmap exists
kubectl get secret db-credentials -n <namespace>
kubectl get configmap app-config -n <namespace>
# Verify your Pod spec references are correct
envFrom:
  - secretRef:
      name: db-credentials  # must match kubectl get secret output
  - configMapRef:
      name: app-config
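To make a missing variable fail loudly at startup instead of surfacing as a stack trace, a shell guard in the entrypoint helps. A minimal sketch — the DATABASE_URL name and value here are illustrative:

```shell
#!/bin/sh
# Hypothetical entrypoint guard: ${VAR:?message} aborts the script with
# a clear error when VAR is unset or empty.
unset DATABASE_URL
if sh -c ': "${DATABASE_URL:?DATABASE_URL is required}"' 2>/dev/null; then
  echo "guard passed"
else
  echo "missing var -> guard exits non-zero"
fi

export DATABASE_URL="postgres://db:5432/app"   # illustrative value
if sh -c ': "${DATABASE_URL:?DATABASE_URL is required}"'; then
  echo "var present -> startup continues"
fi
```

The error message lands in the container log and in kubectl logs --previous, so the next person debugging sees the cause immediately.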
Database connection failure at startup
Many frameworks (Spring Boot, Django, Rails) try to connect to the database during startup and crash if it fails.
HikariPool: Exception during pool initialization
org.postgresql.util.PSQLException: Connection to postgres:5432 refused
Fix — add startup resilience:
# For Spring Boot: retry connection
env:
  # Spring relaxed binding: connection-timeout becomes CONNECTIONTIMEOUT
  # (env var names use uppercase and underscores only, no hyphens)
  - name: SPRING_DATASOURCE_HIKARI_CONNECTIONTIMEOUT
    value: "30000"
  - name: SPRING_DATASOURCE_HIKARI_INITIALIZATIONFAILTIMEOUT
    value: "60000"
Or use an init container to wait for the database:
initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z postgres-svc 5432; do echo waiting; sleep 2; done']
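One caveat: that loop waits forever, so the pod just sits in Init if the database never appears. A bounded variant (a sketch — the service name and attempt count are illustrative) fails the init container instead, which shows up clearly in kubectl describe:

```yaml
initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        i=0
        until nc -z postgres-svc 5432; do
          i=$((i + 1))
          if [ "$i" -ge 60 ]; then echo "database never came up"; exit 1; fi
          echo "waiting ($i/60)"; sleep 2
        done
```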
Root Cause 3: Wrong Entrypoint / Command Not Found (Exit Code 127)
kubectl describe pod <pod-name> | grep -A3 "Command\|Args"
# Pod logs show:
exec: "/app/start.sh": stat /app/start.sh: no such file or directory
Common causes:
- Wrong image tag (using latest and getting a different build)
- Script not copied into the image
- Wrong working directory assumption
# Debug by overriding the entrypoint
kubectl run debug --image=your-image:tag --restart=Never -it --rm \
-- /bin/sh
# Inside the container:
ls /app/
which your-binary
Root Cause 4: Liveness Probe Killing the Pod (Exit Code 143)
If your liveness probe fails, Kubernetes sends SIGTERM (and eventually SIGKILL) to the container. The pod appears to crash, but the application itself is fine — the probe threshold is wrong.
kubectl describe pod <pod-name> | grep -A10 "Liveness"
# Liveness: http-get http://:8080/health delay=5s timeout=1s period=10s #success=1 #failure=3
# Events section will show:
# Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
# Normal Killing Container api killed on liveness probe failure
Fix — tune probe timing:
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60  # give app time to start (was 5)
  periodSeconds: 10
  timeoutSeconds: 5        # increase if health check is slow (was 1)
  failureThreshold: 3
startupProbe:              # use this for slow-starting apps
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30     # 30 * 10s = 5 minutes to start
  periodSeconds: 10
Key insight: startupProbe disables the liveness and readiness probes until it succeeds. Use it for any app that takes more than 30 seconds to start (JVM, Python with large imports, etc.).
Root Cause 5: Resource Limits Too Low (Throttling → Timeout)
CPU throttling doesn’t kill the pod directly but can cause liveness probe timeouts if the container can’t respond within the timeout window.
# Check CPU throttling
kubectl top pod <pod-name>
# If a HorizontalPodAutoscaler watches this workload, check its utilization view
kubectl describe hpa -n <namespace>
# Node-level throttling check
kubectl describe node <node-name> | grep -A5 "Allocated resources"
Fix:
resources:
  requests:
    cpu: "250m"      # what the scheduler reserves
    memory: "256Mi"
  limits:
    cpu: "1000m"     # allow bursting (no CPU limit = no throttling, but risky)
    memory: "512Mi"
Root Cause 6: Image Pull Errors (Not Exactly CrashLoop, but Common)
kubectl describe pod <pod-name> | grep -A5 "Events"
# Warning Failed Failed to pull image "registry.example.com/app:v1.2.3":
# rpc error: code = Unknown desc = failed to pull and unpack image:
# unexpected status code 401
Fix:
# Create registry secret
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=<user> \
--docker-password=<token> \
-n <namespace>
# Reference in pod spec
imagePullSecrets:
  - name: regcred
Diagnostic Runbook (Quick Reference)
# 1. Get pod status and restart count
kubectl get pod <pod> -n <ns> -o wide
# 2. Get exit code and events
kubectl describe pod <pod> -n <ns>
# 3. Read previous container logs
kubectl logs <pod> -n <ns> --previous
# 4. Check resource usage
kubectl top pod <pod> -n <ns>
# 5. Interactive debug (override entrypoint)
kubectl debug -it <pod> --image=busybox --target=<container> -n <ns>
# 6. Create a debuggable copy of the pod with extra tooling and a shared process namespace
kubectl debug -it <pod> --image=nicolaka/netshoot --share-processes --copy-to=debug-pod
TestForge Integration
Once your pods are stable, load testing reveals the next layer of issues: memory pressure under concurrent load, connection pool exhaustion, or GC pauses that trigger liveness probe failures at scale.
TestForge scans your API endpoints, generates realistic load scenarios, and surfaces these issues before your users do — with no manual test script writing required.
Related: JVM Memory Tuning for Containers · Spring WebFlux Performance