Redis Failure Analysis — 7 Real-World Failure Patterns in Production
Failure patterns you actually encounter when running Redis in production, and how to diagnose them. Case-by-case solutions for OOM, connection exhaustion, blocked clients, replication lag, and more.
TestForge Team
Failure Pattern 1: OOM (Out of Memory)
MISCONF Redis is configured to save RDB snapshots,
but it is currently not able to persist on disk.
Or Redis simply hits maxmemory and starts refusing writes:
OOM command not allowed when used memory > 'maxmemory'.
# Check current memory state
redis-cli info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"
# Find large keys
redis-cli --bigkeys
# Analyze memory usage of a specific key
redis-cli memory usage <key>
Root Causes:
- No maxmemory set (unbounded growth)
- maxmemory-policy: noeviction + write spike
- Memory fragmentation (fragmentation_ratio > 1.5)
Fix:
maxmemory 4gb
# For cache use cases (redis.conf comments must sit on their own line)
maxmemory-policy allkeys-lru
# Defragmentation recovery (Redis 4.0+)
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
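The root causes above can be checked mechanically from the text that redis-cli info memory prints. The following is a minimal sketch, not an official tool: the function names parse_info and memory_risks are made up for illustration, and it assumes the maxmemory, maxmemory_policy, and mem_fragmentation_ratio fields are present in the output.

```python
def parse_info(text):
    """Parse `redis-cli info` output into a dict of field name -> string value."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        fields[name] = value
    return fields


def memory_risks(info_text):
    """Return which of the three root causes apply (hypothetical helper)."""
    info = parse_info(info_text)
    risks = []
    if int(info.get("maxmemory", "0")) == 0:
        risks.append("no maxmemory set")
    elif info.get("maxmemory_policy") == "noeviction":
        risks.append("noeviction policy with a memory limit")
    if float(info.get("mem_fragmentation_ratio", "1.0")) > 1.5:
        risks.append("fragmentation ratio above 1.5")
    return risks


sample = """# Memory
used_memory:1048576
maxmemory:0
maxmemory_policy:noeviction
mem_fragmentation_ratio:1.82
"""
print(memory_risks(sample))
```

With the sample text, the sketch reports the missing limit and the high fragmentation ratio. In practice the input would come from piping redis-cli info memory into the script.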
Failure Pattern 2: Connection Exhaustion
ERR max number of clients reached
# Current connection count
redis-cli info clients | grep connected_clients
# Connection details (which hosts are connected)
redis-cli client list | awk '{print $2}' | sed -E 's/^addr=//; s/:[0-9]+$//' | sort | uniq -c | sort -rn
Root Causes:
- Application not returning connections (no connection pool)
- maxclients default (10000) but OS file descriptor limit is lower
Fix:
maxclients 20000
# Check OS file descriptor limit
ulimit -n
# Set in Redis systemd service
LimitNOFILE=65535
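To see why the file descriptor limit matters, it helps to compute the client ceiling Redis can actually honor. Redis reserves 32 descriptors for its own use and lowers maxclients when it cannot raise the process limit far enough. The sketch below models only that arithmetic; the function name effective_maxclients is an assumption for illustration. It also reads the limit of the current shell, which may differ from the limit of the redis-server process.

```python
import resource

# Descriptors Redis keeps for itself (log files, listening sockets, replication links).
RESERVED_FDS = 32


def effective_maxclients(configured_maxclients, nofile_limit):
    """Client limit that fits under a given open-file limit (hypothetical helper)."""
    if nofile_limit == resource.RLIM_INFINITY:
        return configured_maxclients
    return max(0, min(configured_maxclients, nofile_limit - RESERVED_FDS))


# Soft limit of the current process, the same number `ulimit -n` prints.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(effective_maxclients(20000, soft_limit))
```

With a limit of 1024, a configured maxclients of 10000 shrinks to 992, which is why raising LimitNOFILE has to accompany any maxclients increase.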
Failure Pattern 3: Slow Commands (Blocked Clients)
# Check slow log (default threshold: 10ms)
redis-cli slowlog get 20
redis-cli slowlog len
# Monitor currently running commands
redis-cli monitor # Warning: has performance impact in production
Dangerous commands:
- KEYS * → replace with SCAN
- SMEMBERS (large Set) → use SSCAN in chunks
- LRANGE 0 -1 (full range) → limit the range
# Use SCAN instead of KEYS *
redis-cli scan 0 match "user:*" count 100
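A single SCAN call returns only one page of results, so application code has to keep calling it with the returned cursor until the cursor comes back as 0. Here is a minimal sketch of that loop. It assumes a client whose scan(cursor, match, count) method returns a (next_cursor, keys) pair, as redis-py does. FakeRedis is a made-up stand-in so the example runs without a server.

```python
import fnmatch


def scan_keys(client, pattern, count=100):
    """Yield matching keys page by page instead of blocking like KEYS."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:  # the server signals the end of the iteration with cursor 0
            break


class FakeRedis:
    """Hypothetical in-memory stand-in that pages through a sorted key list."""

    def __init__(self, keys, page_size=2):
        self._keys = sorted(keys)
        self._page_size = page_size

    def scan(self, cursor=0, match=None, count=None):
        page = self._keys[cursor:cursor + self._page_size]
        next_cursor = cursor + self._page_size
        if next_cursor >= len(self._keys):
            next_cursor = 0
        return next_cursor, [k for k in page if fnmatch.fnmatch(k, match or "*")]


fake = FakeRedis(["user:1", "user:2", "order:9", "user:3"])
print(list(scan_keys(fake, "user:*")))
```

A real server may return the same key more than once during a scan, so callers that need uniqueness should deduplicate. redis-py also ships a scan_iter helper that wraps the same loop.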
Failure Pattern 4: Replication Lag
The replica cannot keep up with the master.
redis-cli info replication | grep -E "master_link_status|master_repl_offset|slave_repl_offset"
# Check replication backlog size
redis-cli info replication | grep repl_backlog
Root Causes:
- Insufficient network bandwidth
- CPU/IO bottleneck on the replica server
- repl-backlog-size too small (causes Full Resync on reconnect)
Fix:
# Default is 1mb; increase it
repl-backlog-size 256mb
# Send the snapshot directly over the socket, no disk I/O
repl-diskless-sync yes
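One way to quantify lag is to run info replication on the master and subtract each replica's acknowledged offset from master_repl_offset. The sketch below assumes the master lists replicas as slaveN:ip=...,port=...,state=...,offset=...,lag=... lines. The function name replica_lag_bytes is invented for illustration.

```python
import re

# Matches per-replica lines such as "slave0:ip=10.0.0.2,port=6379,...".
REPLICA_LINE = re.compile(r"^slave\d+:(.+)$")


def replica_lag_bytes(master_info_text):
    """Map "ip:port" of each replica to how many bytes it trails the master."""
    master_offset = None
    replica_offsets = {}
    for raw in master_info_text.splitlines():
        line = raw.strip()
        if line.startswith("master_repl_offset:"):
            master_offset = int(line.split(":", 1)[1])
            continue
        match = REPLICA_LINE.match(line)
        if match:
            attrs = dict(pair.split("=", 1) for pair in match.group(1).split(","))
            replica_offsets[f"{attrs['ip']}:{attrs['port']}"] = int(attrs["offset"])
    if master_offset is None:
        return {}
    return {addr: master_offset - off for addr, off in replica_offsets.items()}


sample = """# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
slave1:ip=10.0.0.3,port=6379,state=online,offset=400,lag=3
master_repl_offset:1000
"""
print(replica_lag_bytes(sample))
```

If the byte lag of a replica regularly approaches repl-backlog-size, a short disconnect is likely to end in a full resync, which is a hint that the backlog is undersized.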
Failure Pattern 5: RDB Save Failure
Background save failed
redis-cli info persistence | grep rdb_last_bgsave_status
# rdb_last_bgsave_status:err
Root Causes:
- Insufficient disk space
- fork() failure (out of memory, vm.overcommit_memory not set)
Fix:
# Allow Linux memory overcommit (run as root; takes effect immediately)
echo 1 > /proc/sys/vm/overcommit_memory
# Persist in /etc/sysctl.conf
vm.overcommit_memory = 1
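Both root causes can be checked before a save fails. The sketch below is a rough pre-flight check under stated assumptions: bgsave_warnings is a hypothetical helper, and comparing free disk space against used_memory is a deliberately conservative rule of thumb, since an RDB file is usually smaller than the dataset in memory.

```python
def bgsave_warnings(overcommit_memory, free_disk_bytes, used_memory_bytes):
    """Return warnings for conditions that commonly break background saves."""
    warnings = []
    if overcommit_memory != 1:
        warnings.append("vm.overcommit_memory is not 1: fork() may fail under memory pressure")
    # Conservative assumption: require room for a snapshot as large as the dataset.
    if free_disk_bytes < used_memory_bytes:
        warnings.append("free disk space is smaller than the dataset in memory")
    return warnings


GIB = 1024 ** 3
# Illustrative numbers: overcommit disabled, 2 GiB free on disk, 3 GiB dataset.
for warning in bgsave_warnings(0, 2 * GIB, 3 * GIB):
    print(warning)
```

On a real host the inputs could come from reading /proc/sys/vm/overcommit_memory, from shutil.disk_usage() on the RDB directory, and from the used_memory field of info memory.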
Failure Pattern 6: Sentinel Failover Not Working
Sentinel is installed but automatic failover doesn’t trigger on master failure.
# Check Sentinel status
redis-cli -p 26379 sentinel masters
redis-cli -p 26379 sentinel slaves mymaster
# Check quorum (requires majority of Sentinels)
redis-cli -p 26379 sentinel ckquorum mymaster
Root Causes:
- Too few Sentinels (for example 2 nodes): losing one leaves no majority to authorize the failover
- Firewall blocking communication between Sentinels
- down-after-milliseconds set too high
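The majority requirement is easy to work out by hand, and doing so shows why two Sentinels are not enough. A failover has to be authorized by more than half of all known Sentinels, so the sketch below computes that majority and how many Sentinel failures each deployment size can absorb. The function names are illustrative only.

```python
def majority(sentinel_count):
    """Smallest number of Sentinels that is more than half of the total."""
    return sentinel_count // 2 + 1


def tolerated_failures(sentinel_count):
    """How many Sentinels can be lost while a failover can still be authorized."""
    return sentinel_count - majority(sentinel_count)


for n in (2, 3, 4, 5):
    print(f"{n} sentinels: majority={majority(n)}, tolerated failures={tolerated_failures(n)}")
```

With 2 Sentinels the tolerated failure count is 0, so losing the host that runs both the master and one Sentinel blocks failover entirely. Going from 3 to 4 adds no extra tolerance, which is why odd counts are the usual choice.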
Failure Pattern 7: Key Expiry Storm
A large number of keys expire at the same moment, causing a CPU spike.
# Total keys expired since startup (sample twice and take the difference for a per-second rate)
redis-cli info stats | grep expired_keys
Root Cause: Keys created at the same time share the same TTL.
Fix: Add random jitter to TTLs.
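import random
import redis  # redis-py client library

r = redis.Redis()  # assumes a local Redis on the default port
BASE_TTL = 3600  # 1 hour
jitter = random.randint(-300, 300)  # ±5 minutes
# "user:42" and "cached-value" are placeholder key and value
r.setex("user:42", BASE_TTL + jitter, "cached-value")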
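A quick simulation shows how much the jitter flattens the spike. The sketch below needs no Redis server. It assumes 10,000 keys written at the same instant and counts how many would expire within the same second, with and without the ±5 minute jitter. The key count and the helper name peak_expirations are arbitrary choices for illustration.

```python
import random
from collections import Counter

BASE_TTL = 3600  # 1 hour
JITTER = 300     # ±5 minutes
KEY_COUNT = 10_000


def peak_expirations(ttls):
    """Largest number of keys that share the same expiry second."""
    return max(Counter(ttls).values())


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
fixed = [BASE_TTL] * KEY_COUNT
jittered = [BASE_TTL + rng.randint(-JITTER, JITTER) for _ in range(KEY_COUNT)]

print("peak without jitter:", peak_expirations(fixed))
print("peak with jitter:", peak_expirations(jittered))
```

Without jitter every key lands on the same second. With jitter the expirations spread over a window of about ten minutes, so the per-second peak drops by roughly two orders of magnitude in this setup.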
Essential Monitoring Dashboard Metrics
| Metric | Normal Range | Alert Threshold |
|---|---|---|
| connected_clients | Below 80% of maxclients | Above 90% |
| used_memory | Below 70% of maxmemory | Above 85% |
| keyspace_hits / (keyspace_hits + keyspace_misses) | Above 85% | Below 70% |
| instantaneous_ops_per_sec | Based on peak | Abnormal spike |
| rdb_last_bgsave_status | ok | err |
| master_link_status | up | down |
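The alert thresholds in the table translate directly into a small check. The sketch below assumes the caller has already collected the metrics into a dict, with maxclients and maxmemory taken from the server configuration. The function name check_metrics and the alert labels are made up for illustration. instantaneous_ops_per_sec is left out because "abnormal spike" needs a baseline that this example does not have.

```python
def check_metrics(m):
    """Return the names of metrics that cross the alert thresholds in the table."""
    alerts = []
    if m["connected_clients"] > 0.90 * m["maxclients"]:
        alerts.append("connected_clients")
    # maxmemory == 0 means "no limit", so a percentage cannot be computed.
    if m["maxmemory"] > 0 and m["used_memory"] > 0.85 * m["maxmemory"]:
        alerts.append("used_memory")
    lookups = m["keyspace_hits"] + m["keyspace_misses"]
    if lookups > 0 and m["keyspace_hits"] / lookups < 0.70:
        alerts.append("hit_ratio")
    if m["rdb_last_bgsave_status"] != "ok":
        alerts.append("rdb_last_bgsave_status")
    # master_link_status only exists on replicas; a missing field counts as healthy.
    if m.get("master_link_status", "up") != "up":
        alerts.append("master_link_status")
    return alerts


sample = {
    "connected_clients": 9500, "maxclients": 10000,
    "used_memory": 3 * 1024**3, "maxmemory": 4 * 1024**3,
    "keyspace_hits": 600, "keyspace_misses": 400,
    "rdb_last_bgsave_status": "ok",
    "master_link_status": "down",
}
print(check_metrics(sample))
```

For the sample values the check flags the client count, the hit ratio, and the broken replication link, while memory at 75% of maxmemory stays below the alert line.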