Redis Failure Analysis — 7 Real-World Failure Patterns in Production

Failure Pattern 1: OOM (Out of Memory)

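OOM command not allowed when used memory > 'maxmemory'.

This is the error Redis returns when it hits maxmemory and refuses writes. Memory pressure can also surface indirectly as a failed background save:

MISCONF Redis is configured to save RDB snapshots,
but it is currently not able to persist on disk.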

# Check current memory state
redis-cli info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"

# Find large keys
redis-cli --bigkeys

# Analyze memory usage of a specific key
redis-cli memory usage <key>

Root Causes:

No maxmemory set (unbounded growth)
maxmemory-policy noeviction combined with a write spike
Memory fragmentation (mem_fragmentation_ratio > 1.5)

Fix:

maxmemory 4gb
# Eviction policy for cache use cases (redis.conf comments must be on their own line)
maxmemory-policy allkeys-lru

# Active defragmentation (Redis 4.0+, requires a jemalloc build)
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
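
The root causes above can be checked programmatically. The following is a minimal sketch, assuming the INFO memory fields have already been loaded into a dict (for example by a client library). The function name `assess_memory` is hypothetical, and the 85% usage threshold is borrowed from the monitoring table in this article.

```python
def assess_memory(info):
    """Return warnings for a dict of INFO memory fields."""
    warnings = []
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", 0))  # 0 means no limit configured
    frag = float(info.get("mem_fragmentation_ratio", 1.0))

    if limit == 0:
        warnings.append("maxmemory is not set (unbounded growth)")
    elif used / limit > 0.85:
        warnings.append(f"memory usage at {used / limit:.0%} of maxmemory")
    if frag > 1.5:
        warnings.append(f"fragmentation ratio {frag:.2f} exceeds 1.5")
    return warnings


# Sample values for illustration only, not real measurements.
sample = {
    "used_memory": 3_800_000_000,
    "maxmemory": 4_294_967_296,
    "mem_fragmentation_ratio": 1.62,
}
for warning in assess_memory(sample):
    print(warning)
```

Running a check like this on a schedule catches a missing maxmemory setting before the first write spike does.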

Failure Pattern 2: Connection Exhaustion

ERR max number of clients reached

# Current connection count
redis-cli info clients | grep connected_clients

# Connection details (which hosts are connected); addr is the second field
redis-cli client list | awk '{print $2}' | sed -E 's/^addr=//; s/:[0-9]+$//' | sort | uniq -c | sort -rn

Root Causes:

Application leaks connections (no connection pool, or connections never returned to it)
maxclients left at the default (10000) while the OS file descriptor limit is lower, so Redis lowers the effective limit at startup

Fix:

maxclients 20000

# Check OS file descriptor limit
ulimit -n
# Set in Redis systemd service
LimitNOFILE=65535
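
To find which application host is leaking connections, the CLIENT LIST output can be grouped by client address. This is a rough sketch that parses the raw text. The function name `clients_per_host` and the sample addresses are made up for illustration.

```python
from collections import Counter


def clients_per_host(client_list_text):
    """Count connections per client host from CLIENT LIST output."""
    counts = Counter()
    for line in client_list_text.splitlines():
        # Each line is a series of space-separated key=value fields.
        fields = dict(
            part.split("=", 1) for part in line.split() if "=" in part
        )
        addr = fields.get("addr")
        if addr:
            host = addr.rsplit(":", 1)[0]  # drop the trailing port
            counts[host] += 1
    return counts.most_common()


# Illustrative sample, trimmed to a few fields per client.
sample = """\
id=3 addr=10.0.0.5:52555 fd=8 name= age=855 idle=0
id=4 addr=10.0.0.5:52556 fd=9 name= age=12 idle=3
id=5 addr=10.0.0.9:41000 fd=10 name= age=40 idle=40
"""
print(clients_per_host(sample))
```

A host whose count keeps climbing between samples is a likely candidate for a missing or misconfigured connection pool.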

Failure Pattern 3: Slow Commands (Blocked Clients)

# Check slow log (default threshold: 10ms)
redis-cli slowlog get 20
redis-cli slowlog len

# Monitor currently running commands
redis-cli monitor  # Warning: has performance impact in production

Dangerous commands:

KEYS * → replace with SCAN
SMEMBERS (large Set) → use SSCAN in chunks
LRANGE key 0 -1 (full range) → limit the range

# Use SCAN instead of KEYS * (repeat with the returned cursor until it comes back as 0)
redis-cli scan 0 match "user:*" count 100
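
A single SCAN call returns only one batch plus a cursor, so application code has to loop. Here is a sketch of that loop, assuming a client whose `scan` method returns a `(cursor, keys)` pair as redis-py's does. `FakeClient` is a hypothetical in-memory stand-in so the example runs without a server.

```python
def scan_keys(client, pattern, count=100):
    """Yield matching keys by following the SCAN cursor until it returns 0."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:  # a zero cursor marks the end of the iteration
            break


class FakeClient:
    """Hypothetical stand-in that pages through a sorted key list."""

    def __init__(self, keys):
        self._keys = sorted(keys)

    def scan(self, cursor=0, match=None, count=10):
        prefix = match.rstrip("*")  # only prefix patterns are supported here
        batch = self._keys[cursor:cursor + count]
        next_cursor = cursor + count
        if next_cursor >= len(self._keys):
            next_cursor = 0
        return next_cursor, [k for k in batch if k.startswith(prefix)]


client = FakeClient(["user:1", "user:2", "order:7", "user:3"])
print(list(scan_keys(client, "user:*", count=2)))
```

With a real redis-py client, the built-in `scan_iter` helper wraps the same loop. Keys come back as bytes unless the client was created with `decode_responses=True`.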

Failure Pattern 4: Replication Lag

The replica cannot keep up with the master.

redis-cli info replication | grep -E "master_link_status|master_repl_offset|slave_repl_offset"

# Check replication backlog size
redis-cli info replication | grep repl_backlog

Root Causes:

Insufficient network bandwidth
CPU/IO bottleneck on the replica server
repl-backlog-size too small (causes Full Resync on reconnect)

Fix:

# Default is 1mb; increase it (redis.conf comments must be on their own line)
repl-backlog-size 256mb
# Send the RDB directly over the socket, with no disk I/O on the master
repl-diskless-sync yes
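
Replication lag can be expressed in bytes by comparing offsets. The sketch below parses INFO replication output taken on the master and subtracts each replica's acknowledged offset from master_repl_offset. The function name `replica_lag_bytes` and the sample values are illustrative.

```python
def replica_lag_bytes(info_text):
    """Map each replica address to how many bytes it trails the master."""
    master_offset = None
    replicas = {}
    for line in info_text.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and "offset=" in value:
            # Example: slave0:ip=...,port=...,state=online,offset=...,lag=...
            fields = dict(item.split("=", 1) for item in value.split(","))
            address = f'{fields["ip"]}:{fields["port"]}'
            replicas[address] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found in INFO output")
    return {addr: master_offset - off for addr, off in replicas.items()}


# Illustrative sample, not captured from a real deployment.
sample = """\
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=1048000,lag=0
slave1:ip=10.0.0.3,port=6379,state=online,offset=524288,lag=3
master_repl_offset:1048576
"""
print(replica_lag_bytes(sample))
```

If a replica's byte lag approaches repl-backlog-size, a disconnect at that moment would likely force a full resync.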

Failure Pattern 5: RDB Save Failure

Background save failed

redis-cli info persistence | grep rdb_last_bgsave_status
# rdb_last_bgsave_status:err

Root Causes:

Insufficient disk space
fork() failure (out of memory, or vm.overcommit_memory not set to 1)

Fix:

# Allow Linux memory overcommit
echo 1 > /proc/sys/vm/overcommit_memory
# Persist in /etc/sysctl.conf
vm.overcommit_memory = 1
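
Both root causes can be checked before a save fails. This is a minimal sketch with a hypothetical `bgsave_risks` helper. Comparing free disk space against the in-memory dataset size is a conservative assumption made here, since the RDB file is often smaller than the dataset in memory.

```python
def bgsave_risks(overcommit_value, free_disk_bytes, used_memory_bytes):
    """List conditions likely to make a background save fail."""
    risks = []
    if overcommit_value.strip() != "1":
        risks.append("vm.overcommit_memory is not 1, so fork() may be refused")
    # Assumption: use the in-memory dataset size as a conservative upper
    # bound for the snapshot size; real RDB files are often smaller.
    if free_disk_bytes < used_memory_bytes:
        risks.append("free disk space is below the in-memory dataset size")
    return risks


# Sample values; on a Linux host the first argument could be read from
# /proc/sys/vm/overcommit_memory and the second from shutil.disk_usage().
for risk in bgsave_risks("0\n", free_disk_bytes=2 * 1024**3,
                         used_memory_bytes=3 * 1024**3):
    print(risk)
```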

Failure Pattern 6: Sentinel Failover Not Working

Sentinel is installed but automatic failover doesn’t trigger on master failure.

# Check Sentinel status
redis-cli -p 26379 sentinel masters
redis-cli -p 26379 sentinel slaves mymaster

# Check quorum (requires majority of Sentinels)
redis-cli -p 26379 sentinel ckquorum mymaster

Root Causes:

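Only 2 Sentinels: if one is lost along with the master, the survivor cannot reach the majority needed to authorize a failover (run at least 3, on separate hosts)
Firewall blocking communication between Sentinels (port 26379)
down-after-milliseconds set too high, so the master is not marked down in time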
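
The arithmetic behind the two-Sentinel problem is worth spelling out. Sentinel needs the configured quorum to agree that the master is down, and a majority of all known Sentinels to authorize the failover. The sketch below encodes both conditions; `can_fail_over` is a hypothetical helper, not part of any Redis API.

```python
def can_fail_over(total_sentinels, reachable_sentinels, quorum):
    """Return True if the reachable Sentinels can detect and authorize a failover."""
    majority = total_sentinels // 2 + 1
    detects_failure = reachable_sentinels >= quorum
    can_authorize = reachable_sentinels >= majority
    return detects_failure and can_authorize


# Two Sentinels, one lost together with the master: no majority is possible.
print(can_fail_over(total_sentinels=2, reachable_sentinels=1, quorum=1))
# Three Sentinels, one lost: the remaining two still form a majority.
print(can_fail_over(total_sentinels=3, reachable_sentinels=2, quorum=2))
```

This is why lowering the quorum to 1 does not rescue a two-node deployment: the majority requirement still applies.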

Failure Pattern 7: Key Expiry Storm

A large number of keys expire at the same moment, causing a CPU spike.

# Total keys expired since startup (cumulative; sample twice to get a per-second rate)
redis-cli info stats | grep expired_keys
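
Because expired_keys is a cumulative counter, a per-second rate needs two samples taken a known interval apart. A small sketch follows, with a hypothetical `expiry_rate` helper that also tolerates a counter reset after a restart.

```python
def expiry_rate(count_before, count_after, interval_seconds):
    """Return keys expired per second between two expired_keys samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if count_after < count_before:
        # The counter restarted from zero (for example after a server restart).
        return count_after / interval_seconds
    return (count_after - count_before) / interval_seconds


# Illustrative samples taken 60 seconds apart.
print(expiry_rate(1_000, 1_600, 60))
```

A rate that jumps by orders of magnitude at regular intervals suggests keys created in the same batch with identical TTLs.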

Root Cause: Keys created at the same time share the same TTL.

Fix: Add random jitter to TTLs.

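import random
import redis  # redis-py client

r = redis.Redis()  # assumes a local server on the default port
BASE_TTL = 3600  # 1 hour

def set_with_jitter(key, value):
    jitter = random.randint(-300, 300)  # ±5 minutes
    r.setex(key, BASE_TTL + jitter, value)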

Essential Monitoring Dashboard Metrics

Metric	Normal Range	Alert Threshold
`connected_clients`	Below 80% of maxclients	Above 90% of maxclients
`used_memory`	Below 70% of maxmemory	Above 85% of maxmemory
`keyspace_hits / (keyspace_hits + keyspace_misses)`	Above 85%	Below 70%
`instantaneous_ops_per_sec`	At or below the usual peak	Abnormal spike
`rdb_last_bgsave_status`	ok	err
`master_link_status`	up	down
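
The alert thresholds in the table can be turned into a simple check. The sketch below assumes the INFO fields are already available as a dict and that maxclients is passed in separately. `evaluate_alerts` is a hypothetical name, and instantaneous_ops_per_sec is left out because the table gives no numeric threshold for it.

```python
def evaluate_alerts(info, maxclients):
    """Return the names of metrics that cross the alert thresholds."""
    alerts = []
    if info["connected_clients"] > 0.90 * maxclients:
        alerts.append("connected_clients")

    maxmemory = info.get("maxmemory", 0)
    if maxmemory and info["used_memory"] > 0.85 * maxmemory:
        alerts.append("used_memory")

    hits = info.get("keyspace_hits", 0)
    misses = info.get("keyspace_misses", 0)
    if hits + misses and hits / (hits + misses) < 0.70:
        alerts.append("hit_ratio")

    if info.get("rdb_last_bgsave_status", "ok") != "ok":
        alerts.append("rdb_last_bgsave_status")
    # master_link_status only exists on replicas, so default to "up".
    if info.get("master_link_status", "up") != "up":
        alerts.append("master_link_status")
    return alerts


# Sample values for illustration only.
sample = {
    "connected_clients": 9_500,
    "used_memory": 2_000_000_000,
    "maxmemory": 4_000_000_000,
    "keyspace_hits": 600,
    "keyspace_misses": 400,
    "rdb_last_bgsave_status": "ok",
}
print(evaluate_alerts(sample, maxclients=10_000))
```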
