TestForge Blog

Redis Failure Analysis — 7 Real-World Failure Patterns in Production

Failure patterns you actually encounter when running Redis in production, and how to diagnose them. Case-by-case solutions for OOM, connection exhaustion, blocked clients, replication lag, and more.

TestForge Team

Failure Pattern 1: OOM (Out of Memory)

MISCONF Redis is configured to save RDB snapshots,
but it is currently not able to persist on disk.

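MISCONF is raised when a background save fails, which under memory pressure often means fork() could not allocate memory. The other symptom is Redis simply hitting maxmemory and refusing writes:

OOM command not allowed when used memory > 'maxmemory'.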

# Check current memory state
redis-cli info memory | grep -E "used_memory_human|maxmemory_human|mem_fragmentation_ratio"

# Find large keys
redis-cli --bigkeys

# Analyze memory usage of a specific key
redis-cli memory usage <key>

Root Causes:

  • No maxmemory set (unbounded growth)
  • maxmemory-policy: noeviction + write spike
  • Memory fragmentation (fragmentation_ratio > 1.5)
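
To make the three root causes concrete, here is a minimal Python sketch that checks for them in the text returned by `redis-cli info memory`. The helper names (`parse_info`, `memory_findings`) and the 90% "close to maxmemory" cutoff are assumptions for illustration; only the INFO field names come from Redis.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse the `key:value` lines of INFO output into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, _, value = line.partition(":")
        info[key] = value
    return info


def memory_findings(info: dict[str, str], frag_limit: float = 1.5) -> list[str]:
    """Return human-readable warnings for the root causes listed above."""
    findings = []
    maxmemory = int(info.get("maxmemory", 0))
    used = int(info.get("used_memory", 0))
    if maxmemory == 0:
        findings.append("maxmemory is 0 (unbounded growth)")
    elif info.get("maxmemory_policy") == "noeviction" and used >= 0.9 * maxmemory:
        # 0.9 is an illustrative cutoff, not a Redis constant
        findings.append("noeviction policy and used_memory is within 10% of maxmemory")
    if float(info.get("mem_fragmentation_ratio", 0)) > frag_limit:
        findings.append(f"mem_fragmentation_ratio above {frag_limit}")
    return findings


# Hypothetical sample of `redis-cli info memory` output
sample = """# Memory
used_memory:3900000000
maxmemory:4000000000
maxmemory_policy:noeviction
mem_fragmentation_ratio:1.72
"""
for finding in memory_findings(parse_info(sample)):
    print(finding)
```

In practice you would feed the function the real INFO output instead of the sample string.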

Fix:

maxmemory 4gb
# For cache use cases (keep comments on their own line in redis.conf)
maxmemory-policy allkeys-lru

# Defragmentation recovery (Redis 4.0+)
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10

Failure Pattern 2: Connection Exhaustion

ERR max number of clients reached

# Current connection count
redis-cli info clients | grep connected_clients

# Connections per client host (addr= is the second field of each CLIENT LIST line)
redis-cli client list | awk '{print $2}' | sed -E 's/^addr=//; s/:[0-9]+$//' | sort | uniq -c | sort -rn

Root Causes:

  • Application not returning connections (no connection pool)
  • maxclients default (10000) but OS file descriptor limit is lower
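
A leaking application usually shows up as one host holding far more connections than the rest. As a rough sketch, the same per-host count can be done in Python on the raw CLIENT LIST text. The function name and the sample lines are made up for illustration; real output carries more fields per line.

```python
from collections import Counter


def clients_per_host(client_list: str) -> Counter:
    """Count connections per client IP from raw CLIENT LIST output."""
    counts = Counter()
    for line in client_list.splitlines():
        # Each line is a run of space-separated key=value fields.
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        addr = fields.get("addr")
        if addr:
            host, _, _port = addr.rpartition(":")  # rpartition keeps IPv6 hosts intact
            counts[host] += 1
    return counts


# Hypothetical, shortened CLIENT LIST output
sample = (
    "id=3 addr=10.0.0.5:52555 fd=8 name= age=855 idle=0 cmd=client\n"
    "id=4 addr=10.0.0.5:52556 fd=9 name= age=12 idle=3 cmd=get\n"
    "id=5 addr=10.0.0.9:40112 fd=10 name= age=40 idle=40 cmd=ping\n"
)
print(clients_per_host(sample).most_common())
```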

Fix:

# redis.conf
maxclients 20000

# Check the OS file descriptor limit for the current shell
ulimit -n

# Raise it for Redis in the [Service] section of the systemd unit
LimitNOFILE=65535
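
Raising maxclients alone is not enough. If the process cannot open that many files, Redis lowers the client limit to the open-files limit minus 32, because it reserves a few descriptors for internal use. The arithmetic, as a small sketch (the function name is hypothetical):

```python
RESERVED_FDS = 32  # descriptors Redis keeps for its own use


def effective_maxclients(maxclients: int, nofile_limit: int) -> int:
    """Client limit Redis can actually honor under a given open-files limit."""
    return max(0, min(maxclients, nofile_limit - RESERVED_FDS))


print(effective_maxclients(20000, 1024))   # with a 1024 soft limit, far fewer clients fit
print(effective_maxclients(20000, 65535))  # with LimitNOFILE=65535 the configured value holds
```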

Failure Pattern 3: Slow Commands (Blocked Clients)

# Check slow log (default threshold: 10ms)
redis-cli slowlog get 20
redis-cli slowlog len

# Monitor currently running commands
redis-cli monitor  # Warning: has performance impact in production

Dangerous commands:

  • KEYS * → replace with SCAN
  • SMEMBERS (large Set) → use SSCAN in chunks
  • LRANGE <key> 0 -1 (full range on a large List) → limit the range

# Use SCAN instead of KEYS *
redis-cli scan 0 match "user:*" count 100
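
A single SCAN call returns only one page plus a cursor, so application code has to loop until the cursor comes back as 0. Below is a minimal sketch of that loop. It assumes a client object with a redis-py style `scan(cursor=..., match=..., count=...)` method. `FakeClient` is a stand-in invented here so the example runs without a server.

```python
import fnmatch


def scan_keys(client, pattern: str, count: int = 100):
    """Yield keys matching `pattern`, one SCAN page at a time."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        yield from keys
        if cursor == 0:  # the server returns cursor 0 when the iteration is complete
            break


class FakeClient:
    """Stand-in for a real client: pages through an in-memory key list."""

    def __init__(self, keys, page_size=2):
        self._keys = sorted(keys)
        self._page = page_size

    def scan(self, cursor=0, match=None, count=None):
        page = self._keys[cursor:cursor + self._page]
        nxt = cursor + self._page
        next_cursor = 0 if nxt >= len(self._keys) else nxt
        return next_cursor, [k for k in page if fnmatch.fnmatch(k, match)]


fake = FakeClient(["user:1", "user:2", "order:9", "user:3"])
print(list(scan_keys(fake, "user:*")))
```

With redis-py you can pass a real `redis.Redis` instance, or use its built-in `scan_iter`, which wraps the same loop. Note that SCAN may return the same key more than once, so deduplicate if that matters.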

Failure Pattern 4: Replication Lag

The replica cannot keep up with the master.

# Run on the replica; compare slave_repl_offset with master_repl_offset reported by the master
redis-cli info replication | grep -E "master_link_status|master_repl_offset|slave_repl_offset"

# Check replication backlog size
redis-cli info replication | grep repl_backlog

Root Causes:

  • Insufficient network bandwidth
  • CPU/IO bottleneck on the replica server
  • repl-backlog-size too small (causes Full Resync on reconnect)
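
Lag is easiest to read as a byte distance: the master's `master_repl_offset` minus the offset each replica has acknowledged, which the master reports on its `slaveN:` lines. Here is a sketch that extracts it from `info replication` run on the master. The function name and sample values are illustrative.

```python
import re


def replica_lag_bytes(info_replication: str) -> dict[str, int]:
    """Map each replica (ip:port) to how many bytes it trails the master."""
    master_offset = None
    replica_offsets = {}
    for line in info_replication.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "master_repl_offset":
            master_offset = int(value)
        elif re.fullmatch(r"slave\d+", key):
            # Value looks like: ip=10.0.0.7,port=6379,state=online,offset=1048000,lag=1
            fields = dict(f.split("=", 1) for f in value.split(","))
            replica_offsets[f"{fields['ip']}:{fields['port']}"] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found; run this against the master")
    return {addr: master_offset - off for addr, off in replica_offsets.items()}


# Hypothetical output from the master
sample = """# Replication
role:master
connected_slaves:1
slave0:ip=10.0.0.7,port=6379,state=online,offset=1048000,lag=1
master_repl_offset:1050000
"""
print(replica_lag_bytes(sample))
```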

Fix:

# Default is 1mb; increase it
repl-backlog-size 256mb
# Send the RDB directly over the socket, no disk I/O on the master
repl-diskless-sync yes
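
How large is large enough? A common rule of thumb, not an official formula, is to size the backlog so it covers the bytes written during the longest disconnect you want to survive without a full resync, plus some headroom. A sketch under that assumption, where the function name and the 1.5x headroom factor are my own choices:

```python
import math


def suggested_backlog_mb(write_bytes_per_sec: float,
                         max_disconnect_sec: float,
                         headroom: float = 1.5) -> int:
    """Backlog size in MB covering a disconnect of the given length."""
    needed = write_bytes_per_sec * max_disconnect_sec * headroom
    return max(1, math.ceil(needed / (1024 * 1024)))


# Example: 2 MB/s of replication traffic, replicas may drop for up to 60 seconds
print(f"repl-backlog-size {suggested_backlog_mb(2 * 1024 * 1024, 60)}mb")
```

The write rate can be estimated by sampling `master_repl_offset` twice and dividing the difference by the interval.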

Failure Pattern 5: RDB Save Failure

Background save failed

redis-cli info persistence | grep rdb_last_bgsave_status
# rdb_last_bgsave_status:err

Root Causes:

  • Insufficient disk space
  • fork() failure (out of memory, vm.overcommit_memory not set)
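
Both causes can be checked before the next save fails. The sketch below takes the numbers as plain arguments so it stays testable. The function name is hypothetical, and comparing free disk against the full dataset size is a deliberately conservative assumption, since RDB files are often smaller than the in-memory dataset.

```python
def bgsave_risks(free_disk_bytes: int, dataset_bytes: int, overcommit_memory: int) -> list[str]:
    """List the conditions from above that make the next BGSAVE likely to fail."""
    risks = []
    if free_disk_bytes < dataset_bytes:
        risks.append("free disk space is smaller than the dataset; the RDB file may not fit")
    if overcommit_memory != 1:
        risks.append("vm.overcommit_memory is not 1; fork() can fail under memory pressure")
    return risks


# Hypothetical host: 2 GiB free, 6 GiB dataset, overcommit left at 0
for risk in bgsave_risks(free_disk_bytes=2 * 1024**3,
                         dataset_bytes=6 * 1024**3,
                         overcommit_memory=0):
    print(risk)
```

On a Linux host the inputs could come from `shutil.disk_usage(path).free`, the `used_memory` field of INFO, and the contents of /proc/sys/vm/overcommit_memory.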

Fix:

# Allow Linux memory overcommit (run as root; takes effect immediately)
echo 1 > /proc/sys/vm/overcommit_memory
# Persist across reboots by adding this line to /etc/sysctl.conf
vm.overcommit_memory = 1

Failure Pattern 6: Sentinel Failover Not Working

Sentinel is installed but automatic failover doesn’t trigger on master failure.

# Check Sentinel status
redis-cli -p 26379 sentinel masters
redis-cli -p 26379 sentinel slaves mymaster

# Check quorum (requires majority of Sentinels)
redis-cli -p 26379 sentinel ckquorum mymaster

Root Causes:

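  • Only 2 Sentinels: if either one is down or partitioned, the survivor cannot reach the majority needed to authorize a failover (run an odd number, at least 3)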
  • Firewall blocking communication between Sentinels
  • down-after-milliseconds set too high
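
The majority requirement is simple arithmetic, and working it through shows why two Sentinels are not enough. A failover must be authorized by more than half of all known Sentinels, whatever the configured quorum is. A small worked example, with function names that are mine:

```python
def sentinel_majority(total: int) -> int:
    """Votes needed to authorize a failover: more than half of all Sentinels."""
    return total // 2 + 1


def tolerated_failures(total: int) -> int:
    """How many Sentinels can be lost while a majority is still reachable."""
    return total - sentinel_majority(total)


for n in (2, 3, 4, 5):
    print(f"{n} Sentinels: majority={sentinel_majority(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Two Sentinels tolerate zero failures, and four tolerate no more than three do, which is why odd counts are the usual recommendation.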

Failure Pattern 7: Key Expiry Storm

A large number of keys expire at the same time, causing a CPU spike.

# Cumulative expired-key counter; sample it twice and divide the difference by the interval to get keys expired per second
redis-cli info stats | grep expired_keys
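
Because `expired_keys` is a running total since startup, the per-second rate comes from two readings. A tiny sketch, where the function name and sample numbers are hypothetical:

```python
def expired_per_second(first_sample: int, second_sample: int, interval_sec: float) -> float:
    """Expiry rate between two readings of the cumulative expired_keys counter."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    return (second_sample - first_sample) / interval_sec


# Two hypothetical readings taken 10 seconds apart
print(expired_per_second(1_204_300, 1_254_300, 10))
```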

Root Cause: Keys created at the same time share the same TTL.

Fix: Add random jitter to TTLs.

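import random

import redis  # redis-py

r = redis.Redis()  # assumes a local server on the default port

BASE_TTL = 3600  # 1 hour
jitter = random.randint(-300, 300)  # ±5 minutes
r.setex(key, BASE_TTL + jitter, value)  # key and value are supplied by your application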
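
To see why jitter helps, the following simulation, which needs no Redis server, compares the worst single second of expirations for a batch of keys written at the same instant, with and without the ±5 minute jitter. The batch size and the fixed seed are arbitrary choices for the illustration.

```python
import random
from collections import Counter

BASE_TTL = 3600   # 1 hour, as in the snippet above
JITTER = 300      # ±5 minutes
N_KEYS = 10_000   # hypothetical batch written at the same instant


def peak_expiries_per_second(ttls):
    """Largest number of keys that expire within the same second."""
    return max(Counter(ttls).values())


rng = random.Random(42)  # fixed seed so the run is repeatable
fixed = [BASE_TTL] * N_KEYS
jittered = [BASE_TTL + rng.randint(-JITTER, JITTER) for _ in range(N_KEYS)]

print("peak without jitter:", peak_expiries_per_second(fixed))
print("peak with jitter:   ", peak_expiries_per_second(jittered))
```

Without jitter, all 10,000 keys land in the same second. With jitter they spread over a 601-second window, so the peak drops by orders of magnitude.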

Essential Monitoring Dashboard Metrics

Metric | Normal Range | Alert Threshold
connected_clients | Below 80% of maxclients | Above 90%
used_memory | Below 70% of maxmemory | Above 85%
keyspace_hits / (hits + misses) | Above 85% | Below 70%
instantaneous_ops_per_sec | Based on peak | Abnormal spike
rdb_last_bgsave_status | ok | err
master_link_status | up | down
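
The thresholds translate directly into a check you can run from a cron job or an exporter. This is a sketch only. It assumes the metrics have already been collected into one dict of numbers and strings (for example from INFO plus `CONFIG GET maxclients`), and the function name is hypothetical. `instantaneous_ops_per_sec` is left out because "abnormal" depends on your own baseline.

```python
def dashboard_alerts(m: dict) -> list[str]:
    """Compare merged Redis metrics against the alert thresholds in the table."""
    alerts = []
    if m["connected_clients"] > 0.90 * m["maxclients"]:
        alerts.append("connected_clients above 90% of maxclients")
    if m["maxmemory"] > 0 and m["used_memory"] > 0.85 * m["maxmemory"]:
        alerts.append("used_memory above 85% of maxmemory")
    lookups = m["keyspace_hits"] + m["keyspace_misses"]
    if lookups > 0 and m["keyspace_hits"] / lookups < 0.70:
        alerts.append("hit ratio below 70%")
    if m["rdb_last_bgsave_status"] != "ok":
        alerts.append("last RDB background save failed")
    # master_link_status is only reported by replicas, so treat it as optional.
    if m.get("master_link_status", "up") != "up":
        alerts.append("replication link to master is down")
    return alerts


# Hypothetical snapshot from a replica
sample = {
    "connected_clients": 9500, "maxclients": 10000,
    "used_memory": 2_000_000_000, "maxmemory": 4_000_000_000,
    "keyspace_hits": 600, "keyspace_misses": 400,
    "rdb_last_bgsave_status": "ok",
    "master_link_status": "down",
}
print(dashboard_alerts(sample))
```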