[Eval][v3][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svc
created: 6/2/2026, 11:42:07 PM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.
Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB
Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query
Logs:
- 08:59:58 UTC - last cache hit logged
- 09:00:00 UTC - first wave of "cache miss + DB query" log lines
- 09:00:02 UTC - DB connection pool hits ceiling, queries start queueing
On-call:
- 09:02 - pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 - diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 - debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede caused catalog-svc p99 latency to spike to 15s and intermittent 503 errors. Pre-warmed Redis keys with 7h TTL expired simultaneously, coinciding with a 12x traffic surge from a marketing email blast. Every cache miss triggered a DB query, saturating Postgres CPU at 100% and exhausting the connection pool. Redis CPU remained healthy but cache miss rate hit 95%.
Severity reasoning: User-facing outage with an 8% 503 rate (above the 1% threshold) sustained for about 5 min, and the revenue path is broken (product detail pages failing). Matches the SEV1 rubric: error rate >1% for >5 min and revenue path broken.
deepseek-chat · prompt v3 · output: en · 14749ms · 2056↑ / 2291↓ tok · $0.00308
Root cause hypotheses
- high: Cache stampede due to simultaneous key expiration at 09:00 UTC
Evidence: Redis cache miss rate for 'catalog:item:*' keys jumped from 2% to 95% at 09:00 UTC. Keys pre-warmed at 02:00 UTC with TTL=7h expired exactly at 09:00 UTC. Logs show the last cache hit at 08:59:58 and the first miss at 09:00:00.
- high: DB connection pool exhaustion from concurrent cache miss queries
Evidence: Postgres CPU at 100% sustained, lock waits climbing, connection pool hit its ceiling at 09:00:02 UTC, and queries started queueing. The 8% 503 rate is consistent with pool exhaustion.
- high: Black Friday traffic surge amplified cache miss impact
Evidence: Traffic ramped to 12x normal at 09:00 UTC due to the marketing email blast. The normal cache hit ratio would have absorbed this, but with a 95% miss rate nearly all traffic hit the DB.
- medium: Missing per-key locking or request coalescing for cache misses
Evidence: No per-key locking; every cache miss triggers a fresh DB query. This design gap allowed the stampede to overwhelm the DB.
- low: DB query performance degradation due to lock contention
Evidence: Climbing lock waits indicate queries are blocking each other. This could stem from heavy read load or inefficient queries, but the primary cause is query volume.
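The interaction between the traffic surge and the miss rate can be made concrete with a short calculation. This is a rough sketch, assuming DB query volume scales linearly with request rate times miss rate and that each miss issues exactly one query; it ignores retries and queueing effects. The function name is illustrative and not part of catalog-svc.

```python
def db_load_multiplier(traffic_multiplier: float,
                       miss_rate: float,
                       baseline_miss_rate: float) -> float:
    """Return how many times more queries reach the DB than at baseline.

    Assumes one DB query per cache miss (no coalescing), matching the
    "no per-key locking" note in the raw incident context.
    """
    if not 0 < baseline_miss_rate <= 1 or not 0 <= miss_rate <= 1:
        raise ValueError("miss rates must be fractions, and baseline must be > 0")
    return traffic_multiplier * miss_rate / baseline_miss_rate


# Figures from the incident: 12x traffic, 95% misses, 2% baseline misses.
multiplier = db_load_multiplier(12, 0.95, 0.02)
print(f"DB query load vs. baseline: ~{multiplier:.0f}x")

# For comparison: the same surge with a healthy cache (2% misses).
surge_only = db_load_multiplier(12, 0.02, 0.02)
print(f"Surge alone with a warm cache: ~{surge_only:.0f}x")
```

Under these assumptions the surge alone would have multiplied DB load by about 12, while the surge combined with the mass expiry multiplies it by several hundred, which is why the two high-confidence hypotheses are best read together.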
Investigation checklist
- Check Redis cache miss rate for catalog keys
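redis-cli -h redis-cache -p 6379 INFO stats | grep -E 'keyspace_(hits|misses)' && redis-cli -h redis-cache -p 6379 --scan --pattern 'catalog:item:*' | head -5
Expected: keyspace_misses growing much faster than keyspace_hits between two samples (the counters are cumulative and instance-wide, not per key pattern); catalog:item:* keys largely missing from the scan.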
- Check Postgres active queries and wait events
psql -h pg-primary -U app -c "SELECT pid, query_start, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;"
Expected: Many queries with wait_event_type 'Lock', 'LWLock', or 'IO' and query_start values well in the past.
- Check DB connection pool utilization
psql -h pg-primary -U app -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';" && psql -h pg-primary -U app -c "SHOW max_connections;"
Expected: Active connections near max_connections (e.g., 100/100).
- Check catalog-svc error logs for 503s
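kubectl logs -n prod -l app=catalog-svc --since=15m --tail=-1 | grep '503' | tail -20
Expected: Lines indicating 'connection refused' or 'timeout' from DB. (--tail=-1 is needed because kubectl returns only the last 10 lines per pod when a label selector is used.)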
- Verify cache TTL configuration
kubectl exec -n prod deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache.*ttl'
Expected: TTL set to 7h for catalog keys.
- Check Redis CPU and memory
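redis-cli -h redis-cache -p 6379 INFO cpu | grep 'used_cpu' && redis-cli -h redis-cache -p 6379 INFO memory | grep 'used_memory_human'
Expected: used_cpu_sys and used_cpu_user growing slowly between two samples (they are cumulative CPU seconds, not percentages), consistent with the ~25% CPU on dashboards; memory usage normal.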
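Because keyspace_hits and keyspace_misses in INFO stats are cumulative since server start, a single reading says little about the current miss rate. The sketch below computes the miss rate over an interval from two captured INFO outputs. The sample values are made up for illustration, the helper names are hypothetical, and the result is instance-wide rather than specific to catalog:item:* keys.

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Parse 'key:value' lines from redis-cli INFO output, skipping '# Section' headers."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def interval_miss_rate(before: str, after: str) -> float:
    """Return misses / (hits + misses) for the interval between two INFO samples."""
    first, second = parse_info(before), parse_info(after)
    hits = int(second["keyspace_hits"]) - int(first["keyspace_hits"])
    misses = int(second["keyspace_misses"]) - int(first["keyspace_misses"])
    total = hits + misses
    return misses / total if total else 0.0


# Made-up samples, e.g. captured a few seconds apart with `redis-cli INFO stats`.
sample_1 = "# Stats\r\nkeyspace_hits:1000000\r\nkeyspace_misses:20000\r\n"
sample_2 = "# Stats\r\nkeyspace_hits:1000500\r\nkeyspace_misses:29500\r\n"

rate = interval_miss_rate(sample_1, sample_2)
print(f"miss rate over interval: {rate:.0%}")
```

In these samples the lifetime ratio still looks healthy (roughly 3% misses overall), while the interval ratio exposes the stampede, which is why the delta matters more than the absolute counters.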
Mitigation plan
Shed load by temporarily blocking non-critical traffic (e.g., marketing redirects) at the load balancer or via feature flag.
Risk: May affect marketing campaigns but preserves core catalog functionality.
Rollback: Remove the block rule or disable feature flag.
Manually warm the cache by running a script to re-populate catalog keys from DB, but with rate limiting to avoid further DB overload.
Risk: If rate limit is too high, DB may become more saturated. Safer to run with low concurrency.
Rollback: Stop the script; keys will expire naturally.
Increase DB connection pool size temporarily (e.g., from 100 to 200) to reduce queuing.
Risk: Postgres is already CPU-bound at 100%, so more connections will likely add contention rather than throughput; monitor closely. Could cause OOM if memory is tight.
Rollback: Revert to original pool size.
Extend cache TTL for repopulated keys (e.g., set TTL to 24h) so they do not expire again during peak traffic.
Risk: Does not help in-flight requests or keys that have already expired; only prevents future stampedes. May cause stale data.
Rollback: Reset TTL to original value.
Enable request coalescing for cache misses (e.g., a Redis lock via SETNX or SET with NX and an expiry, or an in-process mutex) so that concurrent misses on the same key share one DB query.
Risk: Requires code change; not immediate. Could introduce latency if not implemented correctly.
Rollback: Revert code change.
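The request coalescing step above can be sketched as an in-process "single-flight" guard: the first caller to miss on a key runs the DB query, and concurrent callers for the same key wait for that result instead of issuing their own query. This is a minimal sketch, not catalog-svc code. The class and function names are hypothetical, the DB call is simulated with a sleep, and it deduplicates only within one process; coordinating across replicas would need a shared lock in Redis, which is not shown here.

```python
import threading
import time
from typing import Callable, Dict


class _Call:
    """State shared between the leader and the callers waiting on it."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent loads of the same key into one loader call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, loader: Callable[[], object]) -> object:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.result = loader()
            except BaseException as exc:
                call.error = exc  # waiters re-raise the same error
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result


counter = {"db_queries": 0}


def load_item_from_db() -> dict:
    counter["db_queries"] += 1  # runs only in the leader
    time.sleep(0.3)             # stand-in for a slow Postgres query
    return {"id": 42, "name": "example item"}


flight = SingleFlight()
results = []
results_lock = threading.Lock()


def handle_request() -> None:
    item = flight.do("catalog:item:42", load_item_from_db)
    with results_lock:
        results.append(item)


threads = [threading.Thread(target=handle_request) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(results)} requests, {counter['db_queries']} DB query calls")
```

A caller that arrives after the leader has finished starts a new load, so in a real service the leader would also write the result back to Redis before releasing the waiters, letting later requests hit the cache.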
Customer impact
Users are experiencing slow loading and intermittent 503 errors when viewing product detail pages. Approximately 8% of requests are failing, and p99 latency has risen from a baseline of 80ms to 15 seconds. This affects all users during the Black Friday sales event.
Postmortem draft
Postmortem: catalog-svc Cache Stampede
Summary
On Black Friday at 09:00 UTC, a cache stampede pushed catalog-svc p99 latency to 15s and the 503 error rate to 8%. Pre-warmed Redis keys expired simultaneously, coinciding with a 12x traffic surge. DB CPU hit 100% and the connection pool was exhausted.
Timeline (UTC)
- 02:00 — Pre-warm cache job runs, sets TTL=7h
- 08:59:58 — Last cache hit
- 09:00:00 — Cache miss rate jumps to 95%, DB queries flood
- 09:00:02 — DB connection pool exhausted
- 09:02 — Pager alerts received
- 09:03 — Diagnosis: cache stampede
- [FILL IN] — Mitigation actions taken
- [FILL IN] — Incident resolved
Impact
- p99 latency: 15s (baseline 80ms)
- Error rate: 8% 503s
- Affected users: all catalog-svc users during Black Friday
- Revenue impact: [FILL IN]
Root Cause
Simultaneous expiration of pre-warmed cache keys at 09:00 UTC combined with 12x traffic surge caused a cache stampede. No per-key locking or request coalescing allowed all cache misses to hit the database simultaneously, overwhelming Postgres.
Detection
Alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate. Detected within 2 minutes via pager.
Response
- [FILL IN] actions taken
- [FILL IN] time to mitigate
What Went Well
- Alerting worked and paged quickly
- On-call identified the stampede within 1 minute of being paged
What Went Poorly
- No pre-warm key staggering or jitter
- No request coalescing for cache misses
- No load shedding mechanism for cache stampede
Action Items
- [ ] Add jitter to cache TTLs to prevent simultaneous expiry
- [ ] Implement request coalescing (e.g., SETNX) for cache misses
- [ ] Add load shedding at load balancer for cache stampede scenarios
- [ ] Review and improve pre-warm strategy
- [ ] Add monitoring for cache miss rate spikes
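The TTL jitter action item can be illustrated with a small sketch. It assumes a uniform spread of plus or minus 10% around the 7h pre-warm TTL; the 10% figure and the function name are assumptions for illustration, not values from the incident.

```python
import random

BASE_TTL_SECONDS = 7 * 60 * 60  # 7h, the pre-warm TTL from the incident
JITTER_FRACTION = 0.10          # assumed spread: +/-10% of the base TTL


def jittered_ttl(base_seconds: int, jitter_fraction: float, rng: random.Random) -> int:
    """Return base_seconds scaled by a uniform random factor in [1 - f, 1 + f]."""
    if not 0 <= jitter_fraction < 1:
        raise ValueError("jitter_fraction must be in [0, 1)")
    factor = rng.uniform(1 - jitter_fraction, 1 + jitter_fraction)
    return max(1, round(base_seconds * factor))


rng = random.Random(2026)  # seeded only so the demo is repeatable
ttls = [jittered_ttl(BASE_TTL_SECONDS, JITTER_FRACTION, rng) for _ in range(10_000)]
spread_minutes = (max(ttls) - min(ttls)) / 60
print(f"TTL range: {min(ttls)}s to {max(ttls)}s (spread of about {spread_minutes:.0f} min)")
```

Each key would then be written with its own TTL (for example via the EX option of Redis SET), so keys warmed at 02:00 UTC expire over a window of well over an hour rather than in the same second. Jitter spreads the misses out but does not remove them, so it complements request coalescing rather than replacing it.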
Follow-ups
- P0: Add jitter to cache TTLs to prevent simultaneous key expiration (catalog-svc team)
- P0: Implement request coalescing for cache misses using Redis SETNX or similar (catalog-svc team)
- P1: Add load shedding at load balancer for cache stampede scenarios (platform team)
- P1: Review and improve pre-warm strategy (staggered refresh, shorter TTL with background refresh) (catalog-svc team)
- P1: Add monitoring and alert for cache miss rate spikes (observability team)
- P2: Document cache stampede runbook and mitigation steps (on-call SRE)
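For the staggered pre-warm follow-up, and for the rate-limited manual warm in the mitigation plan, a paced warmer could look roughly like the sketch below. It is single-threaded and caps the number of DB loads per second. All names are hypothetical, the loader and cache are in-memory stand-ins, and the sleep and clock functions are injectable only so the demo runs instantly.

```python
import time
from typing import Callable, Iterable


def warm_cache(item_ids: Iterable[int],
               load_item: Callable[[int], object],
               store_item: Callable[[int, object], None],
               max_per_second: float,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> int:
    """Repopulate cache entries one at a time, never faster than max_per_second."""
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    min_interval = 1.0 / max_per_second
    warmed = 0
    for item_id in item_ids:
        started = clock()
        store_item(item_id, load_item(item_id))
        warmed += 1
        # Pause for whatever is left of this item's time slot.
        remaining = min_interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return warmed


# Demo with stand-ins: a dict as the cache, recorded pauses instead of real sleeps.
cache = {}
pauses = []
warmed = warm_cache(
    item_ids=range(1, 6),
    load_item=lambda item_id: {"id": item_id},  # stand-in for the DB query
    store_item=cache.__setitem__,               # stand-in for a Redis SET
    max_per_second=5,
    sleep=pauses.append,
    clock=lambda: 0.0,
)
print(f"warmed {warmed} items with {len(pauses)} pauses")
```

Starting with a low max_per_second and raising it while watching Postgres CPU keeps the warm job from competing with live traffic for the same saturated connection pool.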
Similar past incidents
lexical match (pg_trgm)
- 83%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 63%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 41%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 41%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 30%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts