[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svccreated: 6/2/2026, 11:41:31 PM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC. Symptoms: - catalog-svc p99: 15s (baseline 80ms) - 503 rate: 8% (intermittent during DB overload) - Postgres CPU: 100% sustained, lock waits climbing - Redis CPU: 25% (looks healthy) - Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%) - All product detail page requests are flooding through to DB Background: - We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h - 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously - Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast) - No per-key locking; every cache miss triggers a fresh DB query Logs: - 08:59:58 UTC — last cache hit logged - 09:00:00 UTC — first wave of "cache miss + DB query" log lines - 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing On-call: - 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate) - 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric - 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail cache keys to expire simultaneously, flooding Postgres with 12x normal traffic. DB CPU hit 100%, p99 latency spiked to 15s, and intermittent 503s occurred. Redis cache miss rate jumped from 2% to 95%.
Severity reasoning: User-facing outage: p99 latency 15s (baseline 80ms), 503 error rate 8% > 1% for >5 min, revenue path broken (Black Friday traffic, checkout affected).
deepseek-chat·prompt v2·output: en·10347ms·1805↑ / 1680↓ tok·$0.00234
Root cause hypotheses
- highCache TTL of 7h caused all keys to expire at 09:00 UTC, triggering a stampede under 12x traffic.
Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC.
- highDB connection pool exhaustion due to concurrent queries from cache misses.
Evidence: DB CPU 100%, lock waits climbing, connection pool ceiling hit at 09:00:02 UTC.
- mediumMissing per-key locking or rate limiting on cache miss regeneration.
Evidence: No per-key locking; every cache miss triggers a fresh DB query, confirmed by logs.
- lowBlack Friday traffic spike alone overwhelmed DB without cache stampede.
Evidence: Traffic is 12x normal, but Redis CPU is only 25%, indicating cache miss is the primary cause.
- lowDatabase query performance degradation (e.g., missing index, slow query).
Evidence: No specific slow query identified; DB CPU is 100% but lock waits suggest contention, not slow queries.
Investigation checklist
- Check Redis cache miss rate for catalog keys.
redis-cli -h redis-cache -p 6379 INFO stats | grep -i 'keyspace_misses'Expected: High miss rate (>90%) confirms stampede.
- Check DB active queries and locks.
SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;Expected: Many queries waiting on locks or running concurrently.
- Check catalog-svc error logs for 503s.
kubectl logs -n prod -l app=catalog-svc --since=15m | grep -i '503' | head -20Expected: Intermittent 503 errors due to DB timeouts.
- Verify cache TTL configuration.
kubectl exec -n prod deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache.*ttl'Expected: TTL set to 7h (25200s).
- Check DB connection pool metrics.
kubectl exec -n prod deploy/catalog-svc -- curl localhost:8080/metrics | grep 'db_connections'Expected: Connection pool exhausted (max connections reached).
Mitigation plan
Shed load by temporarily blocking non-critical traffic (e.g., product detail page for non-featured items) via feature flag or rate limiter.
Risk: May degrade user experience for some users; ensure checkout flow is unaffected.
Rollback: Disable the feature flag or increase rate limit back to normal.
Manually warm the cache by running a script that re-queries DB for popular items and sets cache keys with a jittered TTL.
Risk: Additional DB load during warm-up; must be done gradually to avoid further overload.
Rollback: Stop the warm-up script; cache will repopulate naturally over time.
Increase DB connection pool size temporarily (if DB has capacity) to reduce queuing.
Risk: May increase DB CPU further; monitor closely.
Rollback: Revert connection pool size to original value.
Customer impact
Users on Black Friday are experiencing slow product detail page loads (up to 15 seconds) and intermittent errors (503). Checkout may be affected. Estimated impact: all users browsing product pages, ~100% of traffic during peak.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 08:59:58 — Last cache hit
- 09:00:00 — Cache miss wave begins, DB CPU spikes
- 09:00:02 — DB connection pool exhausted
- 09:02 — Pager alerts
- 09:03 — Cache stampede confirmed
- [FILL IN] — Mitigation actions taken
Impact
- p99 latency 15s (baseline 80ms)
- 503 error rate 8%
- DB CPU 100% for [duration]
Root Cause
Cache TTL of 7h caused all keys to expire at 09:00 UTC, coinciding with Black Friday traffic spike (12x). No per-key locking or rate limiting on cache miss regeneration.
Detection
Alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate. Redis cache miss rate metric confirmed stampede.
Response
- [FILL IN] Load shedding, cache warm-up, connection pool increase
What Went Well
- Redis monitoring quickly identified cache miss rate
- On-call team diagnosed stampede within 1 minute
What Went Poorly
- Cache TTL not jittered
- No per-key locking or rate limiting for cache misses
- No pre-warming for Black Friday traffic
Action Items
- [FILL IN] Implement jittered TTL, per-key locking, rate limiting, pre-warming strategy
Follow-ups
- P0Implement jittered TTL for cache keys to prevent simultaneous expiration.— service owner
- P0Add per-key locking (e.g., Redis SETNX) for cache miss regeneration.— service owner
- P1Add rate limiting for cache miss queries to DB.— service owner
- P1Set up pre-warming for high-traffic events (e.g., Black Friday).— platform team
- P2Improve alerting to include cache miss rate threshold.— on-call SRE
- P2Review DB connection pool sizing and add auto-scaling.— platform team
Similar past incidents
lexical match (pg_trgm)
- 100%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 68%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 50%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 49%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 27%
[datadog] checkout p99 8s
checkout p99 8s