[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svccreated: 6/3/2026, 12:08:14 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC. Symptoms: - catalog-svc p99: 15s (baseline 80ms) - 503 rate: 8% (intermittent during DB overload) - Postgres CPU: 100% sustained, lock waits climbing - Redis CPU: 25% (looks healthy) - Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%) - All product detail page requests are flooding through to DB Background: - We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h - 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously - Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast) - No per-key locking; every cache miss triggers a fresh DB query Logs: - 08:59:58 UTC — last cache hit logged - 09:00:00 UTC — first wave of "cache miss + DB query" log lines - 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing On-call: - 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate) - 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric - 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail cache keys to expire simultaneously, flooding Postgres with 12x normal traffic. DB CPU hit 100%, p99 latency spiked to 15s, and intermittent 503s occurred. Redis cache miss rate jumped from 2% to 95%.
Severity reasoning: User-facing outage: p99 latency 15s (baseline 80ms), 503 error rate 8% > 1% for >5 min, revenue path broken (Black Friday traffic). Meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·11826ms·1805↑ / 1529↓ tok·$0.00217
Root cause hypotheses
- highCache TTL of 7h caused all keys to expire at 09:00 UTC, triggering a stampede under 12x traffic.
Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC.
- highDB connection pool exhausted by concurrent queries from cache misses, leading to queueing and latency.
Evidence: DB CPU 100% sustained, lock waits climbing; connection pool hit ceiling at 09:00:02 UTC.
- mediumMissing per-key locking or request coalescing allowed multiple requests to query DB for the same key.
Evidence: No per-key locking in code; every cache miss triggers a fresh DB query.
- lowBlack Friday traffic ramp (12x) exceeded system capacity even with cache hits.
Evidence: Traffic ramp from marketing email blast; baseline cache hit rate would still result in high load.
Investigation checklist
- Check Redis cache miss rate for catalog keys.
redis-cli -h redis-cache -p 6379 INFO stats | grep -i 'keyspace_misses'Expected: High miss rate (>90%) confirms stampede.
- Check Postgres active queries and locks.
SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;Expected: Many queries waiting on locks or running long (>1s).
- Check catalog-svc error rate and latency.
kubectl logs -n prod -l app=catalog-svc --since=10m | grep -E '503|5[0-9][0-9]' | tail -20Expected: High rate of 503 responses.
- Verify cache TTL configuration.
kubectl exec -n prod deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache.*ttl'Expected: TTL=7h for catalog:item keys.
- Check if per-key locking exists in code.
grep -r 'lock.*cache' /path/to/catalog-svc/src/Expected: No locking mechanism found.
Mitigation plan
Immediately increase Redis cache TTL for catalog:item keys to 24h to prevent near-term re-expiry.
Risk: Stale data served for up to 24h; acceptable during Black Friday.
Rollback: Revert TTL to 7h after incident.
Manually warm cache by running a script to re-query popular items from DB and set in Redis with new TTL.
Risk: Additional load on DB; must throttle to avoid worsening stampede.
Rollback: Stop the warm script; DB load will decrease as cache fills naturally.
Enable request coalescing for cache misses (e.g., only one request per key queries DB, others wait).
Risk: Requires code change; cannot deploy quickly. Consider feature flag.
Rollback: Disable feature flag.
Customer impact
Customers browsing product details on Black Friday experience slow page loads (up to 15 seconds) and intermittent errors (503). All users are affected. Estimated impact: 100% of traffic during the incident.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 08:59:58 — Last cache hit
- 09:00:00 — Cache miss wave begins
- 09:00:02 — DB connection pool exhausted
- 09:02 — Pager alerts
- 09:03 — Cache stampede confirmed
- [FILL IN] — Mitigation actions
Impact
- p99 latency 15s (baseline 80ms)
- 503 error rate 8%
- DB CPU 100%
Root Cause
Cache TTL of 7h caused all catalog:item keys to expire at 09:00 UTC, coinciding with 12x Black Friday traffic, leading to a stampede.
Detection
Alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate.
Response
- Increased TTL to 24h
- Manual cache warm (throttled)
- [FILL IN]
What Went Well
- Redis remained healthy
- On-call quickly identified stampede
What Went Poorly
- No per-key locking
- Cache TTL not staggered
- No pre-warming for Black Friday
Action Items
- [FILL IN]
Follow-ups
- P0Implement per-key locking or request coalescing for cache misses.— catalog-svc team
- P1Stagger cache TTLs to prevent mass expiration (e.g., add jitter).— platform team
- P1Add alert on cache miss rate spike >50%.— observability team
- P1Pre-warm cache before major traffic events (Black Friday).— SRE team
- P2Review DB connection pool sizing and add queueing limits.— platform team
Similar past incidents
lexical match (pg_trgm)
- 100%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 68%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 50%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 49%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 27%
[datadog] checkout p99 8s
checkout p99 8s