[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svccreated: 6/3/2026, 12:05:46 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC. Symptoms: - catalog-svc p99: 15s (baseline 80ms) - 503 rate: 8% (intermittent during DB overload) - Postgres CPU: 100% sustained, lock waits climbing - Redis CPU: 25% (looks healthy) - Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%) - All product detail page requests are flooding through to DB Background: - We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h - 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously - Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast) - No per-key locking; every cache miss triggers a fresh DB query Logs: - 08:59:58 UTC — last cache hit logged - 09:00:00 UTC — first wave of "cache miss + DB query" log lines - 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing On-call: - 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate) - 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric - 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail cache keys to expire simultaneously, flooding Postgres with 12x normal traffic. DB CPU hit 100%, p99 latency rose to 15s, and 8% of requests returned 503. Redis is healthy but cache miss rate is 95%.
Severity reasoning: User-facing outage with 8% error rate and 15s latency, revenue impact on Black Friday, and potential for cascading failures to checkout service.
deepseek-chat·prompt v1·output: en·10568ms·1424↑ / 1479↓ tok·$0.00201
Root cause hypotheses
- highCache TTL of 7h caused all catalog:item:* keys to expire at 09:00 UTC, coinciding with Black Friday traffic ramp.
Evidence: Cache miss rate jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC; pre-warm at 02:00 UTC + 7h = 09:00 UTC.
- highNo per-key locking or request coalescing allowed all concurrent requests to trigger DB queries on cache miss.
Evidence: Logs show 'cache miss + DB query' for every request at 09:00 UTC; DB connection pool saturated immediately.
- mediumBlack Friday traffic spike (12x normal) from marketing email blast overwhelmed the DB.
Evidence: Traffic ramp at 09:00 UTC; DB CPU 100% sustained; lock waits climbing.
Investigation checklist
- Check Redis cache miss rate for catalog:item:* keys
redis-cli INFO stats | grep keyspace_missesExpected: keyspace_misses_hits_rate > 90% confirms cache stampede
- Check Postgres active queries and lock waits
SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;Expected: Many queries on catalog_items table with wait_event = 'LWLock' or 'Lock'
- Check catalog-svc error rate and latency
kubectl -n production top pods -l app=catalog-svc; curl -s http://catalog-svc:8080/metrics | grep -E 'http_request_duration_seconds|http_requests_total'Expected: p99 > 10s, 503 count increasing
- Verify cache TTL configuration
kubectl -n production exec deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache.*ttl'Expected: ttl: 7h or 25200s
Mitigation plan
Enable request coalescing for cache misses (e.g., singleflight pattern) to deduplicate concurrent DB queries for the same key.
Risk: May increase latency for the first request if DB is slow; no destructive operations.
Rollback: Disable coalescing via feature flag or config change.
Manually warm the cache by running a script that queries popular items and sets Redis keys with a staggered TTL (e.g., 6h + random offset).
Risk: Script may add load to DB; run with rate limiting and only for top-N items.
Rollback: Stop the script; keys will expire naturally.
If DB remains saturated, shed load by returning stale cached data (if available) or serving a degraded response for non-critical items.
Risk: Users may see stale data; acceptable during Black Friday.
Rollback: Revert to normal serving once DB load drops.
Customer impact
Customers on Black Friday are experiencing 15s load times and intermittent 503 errors when viewing product detail pages. Checkout may be affected due to catalog-svc dependency. No ETA yet.
Postmortem draft
Postmortem: Cache Stampede on catalog-svc (Black Friday)
Summary: At 09:00 UTC on Black Friday, a cache stampede caused all catalog:item:* keys to expire simultaneously, flooding Postgres with 12x traffic. DB CPU hit 100%, p99 latency 15s, 8% 503s.
Timeline:
- 02:00 UTC: Cache pre-warm with TTL=7h
- 08:59:58: Last cache hit
- 09:00:00: Cache miss wave begins
- 09:00:02: DB connection pool saturated
- 09:02: Pager alerts
- 09:03: Diagnosis: cache stampede
- 09:05: Mitigation started
Impact: 8% error rate, 15s latency for ~12x normal traffic, potential checkout impact.
Root Cause: Cache TTL alignment with traffic spike, no request coalescing.
What Went Well: Quick diagnosis via Redis miss rate metric.
What Went Poorly: No staggered TTL, no request coalescing, no load shedding plan.
Action Items:
- [ ] Implement request coalescing (singleflight) for cache misses
- [ ] Add jitter to cache TTL (e.g., 7h ± 30m)
- [ ] Add circuit breaker for DB queries
- [ ] Create runbook for cache stampede
Follow-ups
- P0Implement request coalescing (singleflight) for cache misses— service owner
- P0Add jitter to cache TTL to prevent simultaneous expiration— service owner
- P1Add circuit breaker for DB queries to protect from overload— platform team
- P1Create runbook for cache stampede incident response— on-call SRE
- P2Review Black Friday traffic patterns and pre-warm strategy— service owner
Similar past incidents
lexical match (pg_trgm)
- 81%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 65%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 49%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 48%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 26%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out