[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svc
created: 6/2/2026, 11:41:01 PM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.
Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB
Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query
Logs:
- 08:59:58 UTC: last cache hit logged
- 09:00:00 UTC: first wave of "cache miss + DB query" log lines
- 09:00:02 UTC: DB connection pool hits ceiling, queries start queueing
On-call:
- 09:02: pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03: diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05: debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail page requests to hit Postgres simultaneously after a pre-warmed cache (TTL=7h) expired. Combined with 12x normal traffic from a marketing email blast, DB CPU reached 100%, p99 latency spiked to 15s, and 8% of requests returned 503. Redis cache miss rate jumped from 2% to 95%.
Severity reasoning: User-facing outage with 8% error rate and 15s latency, revenue impact on Black Friday, and potential for cascading failures to downstream services (e.g., checkout).
deepseek-chat·prompt v1·output: en·11678ms·1424↑ / 1826↓ tok·$0.00239
Root cause hypotheses
- high: Cache TTL expiration at 09:00 UTC caused all keys to expire simultaneously, triggering a stampede.
Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC.
- high: DB connection pool exhaustion due to concurrent queries from cache misses.
Evidence: DB CPU 100%, lock waits climbing, connection pool hitting ceiling at 09:00:02 UTC.
- medium: The Black Friday traffic spike (12x normal) would have overwhelmed the DB even without the cache stampede.
Evidence: Marketing email blast at 09:00 UTC; traffic ramp confirmed.
- high: Missing per-key locking or request coalescing in catalog-svc.
Evidence: No per-key locking; every cache miss triggers a fresh DB query.
- low: Database query performance degradation due to a missing index or plan change.
Evidence: No evidence of plan changes; baseline latency was 80ms before stampede.
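The missing request coalescing named in the hypotheses above can be illustrated with a minimal single-flight sketch. It assumes an in-process, thread-based service; the SingleFlight class and the loader callable are hypothetical names for illustration, not taken from catalog-svc. Concurrent misses on the same key share one DB query instead of each issuing their own.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class SingleFlight:
    """Collapse concurrent loads of the same key into a single loader call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, Future] = {}

    def do(self, key: str, loader: Callable[[], T]) -> T:
        # Decide under the lock whether this caller leads or follows.
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut
        if not leader:
            # Follower: wait for the leader's result (or re-raise its error).
            return fut.result()
        try:
            value = loader()  # hypothetical: the single DB query for this key
            fut.set_result(value)
            return value
        except BaseException as exc:
            fut.set_exception(exc)
            raise
        finally:
            # Later callers start a fresh load (normally after a cache check).
            with self._lock:
                self._inflight.pop(key, None)
```

This only coalesces requests inside one process. Across replicas, a short-lived lock in Redis would be one way to get a similar effect, at the cost of extra round trips.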
Investigation checklist
- Check Redis cache miss rate for catalog:item:* keys.
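redis-cli INFO stats | grep -E 'keyspace_(hits|misses)'; redis-cli --scan --pattern 'catalog:item:*' | wc -l
Expected: Miss rate near 95%, computed from the change in keyspace_hits and keyspace_misses (cumulative, server-wide counters), and a key count far below the pre-warmed total.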
- Check Postgres active queries and lock waits.
SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;
Expected: Many queries waiting on 'relation' or 'tuple' locks, all selecting from catalog_items.
- Check DB connection pool usage.
SELECT count(*) FROM pg_stat_activity;
Expected: Count near max_connections (e.g., 200).
- Check catalog-svc error logs for 503s.
kubectl logs -l app=catalog-svc --tail=1000 | grep '503' | head -20
Expected: Lines like 'http: response 503' with 'connection refused' or 'timeout'.
- Verify cache TTL configuration.
kubectl exec deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache'
Expected: TTL: 7h, pre-warm at 02:00 UTC.
- Check if any feature flag can disable catalog-svc or serve stale cache.
kubectl exec deploy/catalog-svc -- curl -s localhost:8080/health | jq '.features'
Expected: Feature flags like 'stale_cache_enabled' or 'catalog_fallback'.
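Because keyspace_hits and keyspace_misses in INFO stats are cumulative since server start, the miss rate for the incident window has to come from the difference between two snapshots. A small sketch of that arithmetic follows; the function names are illustrative, and the result is server-wide, not limited to catalog:item:* keys.

```python
from typing import Dict, Union

Stat = Union[int, str]


def parse_info_stats(info_text: str) -> Dict[str, Stat]:
    """Parse the key:value lines printed by `redis-cli INFO stats`."""
    stats: Dict[str, Stat] = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blanks and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        try:
            stats[key] = int(value)
        except ValueError:
            stats[key] = value
    return stats


def miss_rate(before: Dict[str, Stat], after: Dict[str, Stat]) -> float:
    """Fraction of lookups that missed between two INFO snapshots."""
    hits = int(after["keyspace_hits"]) - int(before["keyspace_hits"])
    misses = int(after["keyspace_misses"]) - int(before["keyspace_misses"])
    total = hits + misses
    return misses / total if total > 0 else 0.0
```

Taking the two snapshots a few seconds apart during the incident should yield a value close to the 95% reported by the metrics, provided catalog lookups dominate the keyspace traffic.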
Mitigation plan
Enable stale cache serving (serve stale data while refreshing in background) if feature exists.
Risk: Users may see slightly outdated product info (price, stock).
Rollback: Disable stale cache feature flag.
Manually warm Redis cache with a script that reads from a DB replica (not primary) with rate limiting.
Risk: Script could overload DB replica; must use throttling (e.g., 100 queries/sec).
Rollback: Kill the script; revert to current state.
Increase DB connection pool and add read replicas for catalog queries.
Risk: A larger pool may mask the underlying issue and, with Postgres CPU already at 100%, can add contention rather than throughput; replicas may lag.
Rollback: Reduce pool size and remove replicas after incident.
Shed load by rate-limiting catalog-svc at ingress (e.g., 50% of traffic) or redirect to a static fallback page.
Risk: Some users will see degraded experience or error page.
Rollback: Remove rate limit or redirect.
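The throttled cache warm in the plan above could look roughly like the sketch below. The load and store callables are hypothetical stand-ins for "read one item from the DB replica" and "write it to Redis with a TTL"; the 100 queries/sec default mirrors the example figure in the risk note. The clock and sleep parameters exist only to make the pacing testable.

```python
import time
from typing import Any, Callable, Iterable


def warm_cache(
    keys: Iterable[str],
    load: Callable[[str], Any],          # hypothetical: query the DB replica
    store: Callable[[str, Any], None],   # hypothetical: write to Redis with a TTL
    max_per_sec: float = 100.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Warm keys one at a time, never starting more than max_per_sec loads per second."""
    interval = 1.0 / max_per_sec
    next_slot = clock()
    warmed = 0
    for key in keys:
        now = clock()
        if now < next_slot:
            sleep(next_slot - now)
        # If loads run slower than the interval, do not burst to catch up.
        next_slot = max(next_slot, now) + interval
        store(key, load(key))
        warmed += 1
    return warmed
```

Warming the most requested items first would reduce DB load soonest. Killing the process at any point leaves already-written keys in place, which matches the rollback described for this step.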
Customer impact
Users browsing product detail pages on Black Friday are seeing load times of up to 15 seconds at p99 (baseline 80ms) and intermittent errors (8% of requests return 503). Checkout may be indirectly affected if it depends on catalog-svc; the CheckoutErrorRate alert fired at 09:02. No data loss expected.
Postmortem draft
Postmortem: Catalog Cache Stampede on Black Friday
Summary: At 09:00 UTC, a pre-warmed cache with 7h TTL expired simultaneously, causing a stampede to Postgres. Combined with 12x traffic, DB CPU hit 100%, p99 latency 15s, 8% 503s.
Timeline:
- 02:00 UTC: Cache pre-warmed (TTL=7h)
- 09:00 UTC: Cache expires, traffic spike from marketing email
- 09:02 UTC: Pager alerts
- 09:03 UTC: Diagnosis: cache stampede
- 09:05 UTC: On-call debating mitigation options (extend TTL, warm cache manually, shed load)
Impact: 8% error rate and 15s p99 latency. Duration and revenue impact TBD.
Root Cause: Cache TTL expiration without jitter or per-key locking, combined with traffic spike.
What Went Well:
- Monitoring detected issue within 2 minutes
- Redis metrics clearly showed cache miss rate
What Went Poorly:
- No stale cache serving
- No request coalescing
- Cache expiry (02:00 UTC pre-warm + 7h TTL) coincided with the 09:00 UTC traffic spike
Action Items:
- [ ] Add jitter to cache TTL (P0)
- [ ] Implement per-key locking / request coalescing (P0)
- [ ] Enable stale cache serving (P1)
- [ ] Add load shedding at ingress (P1)
- [ ] Review Black Friday traffic patterns (P2)
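The TTL jitter action item can be sketched in a few lines. The 7h base comes from the incident context; the ±10% spread and the function name are assumptions for illustration, not the team's chosen values.

```python
import random
from typing import Optional

BASE_TTL_SECONDS = 7 * 3600  # 7h, as used by the nightly pre-warm


def jittered_ttl(
    base_seconds: int = BASE_TTL_SECONDS,
    spread: float = 0.10,  # assumed: +/-10% of the base TTL
    rng: Optional[random.Random] = None,
) -> int:
    """Return a TTL drawn uniformly from base*(1-spread) .. base*(1+spread)."""
    source = rng or random
    delta = base_seconds * spread
    return int(base_seconds + source.uniform(-delta, delta))
```

With these values each key gets a TTL between 22680 and 27720 seconds, so keys written at 02:00 UTC would expire spread across roughly 08:18 to 09:42 UTC instead of all at 09:00:00.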
Follow-ups
- P0: Add random jitter to cache TTL to prevent simultaneous expiration. (owner: service owner)
- P0: Implement per-key locking or request coalescing for cache misses. (owner: service owner)
- P1: Enable stale cache serving with background refresh. (owner: service owner)
- P1: Add ingress rate limiting and load shedding for catalog-svc. (owner: platform team)
- P2: Review and adjust pre-warm schedule to avoid peak traffic alignment. (owner: service owner)
- P1: Set up alert on cache miss rate spike. (owner: on-call SRE)
- P2: Document incident response steps for cache stampede. (owner: on-call SRE)
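The "stale cache serving with background refresh" follow-up is commonly implemented as stale-while-revalidate. The sketch below is an in-process illustration only: the class name, the soft/hard TTL split, and the loader callable are assumptions, and a real version for catalog-svc would keep values in Redis instead of a local dict.

```python
import threading
import time
from typing import Any, Callable, Dict, Set, Tuple


class StaleWhileRevalidateCache:
    """Serve stale values between soft_ttl and hard_ttl while one thread refreshes."""

    def __init__(self, soft_ttl: float, hard_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._soft_ttl = soft_ttl
        self._hard_ttl = hard_ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[Any, float]] = {}  # key -> (value, stored_at)
        self._refreshing: Set[str] = set()

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self._clock()
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                value, stored_at = entry
                age = now - stored_at
                if age < self._soft_ttl:
                    return value  # fresh
                if age < self._hard_ttl:
                    # Stale but usable: start at most one refresh per key.
                    if key not in self._refreshing:
                        self._refreshing.add(key)
                        threading.Thread(target=self._refresh,
                                         args=(key, loader), daemon=True).start()
                    return value
        # Missing or older than hard_ttl: load synchronously (not coalesced here).
        value = loader(key)
        self._store(key, value)
        return value

    def _store(self, key: str, value: Any) -> None:
        with self._lock:
            self._entries[key] = (value, self._clock())

    def _refresh(self, key: str, loader: Callable[[str], Any]) -> None:
        try:
            self._store(key, loader(key))
        except Exception:
            pass  # keep serving the stale value; a later request retries
        finally:
            with self._lock:
                self._refreshing.discard(key)
```

The trade-off is the one already noted in the mitigation plan: between the soft and hard TTL, users may see slightly outdated price or stock data, so the hard TTL bounds how stale a response can be.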
Similar past incidents
lexical match (pg_trgm)
- 77%: [Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 72%: [Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 45%: [Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 44%: [Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 28%: [Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.