[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svccreated: 6/3/2026, 12:06:21 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC. Symptoms: - catalog-svc p99: 15s (baseline 80ms) - 503 rate: 8% (intermittent during DB overload) - Postgres CPU: 100% sustained, lock waits climbing - Redis CPU: 25% (looks healthy) - Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%) - All product detail page requests are flooding through to DB Background: - We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h - 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously - Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast) - No per-key locking; every cache miss triggers a fresh DB query Logs: - 08:59:58 UTC — last cache hit logged - 09:00:00 UTC — first wave of "cache miss + DB query" log lines - 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing On-call: - 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate) - 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric - 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail cache keys to expire simultaneously, flooding Postgres with 12x normal traffic. DB CPU hit 100%, p99 latency rose to 15s, and 8% of requests returned 503. Redis is healthy but cache miss rate is 95%.
Severity reasoning: User-facing outage with 8% error rate and 15s latency, revenue impact on Black Friday, and potential for cascading failures to checkout service.
deepseek-chat·prompt v1·output: en·11428ms·1424↑ / 1507↓ tok·$0.00204
Root cause hypotheses
- highCache TTL of 7h caused all keys to expire at 09:00 UTC, coinciding with Black Friday traffic ramp.
Evidence: Cache miss rate jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC; pre-warm at 02:00 UTC + 7h = 09:00 UTC.
- highNo per-key locking or probabilistic early expiration allowed all concurrent requests to trigger DB queries on cache miss.
Evidence: Every cache miss triggers a fresh DB query; no locking mechanism in place; DB connection pool saturated immediately.
- mediumBlack Friday marketing email blast caused sudden traffic spike exceeding DB capacity.
Evidence: Traffic at 12x normal at 09:00 UTC; DB CPU 100% sustained; lock waits climbing.
Investigation checklist
- Check Redis cache miss rate for catalog keys.
redis-cli --raw INFO stats | grep keyspace_missesExpected: keyspace_misses_hits ratio > 0.95 (95% miss rate)
- Check Postgres active queries and lock waits.
SELECT pid, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;Expected: Many queries waiting on 'relation' lock or 'datafile' read
- Check DB connection pool saturation.
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';Expected: Count near max_connections (e.g., 200)
- Check catalog-svc error rate and latency.
kubectl -n production top pods -l app=catalog-svcExpected: High CPU/memory usage; also check via Prometheus: rate(catalog_request_duration_seconds_count{status=~"5.."}[5m])
- Verify cache key TTL distribution.
redis-cli --scan --pattern 'catalog:item:*' | head -100 | xargs -L1 redis-cli ttlExpected: Most keys have TTL near 0 or negative (expired)
Mitigation plan
Enable rate limiting on catalog-svc to shed excess load (e.g., 50% of requests return 429).
Risk: Some users will see errors, but prevents total outage.
Rollback: Disable rate limiting by reverting config change.
Manually warm the cache by running a script that re-queries DB for popular items and sets cache with jittered TTL (e.g., 6-8h).
Risk: Script will add load to already overloaded DB; must be throttled.
Rollback: Kill the script; cache will repopulate naturally over time.
Increase DB connection pool and scale up DB instance (e.g., increase max_connections and CPU).
Risk: May cause memory pressure; requires restart.
Rollback: Revert config and scale down.
Customer impact
Users on Black Friday are experiencing 15-second load times and intermittent errors (503) when viewing product details. Checkout may also be affected due to catalog dependency. No ETA yet.
Postmortem draft
Postmortem: Catalog Cache Stampede on Black Friday
Summary: At 09:00 UTC, all catalog cache keys expired simultaneously, causing a stampede to Postgres. DB CPU 100%, p99 latency 15s, 8% 503s.
Timeline:
- 08:59:58 UTC: Last cache hit
- 09:00:00 UTC: Cache miss wave begins
- 09:00:02 UTC: DB connection pool saturated
- 09:02 UTC: Pager received
- 09:03 UTC: Diagnosis: cache stampede
- [Mitigation actions and times]
Impact: Black Friday traffic, 12x normal, 8% error rate, 15s latency.
Root Cause: Cache TTL of 7h without jitter caused all keys to expire at peak traffic. No per-key locking or probabilistic early expiration.
What Went Well: Redis remained healthy; monitoring alerted quickly.
What Went Poorly: Cache design lacked jitter and stampede protection; no load shedding in place.
Action Items:
- Add jitter to cache TTL (P0)
- Implement probabilistic early expiration (P0)
- Add rate limiting to catalog-svc (P1)
- Pre-warm cache with staggered TTLs (P1)
Follow-ups
- P0Add jitter to cache TTL (e.g., 6-8h random) for all catalog keys.— service owner
- P0Implement probabilistic early expiration (e.g., XFetch algorithm) to prevent stampedes.— service owner
- P1Add rate limiting to catalog-svc with circuit breaker for DB overload.— on-call SRE
- P1Review and improve cache pre-warming strategy to avoid simultaneous expiry.— platform team
- P2Set up alert on cache miss rate > 50% to detect stampedes early.— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 81%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 65%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 49%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 48%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 26%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out