[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svc
created: 6/3/2026, 12:07:36 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.
Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB
Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query
Logs:
- 08:59:58 UTC: last cache hit logged
- 09:00:00 UTC: first wave of "cache miss + DB query" log lines
- 09:00:02 UTC: DB connection pool hits ceiling, queries start queueing
On-call:
- 09:02: pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03: diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05: debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail page requests to hit the database simultaneously, spiking Postgres CPU to 100%, p99 latency to 15s, and causing intermittent 503s. The stampede was triggered by the expiration of pre-warmed cache keys (TTL=7h) at the same time as a 12x traffic surge from a marketing email blast.
Severity reasoning: User-facing outage with p99 latency at 15s (baseline 80ms), a 503 error rate of 8% (above the 1% threshold for more than 5 min), and a broken revenue path (product detail pages are critical for Black Friday sales). This meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·15341ms·1805↑ / 1870↓ tok·$0.00254
Root cause hypotheses
- high: Cache stampede due to simultaneous key expiration and traffic surge
Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC, coinciding with TTL expiry of pre-warmed keys (set at 02:00 UTC, TTL=7h). DB CPU hit 100% immediately after.
- high: DB connection pool exhaustion from concurrent queries
Evidence: Logs at 09:00:02 UTC show connection pool hitting ceiling and queries queueing. Lock waits climbing.
- medium: Missing per-key locking or rate limiting on cache miss
Evidence: No evidence of locking in code; every cache miss triggers a fresh DB query. This is a design gap.
- low: Database query performance degradation (e.g., missing index, slow query)
Evidence: No specific slow query identified yet; DB CPU is 100% but could be due to volume rather than individual slow queries.
- low: Network latency or DNS issue causing retries
Evidence: No network errors in logs. Redis CPU is normal (25%) and the miss-rate spike coincides exactly with the TTL expiry, so a network problem is unlikely to be the primary cause.
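The "missing per-key locking" hypothesis can be illustrated with a minimal in-process sketch. It is not catalog-svc code: the class name `SingleFlightCache`, the `slow_db_query` loader, and the key `catalog:item:42` are hypothetical. It only shows how a per-key lock collapses many concurrent misses on one key into a single DB query. Across multiple service replicas a shared lock (for example in Redis) would be needed instead.

```python
import threading
import time
from typing import Callable, Dict, Hashable


class SingleFlightCache:
    """In-process cache where concurrent misses on one key share one load."""

    def __init__(self, loader: Callable[[Hashable], object]) -> None:
        self._loader = loader  # hypothetical stand-in for the DB query
        self._values: Dict[Hashable, object] = {}
        self._key_locks: Dict[Hashable, threading.Lock] = {}
        self._guard = threading.Lock()  # protects _key_locks

    def _lock_for(self, key: Hashable) -> threading.Lock:
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key: Hashable) -> object:
        if key in self._values:  # fast path: cache hit, no locking
            return self._values[key]
        with self._lock_for(key):
            # Re-check: another thread may have loaded the value while
            # this one was waiting for the per-key lock.
            if key in self._values:
                return self._values[key]
            value = self._loader(key)
            self._values[key] = value
            return value


def demo(concurrency: int = 50) -> int:
    """Fire `concurrency` simultaneous misses and count loader calls."""
    calls = []

    def slow_db_query(key):
        calls.append(key)
        time.sleep(0.05)  # simulate a slow query
        return {"id": key}

    cache = SingleFlightCache(slow_db_query)
    threads = [
        threading.Thread(target=cache.get, args=("catalog:item:42",))
        for _ in range(concurrency)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls)
```

Without the per-key lock, each of the concurrent requests in `demo` would call the loader itself, which is the behavior described in the incident context.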
Investigation checklist
- Check Redis cache miss rate for catalog keys
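redis-cli -h redis-cache -p 6379 INFO stats | grep -E 'keyspace_(hits|misses)'
Expected: Between two samples, misses grow far faster than hits (miss rate >90%), which confirms the stampede. These counters are cumulative and instance-wide, not specific to "catalog:item:*".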
- Check DB active connections and wait events
SELECT count(*), state, wait_event_type FROM pg_stat_activity WHERE state != 'idle' GROUP BY state, wait_event_type;
Expected: Many connections in 'active' state with wait_event_type 'Lock' or 'IO'
- Check DB CPU usage
kubectl top pod -n prod -l app=catalog-db --containers | grep -i postgres
Expected: CPU usage near 100%
- Check catalog-svc error logs for 503s
kubectl logs -n prod -l app=catalog-svc --since=10m --tail=-1 | grep -E 'HTTP 503|timeout|connection refused' | tail -20
Expected: Multiple 503 errors due to upstream DB timeout
- Verify cache key TTL and expiry pattern
redis-cli -h redis-cache -p 6379 TTL catalog:item:popular
Expected: -2 (key expired and not yet repopulated) or a small positive value; repeat for a sample of keys to confirm they expired together
- Check if any recent deployment changed cache logic
kubectl rollout history deployment/catalog-svc -n prod
Expected: No recent revisions; the stampede is operational, not caused by a deploy
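Because `keyspace_hits` and `keyspace_misses` in `INFO stats` are cumulative counters, a miss rate only makes sense as a delta between two samples. The sketch below shows one way to compute it from two captured `INFO stats` outputs; the function names are illustrative, and the sample numbers in the usage note are made up to match the 95% figure from the incident.

```python
def parse_info_stats(text: str) -> dict:
    """Extract keyspace_hits / keyspace_misses from `INFO stats` output."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        if name in ("keyspace_hits", "keyspace_misses"):
            stats[name] = int(value)
    return stats


def miss_rate(before: str, after: str) -> float:
    """Miss rate over the interval between two INFO stats samples."""
    b, a = parse_info_stats(before), parse_info_stats(after)
    hits = a["keyspace_hits"] - b["keyspace_hits"]
    misses = a["keyspace_misses"] - b["keyspace_misses"]
    total = hits + misses
    return misses / total if total else 0.0
```

For example, if hits go from 1000 to 1050 and misses from 20 to 970 between the two samples, the interval saw 50 hits and 950 misses, a miss rate of 0.95. The result is still instance-wide; a per-pattern rate for "catalog:item:*" would have to come from application metrics.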
Mitigation plan
Shed load by temporarily blocking non-critical traffic to catalog-svc (e.g., disable product recommendations via feature flag)
Risk: Reduces feature functionality but protects DB; may impact user experience but prevents total outage
Rollback: Re-enable feature flag once DB CPU drops below 70%
Increase Redis cache TTL for catalog keys to 12h so that keys repopulated during recovery do not expire again mid-peak (this does not restore keys that have already expired)
Risk: Stale data for up to 12h; acceptable during Black Friday
Rollback: Revert TTL to 7h after incident
Manually warm cache by running a script to re-populate popular keys from DB (if DB load allows)
Risk: Running additional DB queries could worsen load; only do if DB CPU < 80%
Rollback: Kill the warm-up script if DB CPU exceeds 90%
Scale up catalog-db read replicas to handle query load
Risk: May take minutes to provision; temporary cost increase
Rollback: Scale down after cache warms
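The manual warm-up step and its CPU guardrails (start only below 80%, stop above 90%) could be scripted roughly as follows. This is a sketch under assumptions: `load_and_cache` and `db_cpu_percent` are hypothetical callables that would wrap the real DB query plus Redis write and the real CPU metric; neither is specified in the incident notes.

```python
import time
from typing import Callable, Iterable, List


def warm_cache(
    keys: Iterable[str],
    load_and_cache: Callable[[str], None],  # hypothetical: query DB, write to Redis
    db_cpu_percent: Callable[[], float],    # hypothetical: current DB CPU in percent
    start_below: float = 80.0,
    abort_above: float = 90.0,
    pause: Callable[[float], None] = time.sleep,
    delay_s: float = 0.05,
) -> List[str]:
    """Warm keys one at a time, respecting the DB CPU guardrails."""
    warmed: List[str] = []
    if db_cpu_percent() >= start_below:
        return warmed  # DB too busy to start warming at all
    for key in keys:
        if db_cpu_percent() > abort_above:
            break  # rollback condition: stop before adding more load
        load_and_cache(key)
        warmed.append(key)
        pause(delay_s)  # throttle so warming does not itself stampede the DB
    return warmed
```

Warming keys sequentially with a small delay keeps the extra DB load bounded, and the returned list shows how far the script got before any abort.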
Customer impact
Users browsing product detail pages on the website and mobile app are experiencing slow load times (p99 of 15 seconds) and intermittent 503 errors. All users shopping during the Black Friday peak, at roughly 12x normal traffic, are affected. No data loss has occurred.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 02:00: Pre-warm cache job runs, sets TTL=7h
- 08:59:58: Last cache hit
- 09:00:00: All pre-warmed keys expire simultaneously; cache stampede begins
- 09:00:02: DB connection pool exhausted
- 09:02: Pager alerts triggered
- 09:03: Diagnosis: cache stampede
- [FILL IN]: Mitigation actions taken
Impact
- p99 latency 15s (baseline 80ms)
- 503 error rate 8%
- DB CPU 100% for [duration]
- Affected all product detail page requests
Root Cause
Simultaneous expiration of pre-warmed cache keys (TTL=7h) at peak traffic, causing a cache stampede. No per-key locking or rate limiting on cache miss.
Detection
Alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate. Redis cache miss rate metric confirmed stampede.
Response
- [FILL IN: actions taken, e.g., load shedding, TTL increase, cache warm]
What Went Well
- Redis cache miss metric was key to quick diagnosis
- [FILL IN: confirm whether feature flags were used for load shedding; at 09:05 the response options were still being debated]
What Went Poorly
- Cache TTL not staggered to avoid simultaneous expiry
- No per-key locking or rate limiting on cache miss
- Pre-warm job did not account for traffic surge timing
Action Items
- [FILL IN: e.g., implement per-key locking, stagger TTLs, add circuit breaker]
Follow-ups
- P0: Implement per-key locking (e.g., SETNX, or SET with NX and an expiry) on cache miss to prevent stampede (owner: catalog-svc team)
- P0: Stagger cache TTLs with jitter to avoid simultaneous expiry (owner: catalog-svc team)
- P1: Add circuit breaker for DB queries from catalog-svc (owner: platform team)
- P1: Improve pre-warm job to align with traffic patterns and add random TTL offset (owner: data engineering team)
- P1: Set up alert on cache miss rate spike (e.g., >50% deviation from baseline) (owner: observability team)
- P2: Review and update incident response runbook for cache stampede scenarios (owner: on-call SRE)
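The TTL-jitter follow-up can be sketched in a few lines. The 7h base comes from the pre-warm job described above; the ±10% spread and the function name `jittered_ttl` are assumptions chosen for illustration, not values from the incident.

```python
import random

BASE_TTL_S = 7 * 3600  # 7h, the TTL used by the nightly pre-warm job


def jittered_ttl(base_s: int = BASE_TTL_S, spread: float = 0.1, rng=random) -> int:
    """Return base_s scaled by a random factor in [1 - spread, 1 + spread]."""
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    return int(base_s * rng.uniform(1 - spread, 1 + spread))
```

With a 10% spread, keys written at 02:00 UTC would expire somewhere between roughly 08:18 and 09:42 UTC instead of all at 09:00:00. In the pre-warm job this value would be passed as the expiry on each write, for example via the `ex` argument of redis-py's `set`.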
Similar past incidents
lexical match (pg_trgm)
- 69%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 69%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 43%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 42%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 28%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out