[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svc
created: 5/25/2026, 10:23:37 PM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.
Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB
Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query
Logs:
- 08:59:58 UTC: last cache hit logged
- 09:00:00 UTC: first wave of "cache miss + DB query" log lines
- 09:00:02 UTC: DB connection pool hits ceiling, queries start queueing
On-call:
- 09:02: pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03: diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05: debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede caused catalog-svc p99 latency to spike from 80ms to 15s, with 8% 503 errors. Pre-warmed Redis keys with 7h TTL expired simultaneously, and 12x normal traffic from a marketing email blast flooded the DB, saturating Postgres CPU at 100% and causing connection pool exhaustion.
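Severity reasoning: User-facing degradation with an 8% error rate and 15s p99 latency, plus revenue impact on Black Friday. catalog-svc is a critical customer-facing service and checkout error alerts also fired, but this is a severe partial outage rather than a full one, and the incident context shows no evidence of data loss or inconsistency.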
deepseek-chat·prompt v1·output: en·12413ms
Root cause hypotheses
- high: Cache stampede from simultaneous key expiration at 09:00 UTC
Evidence: Redis cache miss rate for 'catalog:item:*' jumped to 95% at 09:00 UTC, coinciding with TTL expiry of pre-warmed keys (set at 02:00 UTC, TTL=7h).
- high: DB connection pool exhaustion due to surge of uncached queries
Evidence: Logs show DB connection pool hit ceiling at 09:00:02 UTC, queries started queueing. Postgres CPU at 100% sustained.
- high: Black Friday traffic spike overwhelming DB capacity
Evidence: Traffic ramp of 12x normal at 09:00 UTC from marketing email blast, combined with cache miss, caused DB overload.
- medium: Missing per-key locking or rate limiting on cache miss
Evidence: No per-key locking; every cache miss triggers a fresh DB query, leading to stampede. This is a design flaw.
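To make the last hypothesis concrete, here is a minimal sketch of per-key locking on a cache miss, so that only one caller per key queries the DB while the others wait for the cache to fill. It is an illustration under assumptions, not catalog-svc's actual code: `get_with_lock`, `InMemoryCache`, `slow_db_load`, and the `lock:` key prefix are hypothetical names, and the in-memory class only mimics the `get`, `set(..., ex=, nx=)`, and `delete` calls of a redis-py client (created with `decode_responses=True`) so the example runs without a server.

```python
import threading
import time
import uuid
from typing import Callable, Optional


class InMemoryCache:
    """Thread-safe stand-in for the redis-py calls used here: get, set(ex=, nx=), delete."""

    def __init__(self) -> None:
        self._data = {}  # name -> (value, expiry deadline or None)
        self._mutex = threading.Lock()

    def _get_unlocked(self, name: str) -> Optional[str]:
        entry = self._data.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[name]
            return None
        return value

    def get(self, name: str) -> Optional[str]:
        with self._mutex:
            return self._get_unlocked(name)

    def set(self, name: str, value: str, ex: Optional[float] = None, nx: bool = False):
        with self._mutex:
            if nx and self._get_unlocked(name) is not None:
                return None  # like redis-py: None when the NX condition fails
            expires_at = None if ex is None else time.monotonic() + ex
            self._data[name] = (value, expires_at)
            return True

    def delete(self, name: str) -> int:
        with self._mutex:
            return 1 if self._data.pop(name, None) is not None else 0


def get_with_lock(cache, key: str, load_from_db: Callable[[str], str],
                  ttl_s: int = 7 * 3600, lock_ttl_s: int = 10,
                  wait_s: float = 0.05, max_waits: int = 100) -> str:
    """Return the cached value; on a miss, let one caller per key rebuild it."""
    value = cache.get(key)
    if value is not None:
        return value
    lock_key = f"lock:{key}"  # hypothetical naming convention
    token = uuid.uuid4().hex  # identifies this caller as the lock owner
    for _ in range(max_waits):
        if cache.set(lock_key, token, ex=lock_ttl_s, nx=True):
            try:
                value = cache.get(key)  # re-check: someone may have filled it
                if value is None:
                    value = load_from_db(key)
                    cache.set(key, value, ex=ttl_s)
                return value
            finally:
                # Not atomic; a real Redis deployment would release via a Lua script.
                if cache.get(lock_key) == token:
                    cache.delete(lock_key)
        time.sleep(wait_s)  # another caller holds the lock; wait, then re-read
        value = cache.get(key)
        if value is not None:
            return value
    return load_from_db(key)  # last resort if the lock holder never finished


# Demo: 20 concurrent misses on one key.
db_calls = []


def slow_db_load(key: str) -> str:
    db_calls.append(key)
    time.sleep(0.05)  # simulate query latency
    return f"row-for-{key}"


cache = InMemoryCache()
results = []
workers = [
    threading.Thread(
        target=lambda: results.append(get_with_lock(cache, "catalog:item:12345", slow_db_load))
    )
    for _ in range(20)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(db_calls), len(results))
```

The lock expiry (`lock_ttl_s`) matters: without it, a worker that crashes while holding the lock would block every other caller for that key until someone removes the lock by hand.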
Investigation checklist
- Confirm cache miss rate and key expiration pattern
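redis-cli --raw -h <redis-host> INFO stats | grep -E 'keyspace_(hits|misses)' && redis-cli --raw -h <redis-host> TTL catalog:item:12345
Expected: keyspace_misses growing much faster than keyspace_hits between two samples (these counters are instance-wide, not per key pattern), and TTL returning -2 (key no longer exists) for sample keys that have expired.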
- Check DB connection pool usage and active queries
kubectl exec -n prod catalog-svc-0 -- psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
Expected: Active connections near the pool maximum (e.g., 100/100).
- Identify slow queries contributing to CPU saturation
kubectl exec -n prod catalog-svc-0 -- psql -c "SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"
Expected: Top queries are SELECT from catalog_item table with high total_exec_time. (On PostgreSQL 12 and earlier the columns are named total_time and mean_time.)
- Check if DB is CPU-bound or I/O-bound
kubectl exec -n prod catalog-svc-0 -- psql -c "SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE state = 'active';"
Expected: Mostly NULL wait_event_type (the backend is running on CPU) or 'Lock'/'LWLock' (contention). A majority of 'IO' wait events would instead point to an I/O-bound DB.
- Verify Redis is not the bottleneck
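redis-cli --raw -h <redis-host> INFO CPU | grep -E 'used_cpu_(sys|user)'
Expected: Counters grow slowly between two samples taken a few seconds apart (they are cumulative CPU seconds, not percentages), consistent with the reported 25% Redis CPU.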
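As a rough illustration of how a miss rate can be derived during the first check, the sketch below takes two samples of the cumulative `keyspace_hits` and `keyspace_misses` counters from `INFO stats` and computes the miss rate over the interval. The function names and the sample numbers are made up for the example; they are not incident data, and the counters cover the whole Redis instance rather than only `catalog:item:*` keys.

```python
def parse_info_stats(text: str) -> dict:
    """Parse 'field:value' lines from `redis-cli INFO stats`, keeping integer fields."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blanks and section headers such as "# Stats"
        field, _, value = line.partition(":")
        if value.isdigit():
            stats[field] = int(value)
    return stats


def miss_rate(before: dict, after: dict) -> float:
    """Miss rate over the interval between two INFO stats samples."""
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    total = hits + misses
    return misses / total if total else 0.0


# Illustrative samples only.
sample_t0 = parse_info_stats("# Stats\r\nkeyspace_hits:1000\r\nkeyspace_misses:20\r\n")
sample_t1 = parse_info_stats("# Stats\r\nkeyspace_hits:1050\r\nkeyspace_misses:970\r\n")
print(f"{miss_rate(sample_t0, sample_t1):.0%}")
```

Using the delta between samples rather than the raw totals avoids diluting a sudden spike with hours of healthy traffic recorded since the last restart.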
Mitigation plan
Immediately shed load by rate-limiting catalog-svc at the ingress gateway to 50% of current traffic
Risk: Some users will see 429 errors, but this prevents a total outage. The limit keeps rejecting legitimate traffic until it is reverted, so it must not be left in place.
Rollback: Gradually increase rate limit back to 100% after cache is warmed.
Manually warm Redis cache with a script that reads from a DB replica (not primary) to avoid further load
Risk: If replica is also overloaded, this could worsen. Use a dedicated read-only replica if available.
Rollback: Kill the warm-up script if load on the replica it reads from rises above 90%.
Extend the TTL on keys that are still cached or have just been repopulated, so they do not all expire together again (this won't help requests already in flight, but it prevents a second wave)
Risk: Stale data may be served for longer, but acceptable during incident.
Rollback: Set TTL back to original after incident.
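The manual warm-up step and its rollback condition could look roughly like the sketch below: it fills the cache in batches and stops as soon as the reported replica load crosses the 90% threshold. Everything here is an assumption for illustration. `warm_cache` and its callables (`fetch_batch`, `cache_set`, `replica_load`) are hypothetical; in a real script they would wrap a replica query, a Redis `SET` with expiry, and a load metric.

```python
import time
from typing import Callable, Mapping, Sequence


def warm_cache(
    item_ids: Sequence[int],
    fetch_batch: Callable[[Sequence[int]], Mapping[int, str]],
    cache_set: Callable[[str, str, int], None],
    replica_load: Callable[[], float],
    ttl_s: int = 7 * 3600,
    batch_size: int = 100,
    max_load: float = 0.90,
    pause_s: float = 0.0,
) -> int:
    """Warm catalog keys in batches; stop once replica load exceeds max_load."""
    warmed = 0
    for start in range(0, len(item_ids), batch_size):
        if replica_load() > max_load:
            break  # rollback condition from the plan: stop adding load
        batch = item_ids[start:start + batch_size]
        for item_id, row in fetch_batch(batch).items():
            # A uniform ttl_s would recreate a synchronized expiry; see the
            # jitter follow-up item for staggering it.
            cache_set(f"catalog:item:{item_id}", row, ttl_s)
            warmed += 1
        time.sleep(pause_s)  # pacing between batches
    return warmed


# Stand-ins so the sketch runs without a database or Redis.
fake_cache = {}
load_samples = iter([0.50, 0.60, 0.95])  # third reading trips the threshold

warmed = warm_cache(
    item_ids=list(range(250)),
    fetch_batch=lambda ids: {i: f"row-{i}" for i in ids},
    cache_set=lambda key, value, ttl: fake_cache.__setitem__(key, (value, ttl)),
    replica_load=lambda: next(load_samples),
)
print(warmed)
```

Checking the load before every batch, rather than once at the start, keeps the script from becoming a second source of overload if the replica degrades partway through.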
Customer impact
Customers on Black Friday are experiencing 15-second load times and intermittent 503 errors when viewing product detail pages. Approximately 8% of requests fail. No data loss, but degraded experience for all users hitting catalog-svc.
Postmortem draft
Postmortem: Catalog-svc Cache Stampede on Black Friday
Summary
- Date: YYYY-MM-DD (Black Friday)
- Duration: 09:00 - 09:XX UTC
- Severity: SEV1
- Impact: 8% error rate, 15s p99 latency for catalog-svc
Timeline
- 08:59:58 UTC: Last cache hit
- 09:00:00 UTC: Cache miss wave begins
- 09:00:02 UTC: DB connection pool exhausted
- 09:02 UTC: Pager received
- 09:03 UTC: Diagnosis: cache stampede
- 09:05 UTC: Mitigation options debated (extend TTL, manual cache warm, load shedding)
Impact
- Users: All catalog-svc users
- Revenue: Estimated $X loss
- Data: No loss
Root Cause
- Pre-warmed cache keys with 7h TTL expired simultaneously at 09:00 UTC
- Black Friday traffic 12x normal
- No per-key locking or rate limiting on cache miss
What Went Well
- Monitoring alerted quickly
- Redis metrics clearly showed cache miss rate
What Went Poorly
- Cache TTL not staggered
- No rate limiting on cache miss
- DB replica not used for warm-up
Action Items
- [ ] Implement per-key locking (owner: catalog team, P0)
- [ ] Stagger cache TTLs (owner: catalog team, P0)
- [ ] Add rate limiting at ingress (owner: platform team, P1)
- [ ] Set up read-only replica for cache warm-up (owner: DB team, P1)
Follow-ups
- P0: Implement per-key locking (e.g., Redis SETNX) to prevent stampede (owner: catalog-svc owner)
- P0: Stagger cache TTLs with jitter to avoid simultaneous expiry (owner: catalog-svc owner)
- P1: Add rate limiting at ingress gateway for catalog-svc (owner: platform team)
- P1: Provision a read-only DB replica for cache warm-up scripts (owner: DB team)
- P2: Review and update incident response runbook for cache stampede (owner: on-call SRE)
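For the TTL jitter follow-up, a minimal sketch is shown here. It assumes a uniform spread of plus or minus 20% around the 7h base TTL; the `jittered_ttl` helper and the 20% figure are illustrative choices, not values taken from the incident context.

```python
import random
from typing import Optional

BASE_TTL_S = 7 * 3600  # the 7h TTL from the incident context


def jittered_ttl(base_ttl_s: int = BASE_TTL_S, spread: float = 0.2,
                 rng: Optional[random.Random] = None) -> int:
    """Return the base TTL shifted by a uniform random offset within +/- spread."""
    source = rng if rng is not None else random.Random()
    delta = base_ttl_s * spread
    return int(base_ttl_s + source.uniform(-delta, delta))


# Demo with a fixed seed so runs are repeatable.
demo_rng = random.Random(42)
ttls = [jittered_ttl(rng=demo_rng) for _ in range(10_000)]
print(min(ttls), max(ttls), len(set(ttls)))
```

With a 02:00 UTC warm-up and these settings, expiries would be spread across roughly 07:36 to 10:24 UTC instead of all landing at 09:00 UTC. Jitter spreads the load but does not remove the need for per-key locking, since a single hot key can still be missed by many concurrent requests.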
Similar past incidents
lexical match (pg_trgm)
- 68%: [Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 46%: [Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 45%: [Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 31%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 29%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts