[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svccreated: 6/3/2026, 12:06:05 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC. Symptoms: - catalog-svc p99: 15s (baseline 80ms) - 503 rate: 8% (intermittent during DB overload) - Postgres CPU: 100% sustained, lock waits climbing - Redis CPU: 25% (looks healthy) - Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%) - All product detail page requests are flooding through to DB Background: - We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h - 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously - Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast) - No per-key locking; every cache miss triggers a fresh DB query Logs: - 08:59:58 UTC — last cache hit logged - 09:00:00 UTC — first wave of "cache miss + DB query" log lines - 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing On-call: - 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate) - 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric - 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail page requests to hit Postgres simultaneously after a pre-warmed cache (TTL=7h) expired. DB CPU hit 100%, p99 latency rose to 15s, and 8% of requests returned 503. Redis is healthy but cache miss rate is 95%.
Severity reasoning: User-facing outage with 8% error rate and 15s latency, revenue impact on Black Friday, and DB overload risking data integrity. Full outage of a critical system (catalog) for a subset of users.
deepseek-chat·prompt v1·output: en·13353ms·1424↑ / 1736↓ tok·$0.00229
Root cause hypotheses
- highCache stampede from simultaneous TTL expiry of pre-warmed catalog cache at 09:00 UTC combined with 12x Black Friday traffic.
Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC; pre-warm TTL=7h from 02:00 UTC expires exactly at 09:00 UTC.
- highDB connection pool exhaustion due to concurrent cache miss queries overwhelming Postgres.
Evidence: DB CPU 100% sustained, lock waits climbing, connection pool hit ceiling at 09:00:02 UTC, queries started queueing.
- mediumMissing per-key locking or request coalescing in cache miss handling.
Evidence: No per-key locking; every cache miss triggers a fresh DB query, leading to duplicate queries for the same key.
Investigation checklist
- Check current DB connection pool usage and query queue depth.
kubectl exec -n prod catalog-svc-0 -- pg_isready -h catalog-db -U app && kubectl exec -n prod catalog-svc-0 -- psql -h catalog-db -U app -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"Expected: Active connections near max (e.g., 100/100) and high wait count.
- Identify top queries consuming DB CPU.
kubectl exec -n prod catalog-svc-0 -- psql -h catalog-db -U app -c "SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"Expected: Multiple identical SELECT queries for catalog items (e.g., 'SELECT * FROM items WHERE id = $1').
- Confirm Redis cache miss rate and key pattern.
kubectl exec -n prod catalog-svc-0 -- redis-cli -h catalog-redis INFO stats | grep -E 'keyspace_misses|keyspace_hits'Expected: keyspace_misses >> keyspace_hits, miss rate ~95%.
- Check if any cache keys are still valid (e.g., recently accessed items).
kubectl exec -n prod catalog-svc-0 -- redis-cli -h catalog-redis TTL catalog:item:popular123Expected: TTL = -2 (expired) or very low.
- Verify traffic spike from marketing email blast.
kubectl exec -n prod catalog-svc-0 -- curl -s http://catalog-svc:8080/metrics | grep 'http_requests_total' | tail -5Expected: Request rate 12x normal (e.g., 12000 req/s vs 1000 req/s).
Mitigation plan
Enable request coalescing for cache misses (e.g., use Redis SETNX with short TTL to deduplicate concurrent misses for same key).
Risk: May add latency for the first request per key; no destructive operations.
Rollback: Disable coalescing feature flag.
Manually warm the cache by running a script that re-queries DB for popular items and writes to Redis with a staggered TTL (e.g., 5-10 min jitter).
Risk: Additional DB load during warm-up; could worsen DB CPU if not throttled.
Rollback: Stop the warm-up script immediately; DB load will drop as cache fills.
Shed load by temporarily blocking non-critical traffic (e.g., disable product recommendations API) via feature flag or rate limiter.
Risk: Degraded user experience for recommendations; no data loss.
Rollback: Re-enable recommendations API.
Customer impact
Customers on Black Friday are experiencing 15-second load times and intermittent 503 errors when viewing product detail pages. Approximately 8% of requests fail. No data loss. ETA for full recovery: 30 minutes after mitigation actions.
Postmortem draft
Postmortem: Catalog Cache Stampede on Black Friday
Summary: At 09:00 UTC, a pre-warmed cache with TTL=7h expired simultaneously, causing a stampede to Postgres under 12x traffic. DB CPU 100%, p99 latency 15s, 8% 503s.
Timeline:
- 02:00 UTC: Pre-warm cache completed (TTL=7h)
- 08:59:58: Last cache hit
- 09:00:00: Cache miss wave begins
- 09:00:02: DB connection pool exhausted
- 09:02: Pager received
- 09:03: Diagnosis: cache stampede
- 09:05: Mitigation actions started
Impact: 8% error rate, 15s latency for catalog-svc, affecting all product detail pages.
Root Cause: Cache stampede due to simultaneous TTL expiry without per-key locking or request coalescing.
What Went Well:
- Monitoring detected the issue within 2 minutes.
- Redis remained healthy.
What Went Poorly:
- No request coalescing for cache misses.
- Pre-warm TTL not staggered.
- No circuit breaker for DB overload.
Action Items:
- [P0] Implement request coalescing for cache misses.
- [P1] Add jitter to cache TTLs.
- [P1] Add circuit breaker for DB queries.
- [P2] Review pre-warm strategy for high-traffic events.
Follow-ups
- P0Implement request coalescing for cache misses (e.g., Redis SETNX with short TTL).— service owner
- P1Add jitter to cache TTLs to prevent simultaneous expiry.— service owner
- P1Add circuit breaker for DB queries to shed load under high latency.— platform team
- P2Review pre-warm strategy for Black Friday and other high-traffic events.— on-call SRE
- P2Add alert for cache miss rate > 50%.— platform team
Similar past incidents
lexical match (pg_trgm)
- 70%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 62%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 50%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 49%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 26%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out