[Eval][v3][en] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svccreated: 6/3/2026, 12:09:32 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC. Symptoms: - catalog-svc p99: 15s (baseline 80ms) - 503 rate: 8% (intermittent during DB overload) - Postgres CPU: 100% sustained, lock waits climbing - Redis CPU: 25% (looks healthy) - Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%) - All product detail page requests are flooding through to DB Background: - We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h - 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously - Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast) - No per-key locking; every cache miss triggers a fresh DB query Logs: - 08:59:58 UTC — last cache hit logged - 09:00:00 UTC — first wave of "cache miss + DB query" log lines - 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing On-call: - 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate) - 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric - 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
At 09:00 UTC on Black Friday, a cache stampede caused catalog-svc p99 latency to spike to 15s (baseline 80ms) and intermittent 503 errors at 8%. The pre-warmed Redis cache for catalog items expired simultaneously (TTL=7h from 02:00 UTC) under 12x normal traffic from a marketing email blast, flooding Postgres with queries and driving CPU to 100% with lock waits. Mitigation is underway to shed load and manually warm the cache.
Severity reasoning: User-facing outage: p99 latency 15s (baseline 80ms) and 503 error rate 8% > 1% for >5 minutes, affecting all product detail page requests. Revenue path is broken as catalog-svc is critical for Black Friday shopping. This meets SEV1 criteria.
deepseek-chat·prompt v3·output: en·15943ms·2056↑ / 2154↓ tok·$0.00292
Root cause hypotheses
- highCache stampede due to simultaneous key expiration at 09:00 UTC under high traffic
Evidence: Redis cache miss rate for 'catalog:item:*' jumped to 95% at 09:00 UTC (baseline 2%). Last cache hit at 08:59:58 UTC. TTL=7h from 02:00 UTC pre-warm aligns with expiration at 09:00 UTC.
- highPostgres connection pool exhaustion from concurrent cache miss queries
Evidence: Logs at 09:00:02 UTC show 'DB connection pool hits ceiling, queries start queueing'. Postgres CPU at 100% sustained with lock waits climbing.
- lowDatabase query performance degradation due to missing index or slow query
Evidence: No direct evidence of missing indexes; baseline p99 was 80ms. The sudden spike at 09:00 UTC correlates with cache expiration, not a schema change. Unlikely.
- lowNetwork latency or resource contention on database host
Evidence: No network errors reported. CPU is 100% due to query load, not I/O wait. Likely not a network issue.
- lowRedis instance failure or misconfiguration causing cache misses
Evidence: Redis CPU is 25% (healthy), and cache miss rate is specific to catalog keys. Redis itself is operational. Unlikely.
Investigation checklist
- Check current Redis cache miss rate for catalog keys
redis-cli -h redis-catalog -p 6379 INFO stats | grep -E 'keyspace_misses|keyspace_hits'Expected: keyspace_misses should be high (e.g., >90%) if stampede ongoing; if low, stampede may have subsided.
- Check Postgres active queries and lock waits
psql -h pg-catalog -U app -c "SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;"Expected: Many queries with state 'active' and wait_event_type 'LWLock' or 'Lock' indicating contention.
- Check Postgres connection pool usage
psql -h pg-catalog -U app -c "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';"Expected: Count near max_connections (e.g., 200) indicating pool exhaustion.
- Check catalog-svc error logs for 503s
kubectl logs -n prod -l app=catalog-svc --since=15m | grep -E 'HTTP 503|timeout|connection refused' | tail -50Expected: Multiple 503 lines with timestamps after 09:00 UTC.
- Verify cache TTL configuration
kubectl exec -n prod deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache.*ttl'Expected: TTL set to 7h (25200s) for catalog items.
- Check if any recent deployment changed cache behavior
kubectl rollout history -n prod deploy/catalog-svcExpected: No recent changes; last rollout >24h ago.
Mitigation plan
Shed load by rate-limiting catalog-svc at ingress (e.g., nginx) to reduce DB pressure
Risk: Some users will see 429 errors, but prevents complete outage. Blast radius: all catalog requests.
Rollback: Remove rate limit by reverting ingress config change.
Manually warm Redis cache by running a script to re-populate popular catalog items with a longer TTL (e.g., 12h)
Risk: Script may increase DB load further if not throttled. Blast radius: DB CPU may spike temporarily.
Rollback: Stop the script; cache will expire naturally.
Increase Postgres connection pool size temporarily (e.g., from 100 to 200) to handle queued queries
Risk: May increase CPU/memory pressure on DB. Blast radius: DB host resources.
Rollback: Revert connection pool size to original value.
Extend cache TTL for catalog items to 12h to prevent near-future expiration during peak
Risk: Stale data may be served for up to 12h if items change. Blast radius: catalog freshness.
Rollback: Revert TTL to 7h.
Customer impact
Users are experiencing slow loading (up to 15 seconds) and intermittent errors when viewing product detail pages. Approximately 12x normal traffic from a Black Friday marketing campaign is affected. Estimated 100% of users hitting catalog pages are impacted. No ETA yet.
Postmortem draft
Summary
On Black Friday at 09:00 UTC, a cache stampede caused catalog-svc p99 latency to spike to 15s and 8% 503 errors. The pre-warmed Redis cache expired simultaneously under 12x traffic, flooding Postgres.
Timeline (UTC)
- 02:00: Pre-warm cache with TTL=7h
- 08:59:58: Last cache hit
- 09:00:00: Cache miss wave begins
- 09:00:02: DB connection pool exhausted
- 09:02: Pager triggered
- 09:03: Diagnosis: cache stampede
- [FILL IN] Mitigation actions taken
- [FILL IN] Incident resolved
Impact
- p99 latency: 15s (baseline 80ms)
- 503 error rate: 8%
- Affected users: all catalog page requests
- Revenue impact: [FILL IN]
Root Cause
Cache stampede due to simultaneous expiration of pre-warmed keys at 09:00 UTC, coinciding with 12x traffic from Black Friday marketing blast. No per-key locking or staggered expiration.
Detection
Alerts for DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate at 09:02. Redis miss-rate metric confirmed stampede.
Response
- [FILL IN] Load shedding, manual cache warm, connection pool increase
- [FILL IN] Communication
What Went Well
- Alerting caught the issue quickly
- Redis miss-rate metric was key to diagnosis
What Went Poorly
- Cache TTL not staggered for high-traffic events
- No per-key locking or circuit breaker for cache misses
- Pre-warm timing not aligned with traffic patterns
Action Items
- [ ] Implement staggered cache TTL with jitter
- [ ] Add per-key locking (e.g., SETNX) for cache misses
- [ ] Set up circuit breaker for DB queries under load
- [ ] Review pre-warm schedule for Black Friday
- [ ] Add load shedding automation
Follow-ups
- P0Implement staggered cache TTL with random jitter to prevent simultaneous expiration— catalog-svc team
- P0Add per-key locking (e.g., Redis SETNX) for cache misses to reduce DB load— catalog-svc team
- P1Set up circuit breaker for catalog-svc DB queries to shed load automatically— platform team
- P1Review pre-warm schedule and TTL for Black Friday and other peak events— catalog-svc team
- P1Add alert on cache miss rate spike to detect stampedes earlier— on-call SRE
- P2Document incident response steps for cache stampede in runbook— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 74%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 61%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 41%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 40%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 27%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted