← All incidents

[Eval][v3][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svccreated: 6/2/2026, 11:42:07 PM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede caused catalog-svc p99 latency to spike to 15s and intermittent 503 errors. Pre-warmed Redis keys with 7h TTL expired simultaneously, coinciding with a 12x traffic surge from a marketing email blast. Every cache miss triggered a DB query, saturating Postgres CPU at 100% and exhausting the connection pool. Redis CPU remained healthy but cache miss rate hit 95%.

Severity reasoning: User-facing outage with error rate >1% (8% 503s) for >5 min, revenue path broken (product detail pages failing), and regional unavailability risk. Matches SEV1 rubric: error rate >1% for >5 min and revenue path broken.

deepseek-chat·prompt v3·output: en·14749ms·2056↑ / 2291↓ tok·$0.00308

Root cause hypotheses

  • highCache stampede due to simultaneous key expiration at 09:00 UTC

    Evidence: Redis cache miss rate for 'catalog:item:*' keys jumped from 2% to 95% at 09:00 UTC. Pre-warm TTL=7h from 02:00 UTC expired exactly at 09:00 UTC. Logs show last cache hit at 08:59:58, first miss at 09:00:00.

  • highDB connection pool exhaustion from concurrent cache miss queries

    Evidence: Postgres CPU 100% sustained, lock waits climbing, connection pool hit ceiling at 09:00:02 UTC. Queries started queueing. 8% 503s indicate pool exhaustion.

  • highBlack Friday traffic surge amplified cache miss impact

    Evidence: Traffic ramp 12x normal at 09:00 UTC due to marketing email blast. Normal cache hit ratio would have absorbed this, but with 95% miss rate, all traffic hit DB.

  • mediumMissing per-key locking or request coalescing for cache misses

    Evidence: No per-key locking; every cache miss triggers a fresh DB query. This is a design gap that allowed stampede to overwhelm DB.

  • lowDB query performance degradation due to lock contention

    Evidence: Lock waits climbing indicates queries are blocking each other. Could be due to heavy read load or inefficient queries, but primary cause is volume.

Investigation checklist

  1. Check Redis cache miss rate for catalog keys
    redis-cli -h redis-cache -p 6379 INFO stats | grep -i 'keyspace_misses' && redis-cli -h redis-cache -p 6379 --raw GET 'catalog:item:*' | head -5

    Expected: Miss rate near 100% for catalog keys; keys missing from cache.

  2. Check Postgres active queries and wait events
    psql -h pg-primary -U app -c "SELECT pid, query_start, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;"

    Expected: Many queries waiting on 'LWLock' or 'IO' with long query_start times.

  3. Check DB connection pool utilization
    psql -h pg-primary -U app -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';" && psql -h pg-primary -U app -c "SHOW max_connections;"

    Expected: Active connections near max_connections (e.g., 100/100).

  4. Check catalog-svc error logs for 503s
    kubectl logs -n prod -l app=catalog-svc --since=15m | grep -i '503' | tail -20

    Expected: Lines indicating 'connection refused' or 'timeout' from DB.

  5. Verify cache TTL configuration
    kubectl exec -n prod deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache.*ttl'

    Expected: TTL set to 7h for catalog keys.

  6. Check Redis CPU and memory
    redis-cli -h redis-cache -p 6379 INFO cpu | grep 'used_cpu_sys' && redis-cli -h redis-cache -p 6379 INFO memory | grep 'used_memory_human'

    Expected: CPU low (~25%), memory usage normal.

Mitigation plan

  • Shed load by temporarily blocking non-critical traffic (e.g., marketing redirects) at the load balancer or via feature flag.

    Risk: May affect marketing campaigns but preserves core catalog functionality.

    Rollback: Remove the block rule or disable feature flag.

  • Manually warm the cache by running a script to re-populate catalog keys from DB, but with rate limiting to avoid further DB overload.

    Risk: If rate limit is too high, DB may become more saturated. Safer to run with low concurrency.

    Rollback: Stop the script; keys will expire naturally.

  • Increase DB connection pool size temporarily (e.g., from 100 to 200) to reduce queuing.

    Risk: May increase DB CPU further; monitor closely. Could cause OOM if memory is tight.

    Rollback: Revert to original pool size.

  • Extend cache TTL for existing keys (e.g., set TTL to 24h) to prevent immediate re-expiry.

    Risk: Does not help in-flight requests; only prevents future stampedes. May cause stale data.

    Rollback: Reset TTL to original value.

  • Enable request coalescing for cache misses (e.g., using Redis SETNX or a mutex) to deduplicate DB queries.

    Risk: Requires code change; not immediate. Could introduce latency if not implemented correctly.

    Rollback: Revert code change.

Customer impact

Users are experiencing slow loading and intermittent errors (503) when viewing product detail pages. Approximately 8% of requests are failing, and page load times have increased from under 100ms to over 15 seconds. This affects all users during the Black Friday sales event.

Postmortem draft

Postmortem: catalog-svc Cache Stampede

Summary

On Black Friday at 09:00 UTC, a cache stampede caused catalog-svc p99 latency to spike to 15s and 8% error rate. Pre-warmed Redis keys expired simultaneously, coinciding with a 12x traffic surge. DB CPU hit 100% and connection pool exhausted.

Timeline (UTC)

  • 02:00 — Pre-warm cache job runs, sets TTL=7h
  • 08:59:58 — Last cache hit
  • 09:00:00 — Cache miss rate jumps to 95%, DB queries flood
  • 09:00:02 — DB connection pool exhausted
  • 09:02 — Pager alerts received
  • 09:03 — Diagnosis: cache stampede
  • [FILL IN] — Mitigation actions taken
  • [FILL IN] — Incident resolved

Impact

  • p99 latency: 15s (baseline 80ms)
  • Error rate: 8% 503s
  • Affected users: all catalog-svc users during Black Friday
  • Revenue impact: [FILL IN]

Root Cause

Simultaneous expiration of pre-warmed cache keys at 09:00 UTC combined with 12x traffic surge caused a cache stampede. No per-key locking or request coalescing allowed all cache misses to hit the database simultaneously, overwhelming Postgres.

Detection

Alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate. Detected within 2 minutes via pager.

Response

  • [FILL IN] actions taken
  • [FILL IN] time to mitigate

What Went Well

  • Alerting worked and paged quickly
  • On-call identified stampede within 1 minute

What Went Poorly

  • No pre-warm key staggering or jitter
  • No request coalescing for cache misses
  • No load shedding mechanism for cache stampede

Action Items

  • [ ] Add jitter to cache TTLs to prevent simultaneous expiry
  • [ ] Implement request coalescing (e.g., SETNX) for cache misses
  • [ ] Add load shedding at load balancer for cache stampede scenarios
  • [ ] Review and improve pre-warm strategy
  • [ ] Add monitoring for cache miss rate spikes

Follow-ups

  • P0Add jitter to cache TTLs to prevent simultaneous key expirationcatalog-svc team
  • P0Implement request coalescing for cache misses using Redis SETNX or similarcatalog-svc team
  • P1Add load shedding at load balancer for cache stampede scenariosplatform team
  • P1Review and improve pre-warm strategy (staggered refresh, shorter TTL with background refresh)catalog-svc team
  • P1Add monitoring and alert for cache miss rate spikesobservability team
  • P2Document cache stampede runbook and mitigation stepson-call SRE