[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svc
created: 6/3/2026, 12:06:21 AM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, all product detail cache keys in catalog-svc expired simultaneously, and the resulting cache stampede sent nearly every product detail request to Postgres while traffic was at 12x normal. DB CPU hit 100%, p99 latency rose to 15s (baseline 80ms), and 8% of requests returned 503. Redis itself is healthy (25% CPU), but the cache miss rate for catalog:item:* keys is 95% (baseline 2%).

Severity reasoning: User-facing outage with 8% error rate and 15s latency, revenue impact on Black Friday, and potential for cascading failures to checkout service.

Root cause hypotheses

  • high: A fixed 7h TTL with no jitter, applied in bulk by the 02:00 UTC pre-warm, caused all keys to expire together at 09:00 UTC, coinciding with the Black Friday traffic ramp.

    Evidence: Cache miss rate jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC; pre-warm at 02:00 UTC + 7h = 09:00 UTC.

  • high: With no per-key locking or probabilistic early expiration, every concurrent request that missed the cache triggered its own DB query.

    Evidence: Every cache miss triggers a fresh DB query; no locking mechanism in place; DB connection pool saturated within 2 seconds (09:00:02 UTC).

  • medium: The Black Friday marketing email blast caused a sudden traffic spike that exceeded DB capacity once the cache was cold.

    Evidence: Traffic at 12x normal at 09:00 UTC; DB CPU 100% sustained; lock waits climbing.
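
The second hypothesis (no per-key locking) can be illustrated with a minimal single-flight sketch. This is not catalog-svc code: the class name, the in-memory dict standing in for Redis, and the loader callable standing in for the Postgres query are all assumptions made for illustration. It only shows the idea that one caller per key performs the DB query while concurrent callers wait and reuse the result.

```python
import threading


class SingleFlightCache:
    """In-memory stand-in for a cache with per-key locking (hypothetical helper)."""

    def __init__(self, loader):
        self._loader = loader            # function key -> value; stands in for the DB query
        self._values = {}                # stands in for Redis
        self._locks = {}                 # one lock per key
        self._guard = threading.Lock()   # protects the _locks dict

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self._values:          # fast path: cache hit
            return self._values[key]
        with self._lock_for(key):        # one thread per key at a time past this point
            if key in self._values:      # re-check: another thread may have filled it
                return self._values[key]
            value = self._loader(key)    # only the first thread reaches the loader
            self._values[key] = value
            return value
```

This version only coordinates threads inside one process. Across several catalog-svc instances the lock would have to live in shared storage, for example a short-lived Redis key created with SET ... NX, which this sketch does not attempt.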

Investigation checklist

  1. Check Redis cache miss rate for catalog keys.
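    redis-cli --raw INFO stats | grep -E 'keyspace_(hits|misses)'

    Expected: keyspace_misses / (keyspace_hits + keyspace_misses) near 0.95 during the incident window. Both counters are cumulative and instance-wide, so compare two samples taken a few seconds apart; INFO stats cannot filter by key pattern.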

  2. Check Postgres active queries and lock waits.
    SELECT pid, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;

    Expected: Many rows with wait_event_type = 'Lock' (e.g., wait_event = 'relation') or wait_event_type = 'IO' (e.g., wait_event = 'DataFileRead')

  3. Check DB connection pool saturation.
    SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

    Expected: Count at or near the catalog-svc pool ceiling, which may be lower than Postgres max_connections (e.g., 200)

  4. Check catalog-svc error rate and latency.
    kubectl -n production top pods -l app=catalog-svc

    Expected: High CPU/memory usage. kubectl top reports resource usage only, so check the 5xx rate via Prometheus: rate(catalog_request_duration_seconds_count{status=~"5.."}[5m])

  5. Verify cache key TTL distribution.
    redis-cli --scan --pattern 'catalog:item:*' | head -100 | xargs -L1 redis-cli ttl

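    Expected: Few or no keys are returned, because expired keys no longer appear in SCAN results. Keys repopulated after 09:00 UTC show a positive TTL; TTL returns -2 for a key that no longer exists and -1 for a key with no expiry.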
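
Because the INFO stats counters in step 1 are cumulative since the Redis instance started, the miss ratio for the incident window has to be computed from the difference between two samples. A small sketch of that arithmetic follows; the function names are hypothetical, and the input is assumed to be the plain text printed by redis-cli INFO stats.

```python
def parse_info_stats(text):
    """Parse the text of `redis-cli INFO stats` into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue                      # skip blanks and section headers
        name, _, value = line.partition(":")
        if value.isdigit():               # keep integer counters only
            counters[name] = int(value)
    return counters


def miss_ratio(before, after):
    """Miss ratio over the interval between two parsed INFO stats samples."""
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    total = hits + misses
    return misses / total if total else 0.0
```

The result covers every key on the instance, so it is only a proxy for the catalog:item:* miss rate.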

Mitigation plan

  • Enable rate limiting on catalog-svc to shed excess load (e.g., 50% of requests return 429).

    Risk: Some users will see errors, but prevents total outage.

    Rollback: Disable rate limiting by reverting config change.

  • Manually warm the cache by running a script that re-queries DB for popular items and sets cache with jittered TTL (e.g., 6-8h).

    Risk: Script will add load to already overloaded DB; must be throttled.

    Rollback: Kill the script; cache will repopulate naturally over time.

  • Increase DB connection pool and scale up DB instance (e.g., increase max_connections and CPU).

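    Risk: With DB CPU already at 100%, more connections are likely to add contention rather than throughput; may cause memory pressure; changing max_connections requires a Postgres restart.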

    Rollback: Revert config and scale down.
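
The manual warm-up option above depends on two things: a jittered TTL and a throttle so the script does not add to the overload. The sketch below shows both under stated assumptions. The function names, the default of 50 keys per second, and the load_from_db / set_with_ttl callables are hypothetical placeholders, not existing catalog-svc code.

```python
import random
import time


def jittered_ttl(base_seconds=7 * 3600, spread_seconds=3600, rng=random):
    """Return a TTL drawn uniformly from base +/- spread (6h to 8h by default)."""
    return base_seconds + rng.randint(-spread_seconds, spread_seconds)


def warm_cache(keys, load_from_db, set_with_ttl, max_per_second=50, sleep=time.sleep):
    """Re-populate keys at a bounded rate and return how many were warmed.

    load_from_db(key) -> value and set_with_ttl(key, value, ttl) are
    hypothetical callables supplied by the caller.
    """
    interval = 1.0 / max_per_second
    warmed = 0
    for key in keys:
        value = load_from_db(key)
        set_with_ttl(key, value, jittered_ttl())  # each key gets its own expiry
        warmed += 1
        sleep(interval)                           # throttle: at most max_per_second DB reads
    return warmed
```

Passing the most popular items first gives the largest reduction in DB load per query spent. The rate should be chosen from what Postgres can absorb at the time, which the incident data does not state.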

Customer impact

Users are seeing product detail pages take up to 15 seconds to load (p99) and intermittent 503 errors on about 8% of requests. Checkout may also be affected due to its catalog dependency. No ETA yet.

Postmortem draft

Postmortem: Catalog Cache Stampede on Black Friday

Summary: At 09:00 UTC, all catalog cache keys expired simultaneously, causing a stampede to Postgres. DB CPU 100%, p99 latency 15s, 8% 503s.

Timeline:

  • 08:59:58 UTC: Last cache hit
  • 09:00:00 UTC: Cache miss wave begins
  • 09:00:02 UTC: DB connection pool saturated
  • 09:02 UTC: Pager received
  • 09:03 UTC: Diagnosis: cache stampede
  • [Mitigation actions and times]

Impact: 8% of requests returned 503 and p99 latency reached 15s (baseline 80ms) while Black Friday traffic was at 12x normal.

Root Cause: Cache TTL of 7h without jitter caused all keys to expire at peak traffic. No per-key locking or probabilistic early expiration.

What Went Well: Redis remained healthy; monitoring alerted quickly.

What Went Poorly: Cache design lacked jitter and stampede protection; no load shedding in place.

Action Items:

  • Add jitter to cache TTL (P0)
  • Implement probabilistic early expiration (P0)
  • Add rate limiting to catalog-svc (P1)
  • Pre-warm cache with staggered TTLs (P1)
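
As a rough illustration of the probabilistic early expiration action item, the check below follows the XFetch rule: a request refreshes the value early when now - delta * beta * ln(u) >= expiry, where u is uniform on (0, 1]. The function name and the assumption that each cached value stores its expiry time and last recomputation time (delta) are illustrative, not part of the current cache design.

```python
import math
import random
import time


def should_refresh_early(expiry, delta, beta=1.0, now=None, rand=random.random):
    """XFetch test: True when this request should recompute the value before expiry.

    expiry: absolute expiry time of the cached value, in seconds
    delta:  how long the last recomputation took, in seconds
    beta:   values above 1.0 favour earlier refreshes, below 1.0 later ones
    """
    if now is None:
        now = time.time()
    u = 1.0 - rand()                      # in (0, 1], so log(u) is defined
    return now - delta * beta * math.log(u) >= expiry
```

Because each request draws its own u, refreshes spread out over the period shortly before expiry instead of all landing at the same instant, and slower queries (larger delta) start refreshing earlier.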

Follow-ups

  • P0: Add jitter to cache TTL (e.g., 6-8h random) for all catalog keys. Owner: service owner
  • P0: Implement probabilistic early expiration (e.g., XFetch algorithm) to prevent stampedes. Owner: service owner
  • P1: Add rate limiting to catalog-svc with circuit breaker for DB overload. Owner: on-call SRE
  • P1: Review and improve cache pre-warming strategy to avoid simultaneous expiry. Owner: platform team
  • P2: Set up alert on cache miss rate > 50% to detect stampedes early. Owner: on-call SRE
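
For the P1 circuit breaker follow-up, a minimal sketch of the state machine is shown here. The class, its thresholds, and the cool-down value are assumptions for illustration; a production breaker would also need thread safety and a limit on probe requests after the cool-down.

```python
import time


class CircuitBreaker:
    """Minimal breaker: open after N consecutive failures, retry after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None            # None means the breaker is closed

    def allow(self):
        """Return True if a DB call may be attempted now."""
        if self._opened_at is None:
            return True
        # After the cool-down, let calls through again to probe the DB.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None            # close the breaker

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()   # open (or re-open) the breaker
```

When allow() returns False, catalog-svc would skip the DB query and return a fast error or a stale value, so that Postgres gets room to recover instead of queueing more work.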