← All incidents

[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svccreated: 6/2/2026, 11:41:01 PM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail page requests to hit Postgres simultaneously after a pre-warmed cache (TTL=7h) expired. Combined with 12x normal traffic from a marketing email blast, DB CPU reached 100%, p99 latency spiked to 15s, and 8% of requests returned 503. Redis cache miss rate jumped from 2% to 95%.

Severity reasoning: User-facing outage with 8% error rate and 15s latency, revenue impact on Black Friday, and potential for cascading failures to downstream services (e.g., checkout).

deepseek-chat·prompt v1·output: en·11678ms·1424↑ / 1826↓ tok·$0.00239

Root cause hypotheses

  • highCache TTL expiration at 09:00 UTC caused all keys to expire simultaneously, triggering a stampede.

    Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC.

  • highDB connection pool exhaustion due to concurrent queries from cache misses.

    Evidence: DB CPU 100%, lock waits climbing, connection pool hitting ceiling at 09:00:02 UTC.

  • mediumBlack Friday traffic spike (12x normal) overwhelmed the DB even without cache stampede.

    Evidence: Marketing email blast at 09:00 UTC; traffic ramp confirmed.

  • highMissing per-key locking or request coalescing in catalog-svc.

    Evidence: No per-key locking; every cache miss triggers a fresh DB query.

  • lowDatabase query performance degradation due to missing index or plan change.

    Evidence: No evidence of plan changes; baseline latency was 80ms before stampede.

Investigation checklist

  1. Check Redis cache miss rate for catalog:item:* keys.
    redis-cli INFO stats | grep keyspace_misses; redis-cli --raw KEYS 'catalog:item:*' | wc -l

    Expected: High miss rate (95%) and low key count (should be ~12x normal).

  2. Check Postgres active queries and lock waits.
    SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;

    Expected: Many queries waiting on 'relation' or 'tuple' locks, all selecting from catalog_items.

  3. Check DB connection pool usage.
    SELECT count(*) FROM pg_stat_activity;

    Expected: Count near max_connections (e.g., 200).

  4. Check catalog-svc error logs for 503s.
    kubectl logs -l app=catalog-svc --tail=1000 | grep '503' | head -20

    Expected: Lines like 'http: response 503' with 'connection refused' or 'timeout'.

  5. Verify cache TTL configuration.
    kubectl exec deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache'

    Expected: TTL: 7h, pre-warm at 02:00 UTC.

  6. Check if any feature flag can disable catalog-svc or serve stale cache.
    kubectl exec deploy/catalog-svc -- curl -s localhost:8080/health | jq '.features'

    Expected: Feature flags like 'stale_cache_enabled' or 'catalog_fallback'.

Mitigation plan

  • Enable stale cache serving (serve stale data while refreshing in background) if feature exists.

    Risk: Users may see slightly outdated product info (price, stock).

    Rollback: Disable stale cache feature flag.

  • Manually warm Redis cache with a script that reads from a DB replica (not primary) with rate limiting.

    Risk: Script could overload DB replica; must use throttling (e.g., 100 queries/sec).

    Rollback: Kill the script; revert to current state.

  • Increase DB connection pool and add read replicas for catalog queries.

    Risk: Temporary increase may mask underlying issue; replicas may lag.

    Rollback: Reduce pool size and remove replicas after incident.

  • Shed load by rate-limiting catalog-svc at ingress (e.g., 50% of traffic) or redirect to a static fallback page.

    Risk: Some users will see degraded experience or error page.

    Rollback: Remove rate limit or redirect.

Customer impact

All users browsing product detail pages on Black Friday are experiencing 15-second load times and intermittent errors (8% 503s). Checkout may be indirectly affected if catalog-svc is a dependency. No data loss expected.

Postmortem draft

Postmortem: Catalog Cache Stampede on Black Friday

Summary: At 09:00 UTC, a pre-warmed cache with 7h TTL expired simultaneously, causing a stampede to Postgres. Combined with 12x traffic, DB CPU hit 100%, p99 latency 15s, 8% 503s.

Timeline:

  • 02:00 UTC: Cache pre-warmed (TTL=7h)
  • 09:00 UTC: Cache expires, traffic spike from marketing email
  • 09:02 UTC: Pager alerts
  • 09:03 UTC: Diagnosis: cache stampede
  • 09:05 UTC: Mitigation actions started

Impact: 8% error rate, 15s latency for ~30 minutes. Revenue impact TBD.

Root Cause: Cache TTL expiration without jitter or per-key locking, combined with traffic spike.

What Went Well:

  • Monitoring detected issue within 2 minutes
  • Redis metrics clearly showed cache miss rate

What Went Poorly:

  • No stale cache serving
  • No request coalescing
  • TTL alignment with traffic spike

Action Items:

  • [ ] Add jitter to cache TTL (P0)
  • [ ] Implement per-key locking / request coalescing (P0)
  • [ ] Enable stale cache serving (P1)
  • [ ] Add load shedding at ingress (P1)
  • [ ] Review Black Friday traffic patterns (P2)

Follow-ups

  • P0Add random jitter to cache TTL to prevent simultaneous expiration.service owner
  • P0Implement per-key locking or request coalescing for cache misses.service owner
  • P1Enable stale cache serving with background refresh.service owner
  • P1Add ingress rate limiting and load shedding for catalog-svc.platform team
  • P2Review and adjust pre-warm schedule to avoid peak traffic alignment.service owner
  • P1Set up alert on cache miss rate spike.on-call SRE
  • P2Document incident response steps for cache stampede.on-call SRE