← All incidents

[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svccreated: 6/3/2026, 12:08:14 AM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail cache keys to expire simultaneously, flooding Postgres with 12x normal traffic. DB CPU hit 100%, p99 latency spiked to 15s, and intermittent 503s occurred. Redis cache miss rate jumped from 2% to 95%.

Severity reasoning: User-facing outage: p99 latency 15s (baseline 80ms), 503 error rate 8% > 1% for >5 min, revenue path broken (Black Friday traffic). Meets SEV1 criteria.

deepseek-chat·prompt v2·output: en·11826ms·1805↑ / 1529↓ tok·$0.00217

Root cause hypotheses

  • highCache TTL of 7h caused all keys to expire at 09:00 UTC, triggering a stampede under 12x traffic.

    Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC.

  • highDB connection pool exhausted by concurrent queries from cache misses, leading to queueing and latency.

    Evidence: DB CPU 100% sustained, lock waits climbing; connection pool hit ceiling at 09:00:02 UTC.

  • mediumMissing per-key locking or request coalescing allowed multiple requests to query DB for the same key.

    Evidence: No per-key locking in code; every cache miss triggers a fresh DB query.

  • lowBlack Friday traffic ramp (12x) exceeded system capacity even with cache hits.

    Evidence: Traffic ramp from marketing email blast; baseline cache hit rate would still result in high load.

Investigation checklist

  1. Check Redis cache miss rate for catalog keys.
    redis-cli -h redis-cache -p 6379 INFO stats | grep -i 'keyspace_misses'

    Expected: High miss rate (>90%) confirms stampede.

  2. Check Postgres active queries and locks.
    SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;

    Expected: Many queries waiting on locks or running long (>1s).

  3. Check catalog-svc error rate and latency.
    kubectl logs -n prod -l app=catalog-svc --since=10m | grep -E '503|5[0-9][0-9]' | tail -20

    Expected: High rate of 503 responses.

  4. Verify cache TTL configuration.
    kubectl exec -n prod deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache.*ttl'

    Expected: TTL=7h for catalog:item keys.

  5. Check if per-key locking exists in code.
    grep -r 'lock.*cache' /path/to/catalog-svc/src/

    Expected: No locking mechanism found.

Mitigation plan

  • Immediately increase Redis cache TTL for catalog:item keys to 24h to prevent near-term re-expiry.

    Risk: Stale data served for up to 24h; acceptable during Black Friday.

    Rollback: Revert TTL to 7h after incident.

  • Manually warm cache by running a script to re-query popular items from DB and set in Redis with new TTL.

    Risk: Additional load on DB; must throttle to avoid worsening stampede.

    Rollback: Stop the warm script; DB load will decrease as cache fills naturally.

  • Enable request coalescing for cache misses (e.g., only one request per key queries DB, others wait).

    Risk: Requires code change; cannot deploy quickly. Consider feature flag.

    Rollback: Disable feature flag.

Customer impact

Customers browsing product details on Black Friday experience slow page loads (up to 15 seconds) and intermittent errors (503). All users are affected. Estimated impact: 100% of traffic during the incident.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 08:59:58 — Last cache hit
  • 09:00:00 — Cache miss wave begins
  • 09:00:02 — DB connection pool exhausted
  • 09:02 — Pager alerts
  • 09:03 — Cache stampede confirmed
  • [FILL IN] — Mitigation actions

Impact

  • p99 latency 15s (baseline 80ms)
  • 503 error rate 8%
  • DB CPU 100%

Root Cause

Cache TTL of 7h caused all catalog:item keys to expire at 09:00 UTC, coinciding with 12x Black Friday traffic, leading to a stampede.

Detection

Alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate.

Response

  • Increased TTL to 24h
  • Manual cache warm (throttled)
  • [FILL IN]

What Went Well

  • Redis remained healthy
  • On-call quickly identified stampede

What Went Poorly

  • No per-key locking
  • Cache TTL not staggered
  • No pre-warming for Black Friday

Action Items

  • [FILL IN]

Follow-ups

  • P0Implement per-key locking or request coalescing for cache misses.catalog-svc team
  • P1Stagger cache TTLs to prevent mass expiration (e.g., add jitter).platform team
  • P1Add alert on cache miss rate spike >50%.observability team
  • P1Pre-warm cache before major traffic events (Black Friday).SRE team
  • P2Review DB connection pool sizing and add queueing limits.platform team