← All incidents

[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svccreated: 6/3/2026, 12:07:36 AM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail page requests to hit the database simultaneously, spiking Postgres CPU to 100%, p99 latency to 15s, and causing intermittent 503s. The stampede was triggered by the expiration of pre-warmed cache keys (TTL=7h) at the same time as a 12x traffic surge from a marketing email blast.

Severity reasoning: User-facing outage: p99 latency 15s (baseline 80ms), 503 error rate 8% > 1% for >5 min, revenue path broken (product detail pages are critical for Black Friday sales). This meets SEV1 criteria.

deepseek-chat·prompt v2·output: en·15341ms·1805↑ / 1870↓ tok·$0.00254

Root cause hypotheses

  • highCache stampede due to simultaneous key expiration and traffic surge

    Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC, coinciding with TTL expiry of pre-warmed keys (set at 02:00 UTC, TTL=7h). DB CPU hit 100% immediately after.

  • highDB connection pool exhaustion from concurrent queries

    Evidence: Logs at 09:00:02 UTC show connection pool hitting ceiling and queries queueing. Lock waits climbing.

  • mediumMissing per-key locking or rate limiting on cache miss

    Evidence: No evidence of locking in code; every cache miss triggers a fresh DB query. This is a design gap.

  • lowDatabase query performance degradation (e.g., missing index, slow query)

    Evidence: No specific slow query identified yet; DB CPU is 100% but could be due to volume rather than individual slow queries.

  • lowNetwork latency or DNS issue causing retries

    Evidence: No network errors in logs; Redis CPU is normal (25%), ruling out network as primary cause.

Investigation checklist

  1. Check Redis cache miss rate for catalog keys
    redis-cli -h redis-cache -p 6379 INFO stats | grep -i 'keyspace_misses'

    Expected: High miss rate (>90%) confirms stampede

  2. Check DB active connections and wait events
    SELECT count(*), state, wait_event_type FROM pg_stat_activity WHERE state != 'idle' GROUP BY state, wait_event_type;

    Expected: Many connections in 'active' state with wait_event_type 'Lock' or 'IO'

  3. Check DB CPU usage
    kubectl top pod -n prod -l app=catalog-db --containers | grep -i postgres

    Expected: CPU usage near 100%

  4. Check catalog-svc error logs for 503s
    kubectl logs -n prod -l app=catalog-svc --since=10m | grep -E 'HTTP 503|timeout|connection refused' | tail -20

    Expected: Multiple 503 errors due to upstream DB timeout

  5. Verify cache key TTL and expiry pattern
    redis-cli -h redis-cache -p 6379 TTL catalog:item:popular | head -5

    Expected: TTL near 0 or negative for many keys

  6. Check if any recent deployment changed cache logic
    kubectl rollout history deployment/catalog-svc -n prod

    Expected: No recent changes; stampede is operational

Mitigation plan

  • Shed load by temporarily blocking non-critical traffic to catalog-svc (e.g., disable product recommendations via feature flag)

    Risk: Reduces feature functionality but protects DB; may impact user experience but prevents total outage

    Rollback: Re-enable feature flag once DB CPU drops below 70%

  • Increase Redis cache TTL for catalog keys to 12h to prevent immediate re-expiry

    Risk: Stale data for up to 12h; acceptable during Black Friday

    Rollback: Revert TTL to 7h after incident

  • Manually warm cache by running a script to re-populate popular keys from DB (if DB load allows)

    Risk: Running additional DB queries could worsen load; only do if DB CPU < 80%

    Rollback: Kill the warm-up script if DB CPU exceeds 90%

  • Scale up catalog-db read replicas to handle query load

    Risk: May take minutes to provision; temporary cost increase

    Rollback: Scale down after cache warms

Customer impact

Users browsing product detail pages on the website and mobile app are experiencing slow load times (up to 15 seconds) and intermittent errors (503). This affects all users during Black Friday shopping, estimated 12x normal traffic. No data loss has occurred.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 02:00: Pre-warm cache job runs, sets TTL=7h
  • 08:59:58: Last cache hit
  • 09:00:00: Cache stampede begins; all keys expire simultaneously
  • 09:00:02: DB connection pool exhausted
  • 09:02: Pager alerts triggered
  • 09:03: Diagnosis: cache stampede
  • [FILL IN]: Mitigation actions taken

Impact

  • p99 latency 15s (baseline 80ms)
  • 503 error rate 8%
  • DB CPU 100% for [duration]
  • Affected all product detail page requests

Root Cause

Simultaneous expiration of pre-warmed cache keys (TTL=7h) at peak traffic, causing a cache stampede. No per-key locking or rate limiting on cache miss.

Detection

Alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate. Redis cache miss rate metric confirmed stampede.

Response

  • [FILL IN: actions taken, e.g., load shedding, TTL increase, cache warm]

What Went Well

  • Redis cache miss metric was key to quick diagnosis
  • Feature flags allowed rapid load shedding

What Went Poorly

  • Cache TTL not staggered to avoid simultaneous expiry
  • No per-key locking or rate limiting on cache miss
  • Pre-warm job did not account for traffic surge timing

Action Items

  • [FILL IN: e.g., implement per-key locking, stagger TTLs, add circuit breaker]

Follow-ups

  • P0Implement per-key locking (e.g., SETNX) on cache miss to prevent stampedecatalog-svc team
  • P0Stagger cache TTLs with jitter to avoid simultaneous expirycatalog-svc team
  • P1Add circuit breaker for DB queries from catalog-svcplatform team
  • P1Improve pre-warm job to align with traffic patterns and add random TTL offsetdata engineering team
  • P1Set up alert on cache miss rate spike (e.g., >50% deviation from baseline)observability team
  • P2Review and update incident response runbook for cache stampede scenarioson-call SRE