← All incidents

[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svccreated: 6/3/2026, 12:07:57 AM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail cache keys to expire simultaneously, flooding Postgres with 12x normal traffic. DB CPU hit 100%, p99 latency spiked to 15s, and intermittent 503s occurred. The incident is ongoing with mitigation in progress.

Severity reasoning: User-facing outage: p99 latency 15s (baseline 80ms), 503 error rate 8% > 1% for >5 min, revenue path broken (product detail pages unavailable). Meets SEV1 criteria.

deepseek-chat·prompt v2·output: en·15156ms·1805↑ / 1937↓ tok·$0.00262

Root cause hypotheses

  • highCache TTL of 7h caused all keys to expire at 09:00 UTC, triggering a stampede under 12x Black Friday traffic.

    Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC. Pre-warm at 02:00 UTC + 7h = 09:00 UTC.

  • highDB connection pool exhaustion due to concurrent queries from cache misses.

    Evidence: Logs at 09:00:02 UTC show DB connection pool hitting ceiling; queries queueing. DB CPU 100% sustained.

  • mediumMissing per-key locking or rate limiting on cache miss DB queries.

    Evidence: No evidence of locking in code; every cache miss triggers a fresh DB query without deduplication.

  • lowBlack Friday traffic ramp (12x) exceeded database read capacity even without stampede.

    Evidence: Traffic ramp from marketing email blast at 09:00 UTC; baseline DB CPU unknown but 12x likely saturates.

  • lowDatabase query performance degraded due to missing indexes or plan changes.

    Evidence: No direct evidence; DB CPU is 100% but lock waits climbing suggests contention, not slow queries.

Investigation checklist

  1. Check Redis cache miss rate for catalog keys to confirm stampede.
    redis-cli -h redis-cache -p 6379 INFO stats | grep -i 'keyspace_misses'

    Expected: Keyspace miss rate > 90% for catalog:item:* keys.

  2. Check DB active queries and wait events to identify contention.
    SELECT pid, query_start, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;

    Expected: Many queries waiting on 'LWLock' or 'IO' with long query_start times.

  3. Check DB connection pool usage.
    SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';

    Expected: Count near max_connections (e.g., 200).

  4. Check catalog-svc error logs for 503s.
    kubectl logs -n prod -l app=catalog-svc --since=15m | grep -i '503' | tail -20

    Expected: Lines showing 'HTTP 503' or 'connection refused'.

  5. Check Redis CPU and memory to rule out Redis issue.
    redis-cli -h redis-cache -p 6379 INFO CPU | grep 'used_cpu_sys'

    Expected: CPU usage around 25% (normal).

  6. Check if any catalog-svc pods are crashing or OOMKilled.
    kubectl get pods -n prod -l app=catalog-svc -o wide | grep -v Running

    Expected: All pods Running (no crashes).

Mitigation plan

  • Shed load by temporarily blocking non-critical traffic to catalog-svc (e.g., disable product detail for non-paying users via feature flag).

    Risk: Reduces revenue from affected users; may cause confusion.

    Rollback: Re-enable feature flag once DB CPU drops below 80%.

  • Increase Redis cache TTL for catalog keys to 24h and manually warm cache with a script that reads from a read replica (not primary).

    Risk: Manual warm may further load DB if not throttled; use rate limiting (e.g., 100 queries/sec).

    Rollback: Revert TTL change and clear warmed keys if DB load increases.

  • Add per-key locking (e.g., Redis SETNX) to deduplicate concurrent cache miss queries.

    Risk: Lock contention may increase latency; requires code change and deployment.

    Rollback: Revert code change and redeploy.

  • Scale up DB read replicas and redirect catalog queries to replicas.

    Risk: Replica lag may serve stale data; provisioning takes minutes.

    Rollback: Remove replicas and redirect back to primary.

Customer impact

Users are experiencing slow loading (up to 15 seconds) and intermittent errors (503) when viewing product detail pages. This affects all users browsing product pages during Black Friday. Estimated impact: 100% of catalog traffic. No ETA for full recovery.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 08:59:58 — Last cache hit logged
  • 09:00:00 — Cache miss rate spikes to 95%; DB CPU begins climbing
  • 09:00:02 — DB connection pool exhausted; queries queue
  • 09:02 — Pager alerts (DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
  • 09:03 — Diagnosis: cache stampede confirmed
  • 09:05 — Mitigation started: load shedding and manual cache warm
  • [FILL IN] — Incident resolved

Impact

  • p99 latency: 15s (baseline 80ms)
  • Error rate: 8% 503s
  • Affected users: all product detail page requests
  • Revenue impact: [FILL IN]

Root Cause

Cache stampede: all catalog cache keys expired simultaneously at 09:00 UTC due to 7h TTL from 02:00 UTC pre-warm, coinciding with 12x Black Friday traffic ramp.

Detection

  • Automated alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate
  • Manual: on-call engineer noticed Redis miss rate spike

Response

  • Load shedding via feature flag
  • Manual cache warm from read replica
  • Increased TTL to 24h
  • [FILL IN]

What Went Well

  • Alerting caught the issue within 2 minutes
  • On-call quickly identified stampede via Redis metrics

What Went Poorly

  • No per-key locking or cache stampede protection
  • TTL alignment with traffic peak not considered
  • No load testing for Black Friday traffic patterns

Action Items

  • [FILL IN]

Follow-ups

  • P0Implement per-key locking (SETNX) for cache miss DB queries.catalog-svc team
  • P0Add jitter to cache TTLs to prevent simultaneous expiration.catalog-svc team
  • P1Set up alert on Redis cache miss rate > 50%.platform team
  • P1Load test catalog-svc under 15x Black Friday traffic.QA team
  • P2Review and update runbook for cache stampede mitigation.on-call SRE
  • P2Evaluate read replica scaling for catalog queries.platform team