← All incidents

[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svccreated: 5/25/2026, 10:24:14 PM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail cache keys to expire simultaneously, flooding Postgres with 12x normal traffic. DB CPU hit 100%, p99 latency spiked to 15s, and intermittent 503s occurred. Redis cache miss rate jumped from 2% to 95%.

Severity reasoning: User-facing outage: p99 latency 15s (baseline 80ms), 503 error rate 8% > 1% for >5 min, revenue path broken (product detail pages essential for Black Friday sales).

deepseek-chat·prompt v2·output: en·12977ms

Root cause hypotheses

  • highCache TTL of 7h caused all keys to expire at 09:00 UTC, triggering a stampede under 12x traffic.

    Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC.

  • highDB connection pool exhausted by concurrent queries from cache misses, causing queueing and latency.

    Evidence: DB CPU 100% sustained, lock waits climbing, connection pool hit ceiling at 09:00:02 UTC.

  • mediumMissing per-key locking or rate limiting on cache miss DB queries allowed all requests to hit DB simultaneously.

    Evidence: No per-key locking in code; every cache miss triggers a fresh DB query without coordination.

  • mediumPre-warm cache at 02:00 UTC with 7h TTL was insufficient for Black Friday traffic ramp starting at 09:00.

    Evidence: Marketing email blast at 09:00 UTC caused 12x traffic ramp; cache was not refreshed to cover the peak.

  • lowDatabase query performance degraded due to missing indexes or inefficient queries under load.

    Evidence: No direct evidence; DB CPU is 100% but could be from simple queries under high concurrency.

Investigation checklist

  1. Check Redis cache miss rate for catalog keys to confirm stampede.
    redis-cli -h redis-cache -p 6379 INFO stats | grep -i 'keyspace_misses'

    Expected: Keyspace miss rate > 90% (currently 95%)

  2. Check Postgres active queries and wait events.
    psql -h pg-primary -U postgres -c "SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;"

    Expected: Many queries with state 'active' and wait_event_type 'Lock' or 'IO'

  3. Check DB connection pool usage.
    kubectl exec -n prod deployment/catalog-svc -- curl -s localhost:8080/actuator/metrics/hikaricp.connections.active

    Expected: Active connections near max pool size (e.g., 50/50)

  4. Check catalog-svc error logs for 503s.
    kubectl logs -n prod -l app=catalog-svc --since=15m | grep -i '503' | head -20

    Expected: Lines with HTTP 503 status codes

  5. Verify cache TTL configuration.
    kubectl exec -n prod deployment/catalog-svc -- cat /app/config/application.yml | grep -A2 'cache'

    Expected: TTL: 7h for catalog:item keys

Mitigation plan

  • Shed load by temporarily scaling down non-critical traffic: block catalog-svc requests from internal batch jobs via network policy.

    Risk: Batch jobs may fail or retry, but no user impact. Blast radius limited to internal systems.

    Rollback: Remove network policy to restore traffic.

  • Increase DB connection pool size temporarily to reduce queueing (e.g., from 50 to 100).

    Risk: May increase DB CPU further; monitor closely. Blast radius: DB performance.

    Rollback: Revert pool size to original value.

  • Manually warm cache by running a script that re-queries DB for popular items and sets cache keys with staggered TTLs.

    Risk: Script will add load to DB; run with rate limiting (e.g., 10 queries/sec). Blast radius: DB load.

    Rollback: Kill the script if DB CPU exceeds 95%.

  • Enable per-key locking (e.g., Redis SETNX) to prevent concurrent DB queries for the same key.

    Risk: Code change requires deployment; may introduce latency if lock contention high. Blast radius: catalog-svc.

    Rollback: Revert the code change and redeploy.

Customer impact

Users on Black Friday are experiencing slow product page loads (up to 15 seconds) and intermittent errors (503) when viewing items. Approximately 12x normal traffic is affected. No ETA yet.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 08:59:58 — Last cache hit logged
  • 09:00:00 — Cache miss wave begins; DB CPU spikes
  • 09:00:02 — DB connection pool exhausted
  • 09:02 — Pager alerts triggered
  • 09:03 — Cache stampede confirmed
  • [FILL IN] — Mitigation actions taken

Impact

  • p99 latency 15s (baseline 80ms)
  • 503 error rate 8%
  • Affected all product detail page requests

Root Cause

Cache TTL of 7h caused all catalog:item keys to expire at 09:00 UTC, coinciding with Black Friday traffic ramp (12x). No per-key locking or rate limiting on cache miss DB queries led to stampede.

Detection

Alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate. Redis cache miss metric confirmed stampede.

Response

  • Shed load from batch jobs
  • Increased DB connection pool
  • Manual cache warm with rate limiting
  • Enabled per-key locking (code change)

What Went Well

  • Monitoring detected issue quickly
  • Redis metrics clearly indicated stampede

What Went Poorly

  • Cache TTL not aligned with traffic patterns
  • No per-key locking or rate limiting
  • Pre-warm strategy insufficient for Black Friday

Action Items

  • [FILL IN]

Follow-ups

  • P0Implement staggered cache TTLs to prevent simultaneous expiration.service owner
  • P0Add per-key locking (e.g., Redis SETNX) for cache miss DB queries.service owner
  • P1Add rate limiting for cache miss DB queries to prevent stampede.service owner
  • P1Review and adjust cache pre-warm schedule for peak traffic days.platform team
  • P1Add alert on cache miss rate exceeding threshold (e.g., >10%).on-call SRE
  • P2Document incident response steps for cache stampede scenarios.on-call SRE