
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svc
created: 5/25/2026, 10:23:37 PM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede caused catalog-svc p99 latency to spike from 80ms to 15s, with 8% 503 errors. Pre-warmed Redis keys with 7h TTL expired simultaneously, and 12x normal traffic from a marketing email blast flooded the DB, saturating Postgres CPU at 100% and causing connection pool exhaustion.

Severity reasoning: User-facing degradation of a critical customer-facing service (8% error rate, 15s p99 latency) with revenue impact on Black Friday. This is a partial outage rather than a full one, and no data loss has been observed.

deepseek-chat·prompt v1·output: en·12413ms

Root cause hypotheses

  • high: Cache stampede from simultaneous key expiration at 09:00 UTC

    Evidence: Redis cache miss rate for 'catalog:item:*' jumped to 95% at 09:00 UTC, coinciding with TTL expiry of pre-warmed keys (set at 02:00 UTC, TTL=7h).

  • high: DB connection pool exhaustion due to surge of uncached queries

    Evidence: Logs show DB connection pool hit ceiling at 09:00:02 UTC, queries started queueing. Postgres CPU at 100% sustained.

  • high: Black Friday traffic spike overwhelming DB capacity

    Evidence: Traffic ramp of 12x normal at 09:00 UTC from marketing email blast, combined with cache miss, caused DB overload.

  • medium: Missing per-key locking or rate limiting on cache miss

    Evidence: No per-key locking; every cache miss triggers a fresh DB query, leading to stampede. This is a design flaw.
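
The per-key locking gap can be illustrated with a minimal in-process sketch. It assumes a hypothetical `loader` callable standing in for the DB query and a plain dict standing in for the cache; `SingleFlightCache` and `demo` are illustrative names, not catalog-svc code. Concurrent misses on the same key wait on one lock, so only the first caller reaches the DB.

```python
import threading
import time
from typing import Callable, Dict, List


class SingleFlightCache:
    """In-process cache where concurrent misses on one key share a single load."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader                       # hypothetical DB read
        self._values: Dict[str, str] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()              # protects _key_locks

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key: str) -> str:
        value = self._values.get(key)
        if value is not None:
            return value                            # cache hit, no locking
        with self._lock_for(key):
            # Re-check: another thread may have filled the key while we waited.
            value = self._values.get(key)
            if value is None:
                value = self._loader(key)
                self._values[key] = value
            return value


def demo(n_threads: int = 50) -> int:
    """Fire n_threads concurrent reads of one cold key; return the DB call count."""
    calls: List[str] = []

    def slow_loader(key: str) -> str:
        time.sleep(0.05)                            # simulate a slow DB query
        calls.append(key)
        return f"row-for-{key}"

    cache = SingleFlightCache(slow_loader)
    threads = [
        threading.Thread(target=cache.get, args=("catalog:item:12345",))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls)
```

Without the per-key lock, each of the 50 threads in `demo` would call the loader, which is the behaviour described in the evidence above. This sketch only coordinates threads inside one process; coordinating several service instances needs a shared lock.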

Investigation checklist

  1. Confirm cache miss rate and key expiration pattern
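    redis-cli --raw -h <redis-host> INFO stats | grep -E 'keyspace_(hits|misses)' && redis-cli --raw -h <redis-host> TTL catalog:item:12345

    Expected: keyspace_misses climbing much faster than keyspace_hits between samples (these counters are instance-wide, not per key pattern), and TTL returning -2 (key no longer exists) for sample keys.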

  2. Check DB connection pool usage and active queries
    kubectl exec -n prod catalog-svc-0 -- psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

    Expected: Active connections near max_pool size (e.g., 100/100).

  3. Identify slow queries contributing to CPU saturation
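    kubectl exec -n prod catalog-svc-0 -- psql -c "SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"

    Expected: Top queries are SELECT from catalog_item table with high total_exec_time. Requires the pg_stat_statements extension; on PostgreSQL 12 and earlier the columns are named total_time and mean_time.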

  4. Check if DB is CPU-bound or I/O-bound
    kubectl exec -n prod catalog-svc-0 -- psql -c "SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE state = 'active';"

    Expected: Mostly NULL wait_event_type (backend is running on CPU, not waiting) or 'Lock'/'LWLock' (lock contention). A majority of 'IO' would point to an I/O bottleneck instead.

  5. Verify Redis is not the bottleneck
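    redis-cli --raw -h <redis-host> INFO CPU | grep -E '^used_cpu_(sys|user):'

    Expected: used_cpu_sys and used_cpu_user are cumulative CPU seconds, so sample twice; slow growth between samples is consistent with the low CPU usage reported (25%).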
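
For step 1, the miss rate over an interval can be derived from two `INFO stats` samples. The following is a minimal sketch; `parse_info` and `miss_rate` are hypothetical helper names, and the input is assumed to be the raw text printed by `redis-cli INFO stats`. Because `keyspace_hits` and `keyspace_misses` are instance-wide counters, a per-pattern figure such as the one for "catalog:item:*" would have to come from application-side metrics.

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Parse 'name:value' lines from redis-cli INFO output, skipping section headers."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        fields[name] = value
    return fields


def miss_rate(before: str, after: str) -> float:
    """Fraction of lookups that missed between two INFO stats samples."""
    a, b = parse_info(before), parse_info(after)
    hits = int(b["keyspace_hits"]) - int(a["keyspace_hits"])
    misses = int(b["keyspace_misses"]) - int(a["keyspace_misses"])
    total = hits + misses
    return misses / total if total else 0.0
```

Usage: capture the output of `INFO stats` twice, a few seconds apart, and pass both strings to `miss_rate`; a value near 0.95 would match the symptom reported above.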

Mitigation plan

  • Immediately shed load by rate-limiting catalog-svc at the ingress gateway to 50% of current traffic

    Risk: Some users will see 429 errors, but this prevents a total outage. The limit keeps rejecting legitimate traffic until it is reverted.

    Rollback: Gradually increase rate limit back to 100% after cache is warmed.

  • Manually warm Redis cache with a script that reads from a DB replica (not primary) to avoid further load

    Risk: If the replica is also overloaded, the warm-up could make things worse. Use a dedicated read-only replica if available.

    Rollback: Kill the warm-up script if DB load on the replica rises above 90%.

  • Extend TTL on existing keys to prevent immediate re-expiry (won't help in-flight but prevents second wave)

    Risk: Stale data may be served for longer, but acceptable during incident.

    Rollback: Set TTL back to original after incident.
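
The manual warm-up and its rollback condition could be paced roughly as follows. This is a sketch under stated assumptions: `load_batch` (read rows from the replica), `write_batch` (write them to Redis), and `db_load` (current DB utilisation as a fraction) are hypothetical callables supplied by the operator, and the batch size and pause are placeholder values, not tuned numbers.

```python
import time
from typing import Callable, Dict, Sequence


def warm_cache(
    keys: Sequence[str],
    load_batch: Callable[[Sequence[str]], Dict[str, str]],
    write_batch: Callable[[Dict[str, str]], None],
    db_load: Callable[[], float],
    batch_size: int = 100,
    pause_s: float = 0.5,
    abort_above: float = 0.90,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Warm keys in small batches; stop as soon as DB load exceeds the threshold.

    Returns the number of keys warmed before finishing or aborting.
    """
    warmed = 0
    for start in range(0, len(keys), batch_size):
        if db_load() > abort_above:
            break                       # equivalent of killing the warm-up script
        batch = keys[start:start + batch_size]
        write_batch(load_batch(batch))
        warmed += len(batch)
        sleep(pause_s)                  # leave headroom for live traffic
    return warmed
```

Checking load before every batch keeps the abort decision automatic, so the script backs off without someone having to watch a dashboard during the incident.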

Customer impact

Customers on Black Friday are seeing product detail page loads of up to 15 seconds at p99 (baseline 80ms) and intermittent 503 errors. Approximately 8% of requests fail. No data loss, but the experience is degraded for all users hitting catalog-svc.

Postmortem draft

Postmortem: Catalog-svc Cache Stampede on Black Friday

Summary

  • Date: 2025-11-28
  • Duration: 09:00 - 09:XX UTC
  • Severity: SEV1
  • Impact: 8% error rate, 15s p99 latency for catalog-svc

Timeline

  • 08:59:58 UTC: Last cache hit
  • 09:00:00 UTC: Cache miss wave begins
  • 09:00:02 UTC: DB connection pool exhausted
  • 09:02 UTC: Pager received
  • 09:03 UTC: Diagnosis: cache stampede
  • 09:05 UTC: Mitigation options under discussion (extend TTL, manual cache warm, load shedding)

Impact

  • Users: All catalog-svc users
  • Revenue: Estimated $X loss
  • Data: No loss

Root Cause

  • Pre-warmed cache keys with 7h TTL expired simultaneously at 09:00 UTC
  • Black Friday traffic 12x normal
  • No per-key locking or rate limiting on cache miss

What Went Well

  • Monitoring alerted quickly: the pager fired at 09:02 UTC, two minutes after onset
  • Redis metrics clearly showed cache miss rate

What Went Poorly

  • Cache TTL not staggered
  • No rate limiting on cache miss
  • DB replica not used for warm-up

Action Items

  • [ ] Implement per-key locking (owner: catalog team, P1)
  • [ ] Stagger cache TTLs (owner: catalog team, P1)
  • [ ] Add rate limiting at ingress (owner: platform team, P1)
  • [ ] Set up read-only replica for cache warm-up (owner: DB team, P2)
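
Staggering TTLs usually means adding random jitter to the base TTL at write time, so keys written in the same pre-warm run do not all expire in the same second. A minimal sketch, assuming a ±10% spread around the 7h TTL; `jittered_ttl`, `BASE_TTL_S`, and the spread value are illustrative choices, not settings taken from catalog-svc.

```python
import random
from typing import Optional

BASE_TTL_S = 7 * 3600  # the 7h TTL used by the nightly pre-warm


def jittered_ttl(
    base_s: int = BASE_TTL_S,
    spread: float = 0.10,
    rng: Optional[random.Random] = None,
) -> int:
    """Return base_s scaled by a random factor in [1 - spread, 1 + spread]."""
    rng = rng if rng is not None else random.Random()
    return int(base_s * (1 + rng.uniform(-spread, spread)))
```

With these values, keys written at 02:00 UTC would expire across a window of roughly 84 minutes (08:18 to 09:42 UTC) instead of all at 09:00:00. With redis-py, the result can be passed as the `ex` argument of `set`.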

Follow-ups

  • P0: Implement per-key locking (e.g., Redis SETNX) to prevent stampede (owner: catalog-svc owner)
  • P0: Stagger cache TTLs with jitter to avoid simultaneous expiry (owner: catalog-svc owner)
  • P1: Add rate limiting at ingress gateway for catalog-svc (owner: platform team)
  • P1: Provision a read-only DB replica for cache warm-up scripts (owner: DB team)
  • P2: Review and update incident response runbook for cache stampede (owner: on-call SRE)
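
A lock shared through Redis lets multiple catalog-svc instances agree on a single loader per key. The sketch below is one possible shape, not a vetted implementation: `get_with_lock` is a hypothetical helper, `client` is assumed to behave like a redis-py client created with `decode_responses=True` (`get`, `set(..., nx=True, ex=...)`, `delete`), and the lock release is a non-atomic check-then-delete that a production version would replace with a Lua script.

```python
import time
import uuid
from typing import Callable, Optional


def get_with_lock(
    client,
    key: str,
    loader: Callable[[], str],
    ttl_s: int,
    lock_ttl_s: int = 10,
    wait_s: float = 0.05,
    max_wait_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the cached value, letting only the lock holder query the DB."""
    cached = client.get(key)
    if cached is not None:
        return cached

    lock_key = f"lock:{key}"
    token = uuid.uuid4().hex
    # SET with NX and EX: succeeds for exactly one caller; expires if the holder dies.
    if client.set(lock_key, token, nx=True, ex=lock_ttl_s):
        try:
            value = loader()                     # hypothetical DB query
            client.set(key, value, ex=ttl_s)
            return value
        finally:
            # Best-effort release, NOT atomic (assumption noted in the lead-in).
            if client.get(lock_key) == token:
                client.delete(lock_key)

    # Lock held elsewhere: poll the cache instead of hitting the DB.
    waited = 0.0
    while waited < max_wait_s:
        sleep(wait_s)
        waited += wait_s
        cached = client.get(key)
        if cached is not None:
            return cached
    return None                                  # caller decides: serve stale, retry, or fail
```

Returning `None` after the wait budget leaves the fallback decision to the caller, which keeps waiting requests from piling onto the DB if the lock holder is slow.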