← All incidents

[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svccreated: 6/3/2026, 12:06:05 AM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede in catalog-svc caused all product detail page requests to hit Postgres simultaneously after a pre-warmed cache (TTL=7h) expired. DB CPU hit 100%, p99 latency rose to 15s, and 8% of requests returned 503. Redis is healthy but cache miss rate is 95%.

Severity reasoning: User-facing outage with 8% error rate and 15s latency, revenue impact on Black Friday, and DB overload risking data integrity. Full outage of a critical system (catalog) for a subset of users.

deepseek-chat·prompt v1·output: en·13353ms·1424↑ / 1736↓ tok·$0.00229

Root cause hypotheses

  • highCache stampede from simultaneous TTL expiry of pre-warmed catalog cache at 09:00 UTC combined with 12x Black Friday traffic.

    Evidence: Redis cache miss rate for 'catalog:item:*' jumped from 2% to 95% at 09:00 UTC; last cache hit at 08:59:58 UTC; pre-warm TTL=7h from 02:00 UTC expires exactly at 09:00 UTC.

  • highDB connection pool exhaustion due to concurrent cache miss queries overwhelming Postgres.

    Evidence: DB CPU 100% sustained, lock waits climbing, connection pool hit ceiling at 09:00:02 UTC, queries started queueing.

  • mediumMissing per-key locking or request coalescing in cache miss handling.

    Evidence: No per-key locking; every cache miss triggers a fresh DB query, leading to duplicate queries for the same key.

Investigation checklist

  1. Check current DB connection pool usage and query queue depth.
    kubectl exec -n prod catalog-svc-0 -- pg_isready -h catalog-db -U app && kubectl exec -n prod catalog-svc-0 -- psql -h catalog-db -U app -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

    Expected: Active connections near max (e.g., 100/100) and high wait count.

  2. Identify top queries consuming DB CPU.
    kubectl exec -n prod catalog-svc-0 -- psql -h catalog-db -U app -c "SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"

    Expected: Multiple identical SELECT queries for catalog items (e.g., 'SELECT * FROM items WHERE id = $1').

  3. Confirm Redis cache miss rate and key pattern.
    kubectl exec -n prod catalog-svc-0 -- redis-cli -h catalog-redis INFO stats | grep -E 'keyspace_misses|keyspace_hits'

    Expected: keyspace_misses >> keyspace_hits, miss rate ~95%.

  4. Check if any cache keys are still valid (e.g., recently accessed items).
    kubectl exec -n prod catalog-svc-0 -- redis-cli -h catalog-redis TTL catalog:item:popular123

    Expected: TTL = -2 (expired) or very low.

  5. Verify traffic spike from marketing email blast.
    kubectl exec -n prod catalog-svc-0 -- curl -s http://catalog-svc:8080/metrics | grep 'http_requests_total' | tail -5

    Expected: Request rate 12x normal (e.g., 12000 req/s vs 1000 req/s).

Mitigation plan

  • Enable request coalescing for cache misses (e.g., use Redis SETNX with short TTL to deduplicate concurrent misses for same key).

    Risk: May add latency for the first request per key; no destructive operations.

    Rollback: Disable coalescing feature flag.

  • Manually warm the cache by running a script that re-queries DB for popular items and writes to Redis with a staggered TTL (e.g., 5-10 min jitter).

    Risk: Additional DB load during warm-up; could worsen DB CPU if not throttled.

    Rollback: Stop the warm-up script immediately; DB load will drop as cache fills.

  • Shed load by temporarily blocking non-critical traffic (e.g., disable product recommendations API) via feature flag or rate limiter.

    Risk: Degraded user experience for recommendations; no data loss.

    Rollback: Re-enable recommendations API.

Customer impact

Customers on Black Friday are experiencing 15-second load times and intermittent 503 errors when viewing product detail pages. Approximately 8% of requests fail. No data loss. ETA for full recovery: 30 minutes after mitigation actions.

Postmortem draft

Postmortem: Catalog Cache Stampede on Black Friday

Summary: At 09:00 UTC, a pre-warmed cache with TTL=7h expired simultaneously, causing a stampede to Postgres under 12x traffic. DB CPU 100%, p99 latency 15s, 8% 503s.

Timeline:

  • 02:00 UTC: Pre-warm cache completed (TTL=7h)
  • 08:59:58: Last cache hit
  • 09:00:00: Cache miss wave begins
  • 09:00:02: DB connection pool exhausted
  • 09:02: Pager received
  • 09:03: Diagnosis: cache stampede
  • 09:05: Mitigation actions started

Impact: 8% error rate, 15s latency for catalog-svc, affecting all product detail pages.

Root Cause: Cache stampede due to simultaneous TTL expiry without per-key locking or request coalescing.

What Went Well:

  • Monitoring detected the issue within 2 minutes.
  • Redis remained healthy.

What Went Poorly:

  • No request coalescing for cache misses.
  • Pre-warm TTL not staggered.
  • No circuit breaker for DB overload.

Action Items:

  • [P0] Implement request coalescing for cache misses.
  • [P1] Add jitter to cache TTLs.
  • [P1] Add circuit breaker for DB queries.
  • [P2] Review pre-warm strategy for high-traffic events.

Follow-ups

  • P0Implement request coalescing for cache misses (e.g., Redis SETNX with short TTL).service owner
  • P1Add jitter to cache TTLs to prevent simultaneous expiry.service owner
  • P1Add circuit breaker for DB queries to shed load under high latency.platform team
  • P2Review pre-warm strategy for Black Friday and other high-traffic events.on-call SRE
  • P2Add alert for cache miss rate > 50%.platform team