[Eval][v3][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svc
created: 6/3/2026, 12:09:32 AM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede caused catalog-svc p99 latency to spike to 15s (baseline 80ms) and intermittent 503 errors at 8%. The pre-warmed Redis cache for catalog items expired simultaneously (TTL=7h from 02:00 UTC) under 12x normal traffic from a marketing email blast, flooding Postgres with queries and driving CPU to 100% with lock waits. Mitigation is underway to shed load and manually warm the cache.

Severity reasoning: User-facing outage. p99 latency is 15s (baseline 80ms) and the 503 error rate is 8%, above the 1% threshold for more than 5 minutes, affecting all product detail page requests. The revenue path is broken because catalog-svc is critical for Black Friday shopping. This meets SEV1 criteria.

deepseek-chat·prompt v3·output: en·15943ms·2056↑ / 2154↓ tok·$0.00292

Root cause hypotheses

  • high: Cache stampede due to simultaneous key expiration at 09:00 UTC under high traffic

    Evidence: Redis cache miss rate for 'catalog:item:*' jumped to 95% at 09:00 UTC (baseline 2%). Last cache hit at 08:59:58 UTC. TTL=7h from 02:00 UTC pre-warm aligns with expiration at 09:00 UTC.

  • high: Postgres connection pool exhaustion from concurrent cache miss queries

    Evidence: Logs at 09:00:02 UTC show 'DB connection pool hits ceiling, queries start queueing'. Postgres CPU at 100% sustained with lock waits climbing.

  • low: Database query performance degradation due to missing index or slow query

    Evidence: No direct evidence of missing indexes; baseline p99 was 80ms. The sudden spike at 09:00 UTC correlates with cache expiration, not a schema change. Unlikely.

  • low: Network latency or resource contention on database host

    Evidence: No network errors reported. The 100% CPU is consistent with query load from the miss wave; nothing in the data points to I/O wait. Likely not a network issue.

  • low: Redis instance failure or misconfiguration causing cache misses

    Evidence: Redis CPU is 25% (healthy), and cache miss rate is specific to catalog keys. Redis itself is operational. Unlikely.

Investigation checklist

  1. Check current Redis cache miss rate for catalog keys
    redis-cli -h redis-catalog -p 6379 INFO stats | grep -E 'keyspace_misses|keyspace_hits'

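    Expected: keyspace_misses growing much faster than keyspace_hits between two samples if the stampede is ongoing (miss ratio above 90%); if the ratio is back near baseline, the stampede may have subsided. Both values are cumulative, instance-wide counters, not percentages, and they are not limited to catalog:item:* keys.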

  2. Check Postgres active queries and lock waits
    psql -h pg-catalog -U app -c "SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;"

    Expected: Many queries with state 'active' and wait_event_type 'LWLock' or 'Lock' indicating contention.

  3. Check Postgres connection pool usage
    psql -h pg-catalog -U app -c "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';"

    Expected: Count near max_connections (e.g., 200) indicating pool exhaustion.

  4. Check catalog-svc error logs for 503s
    kubectl logs -n prod -l app=catalog-svc --since=15m | grep -E 'HTTP 503|timeout|connection refused' | tail -50

    Expected: Multiple 503 lines with timestamps after 09:00 UTC.

  5. Verify cache TTL configuration
    kubectl exec -n prod deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache.*ttl'

    Expected: TTL set to 7h (25200s) for catalog items.

  6. Check if any recent deployment changed cache behavior
    kubectl rollout history -n prod deploy/catalog-svc

    Expected: No recent changes; last rollout >24h ago.
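Step 1 reads cumulative counters, so a miss ratio has to be derived from the difference between two samples. The following is a minimal Python sketch of that calculation, assuming two captures of `redis-cli INFO stats` output taken a few seconds apart. The function names and the sample numbers are illustrative only and are not part of any catalog-svc tooling.

```python
def parse_info_stats(info_text: str) -> dict:
    """Parse integer fields from INFO output, which is made of `name:value` lines."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        if value.isdigit():
            stats[name] = int(value)
    return stats


def miss_ratio(before: dict, after: dict) -> float:
    """Miss ratio over the interval between two INFO snapshots."""
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    total = hits + misses
    return misses / total if total else 0.0


# Hypothetical samples, shaped like `redis-cli INFO stats` output.
sample_1 = "# Stats\r\nkeyspace_hits:1000\r\nkeyspace_misses:20\r\n"
sample_2 = "# Stats\r\nkeyspace_hits:1050\r\nkeyspace_misses:970\r\n"
ratio = miss_ratio(parse_info_stats(sample_1), parse_info_stats(sample_2))
print(f"miss ratio over interval: {ratio:.0%}")
```

Because the counters cover the whole Redis instance, this ratio only approximates the per-pattern miss rate reported for "catalog:item:*" on the dashboard.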

Mitigation plan

  • Shed load by rate-limiting catalog-svc at ingress (e.g., nginx) to reduce DB pressure

    Risk: Some users will see 429 errors, but prevents complete outage. Blast radius: all catalog requests.

    Rollback: Remove rate limit by reverting ingress config change.

  • Manually warm Redis cache by running a script to re-populate popular catalog items with a longer TTL (e.g., 12h)

    Risk: Script may increase DB load further if not throttled. Blast radius: DB CPU may spike temporarily.

    Rollback: Stop the script; cache will expire naturally.

  • Increase Postgres connection pool size temporarily (e.g., from 100 to 200) to handle queued queries

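    Risk: Postgres CPU is already at 100%, so admitting more concurrent queries is likely to increase CPU/memory pressure and lock contention rather than drain the queue. Blast radius: DB host resources.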

    Rollback: Revert connection pool size to original value.

  • Extend cache TTL for catalog items to 12h to prevent near-future expiration during peak

    Risk: Stale data may be served for up to 12h if items change. Blast radius: catalog freshness.

    Rollback: Revert TTL to 7h.
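The manual warm carries the risk noted above of adding DB load if it is not throttled. A rough Python sketch of a throttled warm loop follows, under stated assumptions: `load_item` and `store` are hypothetical callables standing in for the DB read and the Redis write, the 50 items/second cap is an illustrative number, and the key format follows the "catalog:item:*" pattern from the incident context.

```python
import time
from typing import Callable, Iterable


def warm_cache(
    item_ids: Iterable[str],
    load_item: Callable[[str], str],         # hypothetical: reads one item from the DB
    store: Callable[[str, str, int], None],  # hypothetical: writes key, value, TTL to the cache
    ttl_seconds: int = 12 * 3600,            # 12h, as proposed in the mitigation plan
    max_per_second: float = 50.0,            # illustrative cap, tune to DB headroom
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-populate cache entries one at a time, pausing after each DB read."""
    interval = 1.0 / max_per_second
    warmed = 0
    for item_id in item_ids:
        value = load_item(item_id)
        store(f"catalog:item:{item_id}", value, ttl_seconds)
        warmed += 1
        # A fixed pause keeps the read rate at or below max_per_second.
        sleep(interval)
    return warmed
```

Running the loop sequentially keeps the script to a single DB connection, which matters while the pool is saturated. Stopping the script at any point leaves the already warmed keys in place.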

Customer impact

Users are experiencing slow page loads (up to 15 seconds) and intermittent errors when viewing product detail pages. Traffic is approximately 12x normal because of the Black Friday marketing email, and an estimated 100% of users hitting catalog pages are impacted. No ETA yet.

Postmortem draft

Summary

On Black Friday at 09:00 UTC, a cache stampede pushed catalog-svc p99 latency to 15s (baseline 80ms) and the 503 error rate to 8%. The pre-warmed Redis cache expired all at once while traffic was at 12x normal, so nearly every request fell through to Postgres.

Timeline (UTC)

  • 02:00: Pre-warm cache with TTL=7h
  • 08:59:58: Last cache hit
  • 09:00:00: Cache miss wave begins
  • 09:00:02: DB connection pool exhausted
  • 09:02: Pager triggered
  • 09:03: Diagnosis: cache stampede
  • [FILL IN] Mitigation actions taken
  • [FILL IN] Incident resolved

Impact

  • p99 latency: 15s (baseline 80ms)
  • 503 error rate: 8%
  • Affected users: all catalog page requests
  • Revenue impact: [FILL IN]

Root Cause

Cache stampede due to simultaneous expiration of pre-warmed keys at 09:00 UTC, coinciding with 12x traffic from the Black Friday marketing blast. There was no per-key locking and no staggered expiration, so every cache miss triggered its own DB query.

Detection

Alerts for DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate at 09:02. Redis miss-rate metric confirmed stampede.

Response

  • [FILL IN] Load shedding, manual cache warm, connection pool increase
  • [FILL IN] Communication

What Went Well

  • Alerting caught the issue quickly
  • Redis miss-rate metric was key to diagnosis

What Went Poorly

  • Cache TTL not staggered for high-traffic events
  • No per-key locking or circuit breaker for cache misses
  • Pre-warm timing not aligned with traffic patterns

Action Items

  • [ ] Implement staggered cache TTL with jitter
  • [ ] Add per-key locking (e.g., SETNX) for cache misses
  • [ ] Set up circuit breaker for DB queries under load
  • [ ] Review pre-warm schedule for Black Friday
  • [ ] Add load shedding automation
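The first action item, TTL with jitter, can be illustrated with a short Python sketch. It assumes the 7h base TTL (25200s) from this incident and an illustrative jitter of plus or minus 10%; the function name and the jitter fraction are hypothetical choices, not existing catalog-svc configuration.

```python
import random
from typing import Optional


def jittered_ttl(
    base_seconds: int,
    jitter_fraction: float = 0.1,  # illustrative: +/-10% of the base TTL
    rng: Optional[random.Random] = None,
) -> int:
    """Return the base TTL shifted by a uniform random offset within +/- jitter_fraction."""
    rng = rng if rng is not None else random.Random()
    spread = base_seconds * jitter_fraction
    return int(base_seconds + rng.uniform(-spread, spread))


# Each key written by the pre-warm job would get its own TTL.
ttls = [jittered_ttl(25200) for _ in range(5)]
print(ttls)
```

With a 25200s base and 10% jitter, TTLs fall between 22680s and 27720s, so keys warmed at 02:00 UTC would expire spread across roughly 08:18 to 09:42 UTC instead of all at 09:00. Jitter spreads the expiry but does not by itself stop concurrent misses on a single hot key.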

Follow-ups

  • P0: Implement staggered cache TTL with random jitter to prevent simultaneous expiration (owner: catalog-svc team)
  • P0: Add per-key locking (e.g., Redis SETNX) for cache misses to reduce DB load (owner: catalog-svc team)
  • P1: Set up circuit breaker for catalog-svc DB queries to shed load automatically (owner: platform team)
  • P1: Review pre-warm schedule and TTL for Black Friday and other peak events (owner: catalog-svc team)
  • P1: Add alert on cache miss rate spike to detect stampedes earlier (owner: on-call SRE)
  • P2: Document incident response steps for cache stampede in runbook (owner: on-call SRE)
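The per-key locking follow-up could look roughly like the Python sketch below. It assumes a redis-py style client created with `decode_responses=True` and uses `SET` with the `NX` and `EX` options, the form Redis recommends in place of plain SETNX because it sets the lock and its expiry in one command. The function name, the `lock:` key prefix, and the timing values are hypothetical.

```python
import time
import uuid
from typing import Callable


def get_or_load(
    client,                        # assumed: redis-py style client with get/set/delete
    key: str,
    load: Callable[[], str],       # hypothetical: runs the DB query for this key
    ttl_seconds: int,
    lock_seconds: int = 10,        # illustrative: bounds how long a crashed holder blocks others
    wait_seconds: float = 0.05,
    max_waits: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the cached value, letting only the lock holder query the database."""
    lock_key = f"lock:{key}"
    for _ in range(max_waits):
        cached = client.get(key)
        if cached is not None:
            return cached
        token = uuid.uuid4().hex
        # SET ... NX succeeds only for the first caller; EX expires a stale lock.
        if client.set(lock_key, token, nx=True, ex=lock_seconds):
            try:
                value = load()
                client.set(key, value, ex=ttl_seconds)
                return value
            finally:
                # Simplified, non-atomic release; a hardened version would
                # compare and delete in a single server-side script.
                if client.get(lock_key) == token:
                    client.delete(lock_key)
        # Another caller holds the lock: wait, then re-check the cache.
        sleep(wait_seconds)
    raise TimeoutError(f"gave up waiting for {key}")
```

Callers that lose the race poll the cache instead of querying Postgres, so a miss on a hot key produces one DB query rather than one per request. Raising after `max_waits` is a design choice in this sketch: it sheds load instead of falling through to the database.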