[Eval][v3][en] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svccreated: 6/3/2026, 12:09:32 AM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

At 09:00 UTC on Black Friday, a cache stampede caused catalog-svc p99 latency to spike to 15s (baseline 80ms) and intermittent 503 errors at 8%. The pre-warmed Redis cache for catalog items expired simultaneously (TTL=7h from 02:00 UTC) under 12x normal traffic from a marketing email blast, flooding Postgres with queries and driving CPU to 100% with lock waits. Mitigation is underway to shed load and manually warm the cache.

Severity reasoning: User-facing outage: p99 latency 15s (baseline 80ms) and 503 error rate 8% > 1% for >5 minutes, affecting all product detail page requests. Revenue path is broken as catalog-svc is critical for Black Friday shopping. This meets SEV1 criteria.

deepseek-chat·prompt v3·output: en·15943ms·2056↑ / 2154↓ tok·$0.00292

Root cause hypotheses

highCache stampede due to simultaneous key expiration at 09:00 UTC under high traffic
Evidence: Redis cache miss rate for 'catalog:item:*' jumped to 95% at 09:00 UTC (baseline 2%). Last cache hit at 08:59:58 UTC. TTL=7h from 02:00 UTC pre-warm aligns with expiration at 09:00 UTC.
highPostgres connection pool exhaustion from concurrent cache miss queries
Evidence: Logs at 09:00:02 UTC show 'DB connection pool hits ceiling, queries start queueing'. Postgres CPU at 100% sustained with lock waits climbing.
lowDatabase query performance degradation due to missing index or slow query
Evidence: No direct evidence of missing indexes; baseline p99 was 80ms. The sudden spike at 09:00 UTC correlates with cache expiration, not a schema change. Unlikely.
lowNetwork latency or resource contention on database host
Evidence: No network errors reported. CPU is 100% due to query load, not I/O wait. Likely not a network issue.
lowRedis instance failure or misconfiguration causing cache misses
Evidence: Redis CPU is 25% (healthy), and cache miss rate is specific to catalog keys. Redis itself is operational. Unlikely.

Investigation checklist

Check current Redis cache miss rate for catalog keys
```
redis-cli -h redis-catalog -p 6379 INFO stats | grep -E 'keyspace_misses|keyspace_hits'
```
Expected: keyspace_misses should be high (e.g., >90%) if stampede ongoing; if low, stampede may have subsided.

Check Postgres active queries and lock waits

psql -h pg-catalog -U app -c "SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;"

Expected: Many queries with state 'active' and wait_event_type 'LWLock' or 'Lock' indicating contention.

Check Postgres connection pool usage
```
psql -h pg-catalog -U app -c "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';"
```
Expected: Count near max_connections (e.g., 200) indicating pool exhaustion.

Check catalog-svc error logs for 503s

kubectl logs -n prod -l app=catalog-svc --since=15m | grep -E 'HTTP 503|timeout|connection refused' | tail -50

Expected: Multiple 503 lines with timestamps after 09:00 UTC.

Verify cache TTL configuration

kubectl exec -n prod deploy/catalog-svc -- cat /app/config.yaml | grep -A2 'cache.*ttl'

Expected: TTL set to 7h (25200s) for catalog items.

Check if any recent deployment changed cache behavior
```
kubectl rollout history -n prod deploy/catalog-svc
```
Expected: No recent changes; last rollout >24h ago.

Mitigation plan

Shed load by rate-limiting catalog-svc at ingress (e.g., nginx) to reduce DB pressure
Risk: Some users will see 429 errors, but prevents complete outage. Blast radius: all catalog requests.
Rollback: Remove rate limit by reverting ingress config change.
Manually warm Redis cache by running a script to re-populate popular catalog items with a longer TTL (e.g., 12h)
Risk: Script may increase DB load further if not throttled. Blast radius: DB CPU may spike temporarily.
Rollback: Stop the script; cache will expire naturally.
Increase Postgres connection pool size temporarily (e.g., from 100 to 200) to handle queued queries
Risk: May increase CPU/memory pressure on DB. Blast radius: DB host resources.
Rollback: Revert connection pool size to original value.
Extend cache TTL for catalog items to 12h to prevent near-future expiration during peak
Risk: Stale data may be served for up to 12h if items change. Blast radius: catalog freshness.
Rollback: Revert TTL to 7h.

Customer impact

Users are experiencing slow loading (up to 15 seconds) and intermittent errors when viewing product detail pages. Approximately 12x normal traffic from a Black Friday marketing campaign is affected. Estimated 100% of users hitting catalog pages are impacted. No ETA yet.

Postmortem draft

Summary

On Black Friday at 09:00 UTC, a cache stampede caused catalog-svc p99 latency to spike to 15s and 8% 503 errors. The pre-warmed Redis cache expired simultaneously under 12x traffic, flooding Postgres.

Timeline (UTC)

02:00: Pre-warm cache with TTL=7h
08:59:58: Last cache hit
09:00:00: Cache miss wave begins
09:00:02: DB connection pool exhausted
09:02: Pager triggered
09:03: Diagnosis: cache stampede
[FILL IN] Mitigation actions taken
[FILL IN] Incident resolved

Impact

p99 latency: 15s (baseline 80ms)
503 error rate: 8%
Affected users: all catalog page requests
Revenue impact: [FILL IN]

Root Cause

Cache stampede due to simultaneous expiration of pre-warmed keys at 09:00 UTC, coinciding with 12x traffic from Black Friday marketing blast. No per-key locking or staggered expiration.

Detection

Alerts for DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate at 09:02. Redis miss-rate metric confirmed stampede.

Response

[FILL IN] Load shedding, manual cache warm, connection pool increase
[FILL IN] Communication

What Went Well

Alerting caught the issue quickly
Redis miss-rate metric was key to diagnosis

What Went Poorly

Cache TTL not staggered for high-traffic events
No per-key locking or circuit breaker for cache misses
Pre-warm timing not aligned with traffic patterns

Action Items

[ ] Implement staggered cache TTL with jitter
[ ] Add per-key locking (e.g., SETNX) for cache misses
[ ] Set up circuit breaker for DB queries under load
[ ] Review pre-warm schedule for Black Friday
[ ] Add load shedding automation

Follow-ups

P0Implement staggered cache TTL with random jitter to prevent simultaneous expiration— catalog-svc team
P0Add per-key locking (e.g., Redis SETNX) for cache misses to reduce DB load— catalog-svc team
P1Set up circuit breaker for catalog-svc DB queries to shed load automatically— platform team
P1Review pre-warm schedule and TTL for Black Friday and other peak events— catalog-svc team
P1Add alert on cache miss rate spike to detect stampedes earlier— on-call SRE
P2Document incident response steps for cache stampede in runbook— on-call SRE

Similar past incidents

lexical match (pg_trgm)