[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svc
created: 6/3/2026, 12:07:16 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.
Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB
Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query
Logs:
- 08:59:58 UTC: last cache hit logged
- 09:00:00 UTC: first wave of "cache miss + DB query" log lines
- 09:00:02 UTC: DB connection pool hits ceiling, queries start queueing
On-call:
- 09:02: pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03: diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05: debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
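During the Black Friday traffic peak at 09:00 UTC, a cache stampede in catalog-svc drove Postgres CPU to 100%, p99 latency to 15s, and caused intermittent 503s. The pre-warmed cache keys all expired at 09:00, every request fell through to the database, and the connection pool was exhausted.
Severity reasoning: user-visible page load failures (503 rate 8%) and a largely unusable core product detail endpoint, with direct impact on Black Friday revenue. Classified as SEV1.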
deepseek-chat·prompt v1·output: zh·12737ms·1835↑ / 1622↓ tok·$0.00228
Root cause hypotheses
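- high: Uniform key expiry caused the stampede. Every catalog:item:* key had TTL=7h and was warmed at 02:00, so all of them expired together at 09:00 and requests went straight to the database.
Evidence: the Redis cache hit rate dropped from 98% to 5%, and logs show a wave of 'cache miss + DB query' lines starting at 09:00:00.
- high: Database connection pool exhaustion. The burst of concurrent queries pushed Postgres connections to the pool ceiling, so new queries queued.
Evidence: Postgres CPU at 100%, lock waits climbing, and logs show the DB connection pool hitting its ceiling at 09:00:02.
- medium: Slow-query amplification. Some product detail queries may be slow because of a missing index or data skew, adding to database load.
Evidence: no slow-query log was observed directly. A p99 of 15s against an 80ms baseline is consistent with slow queries, but pool queueing alone could also explain it.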
Investigation checklist
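- Check the TTL distribution of the Redis cache keys
redis-cli -h <redis-host> -p 6379 --scan --pattern 'catalog:item:*' | head -20 | xargs -I {} redis-cli -h <redis-host> -p 6379 TTL {}
Expected: once the warmed keys have expired, the scan returns few or no original entries (TTL on a missing key returns -2, not 0); keys that remain or were re-populated show near-identical TTLs, confirming that they expire together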
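- Check current active database connections and wait events
SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;
Expected: the active connection count is close to max_connections, and many sessions report a non-null wait_event (for example wait_event_type = 'Lock')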
- Check the database slow-query log
grep 'duration' /var/log/postgresql/postgresql-*.log | tail -20
Expected: many queries taking longer than 1s, mostly SELECT * FROM catalog_items WHERE id = ?
- Confirm that the cache warm-up job ran normally
kubectl logs -n production -l app=catalog-cache-warmer --tail=50 | grep '02:00'
Expected: the warm-up job completed successfully at 02:00 and wrote a large number of keys
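- Check Redis memory usage and evictions
redis-cli -h <redis-host> -p 6379 INFO | grep -E 'used_memory|maxmemory|evicted_keys'
Expected: memory is not full and there are no large-scale evictions, which rules out Redis memory pressure as the cause of the cache misses (evicted_keys is reported in the stats section, so the full INFO output is used rather than INFO memory)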
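The TTL check above produces a list of raw integers that is hard to read at a glance. The following is a minimal sketch of how those replies could be grouped to see whether expiries cluster. The function name `summarize_ttls`, the 60-second bucket size, and the sample values are illustrative assumptions, not output from this incident.

```python
from collections import Counter


def summarize_ttls(ttls, bucket_seconds=60):
    """Group Redis TTL replies into buckets to show whether expiries cluster.

    Redis TTL returns -2 for a key that does not exist (for example, one that
    has already expired) and -1 for a key with no expiry. Both are counted
    separately from real remaining lifetimes.
    """
    summary = Counter()
    for ttl in ttls:
        if ttl == -2:
            summary["missing"] += 1
        elif ttl == -1:
            summary["no_expiry"] += 1
        else:
            start = (ttl // bucket_seconds) * bucket_seconds
            summary[f"{start}-{start + bucket_seconds - 1}s"] += 1
    return dict(summary)


# Hypothetical sample: three expired keys, three keys re-populated within
# seconds of each other, and one key without an expiry.
sample = [-2, -2, -2, 25190, 25195, 25201, -1]
for bucket, count in summarize_ttls(sample).items():
    print(bucket, count)
```

A distribution where nearly all keys land in one or two adjacent buckets suggests that the next expiry will again be simultaneous.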
Mitigation plan
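Manually warm the cache for the hottest products in Redis right away, using batched writes and a randomized TTL (for example 6-8 hours)
Risk: manual warm-up adds to the immediate database load, although the warm-up query volume can be controlled; randomized TTLs prevent another simultaneous expiry
Rollback: if database load rises further, stop the warm-up script and fall back to the existing cache policy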
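The randomized TTL in this step can be sketched as follows. This is a minimal illustration, assuming the 6-8 hour range mentioned above; the helper names `jittered_ttl` and `plan_warmup` are hypothetical, and the actual write to Redis (for example a SET with an expiry per key) is left out.

```python
import random


def jittered_ttl(min_seconds=6 * 3600, max_seconds=8 * 3600, rng=random):
    """Pick a TTL uniformly between min_seconds and max_seconds (inclusive)."""
    return rng.randint(min_seconds, max_seconds)


def plan_warmup(keys, rng=random):
    """Assign each key its own TTL so that expiries spread over the range."""
    return {key: jittered_ttl(rng=rng) for key in keys}


# Hypothetical key names following the catalog:item:* pattern.
keys = [f"catalog:item:{i}" for i in range(5)]
for key, ttl in plan_warmup(keys).items():
    print(key, ttl)
```

With a two-hour spread, keys warmed in the same batch expire gradually instead of in the same second, so the database sees a trickle of rebuilds rather than one wave.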
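Enable rate limiting at the catalog-svc entry point: reject requests beyond what the database can handle and return a 429 or a degraded page
Risk: some users will see rate-limit errors, but the database is protected from being overwhelmed completely
Rollback: remove the rate-limit rule and restore full traffic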
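One common way to implement this kind of load shedding is a token bucket. The sketch below is an assumption about how it could look, not a description of catalog-svc's actual middleware: `TokenBucket`, `handle_request`, and the rate and burst numbers are all hypothetical, and the class is not thread-safe as written.

```python
import time


class TokenBucket:
    """Allow up to rate_per_sec requests per second, with a bounded burst."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        # Refill in proportion to elapsed time, capped at the burst size.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_request(bucket, load_from_db):
    """Shed load with a 429 instead of queueing another DB query."""
    if not bucket.allow():
        return 429, "retry later"
    return 200, load_from_db()


# Hypothetical limit: 2 DB-bound requests per second, burst of 2.
bucket = TokenBucket(rate_per_sec=2, burst=2)
for _ in range(4):
    print(handle_request(bucket, lambda: "product detail"))
```

The rate would need to be set from the database's measured capacity; rejecting early keeps the connection pool from filling with queries that would time out anyway.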
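Temporarily increase the database connection pool size (for example from 50 to 100) and restart catalog-svc to apply the new configuration
Risk: too many connections can exhaust database memory, and with Postgres CPU already at 100% extra concurrency may not reduce latency; monitor CPU and memory
Rollback: restore the original pool size and restart the service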
Customer impact
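During Black Friday, users opening product detail pages saw high latency (15s) or 503 errors. About 8% of requests failed, hurting the shopping experience and order conversion.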
Postmortem draft
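Postmortem: catalog-svc cache stampede causing service degradation
Summary
At the 09:00 UTC Black Friday traffic peak, the catalog-svc cache keys expired simultaneously and triggered a stampede. The database was overloaded, p99 latency reached 15s, and the 503 rate reached 8%.
Timeline
- 02:00 UTC: cache warm-up completes, TTL=7h
- 08:59:58: last cache hit
- 09:00:00: cache keys expire in bulk, requests fall through to the DB
- 09:00:02: DB connection pool exhausted, queries queue
- 09:02: alerts fire
- 09:03: diagnosis confirms cache stampede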
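- 09:05: on-call weighs the options: extend TTL (no effect on in-flight misses), warm the cache manually (DB is the bottleneck), or shed load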
Impact
- Users: high latency and 503s on product detail pages
- Business: Black Friday revenue loss
- Duration: about 15 minutes (09:00-09:15), to be confirmed; the raw incident context ends at 09:05
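Root cause
A uniform TTL made the cache keys expire simultaneously, and there was no stampede protection (such as a rebuild mutex or randomized TTLs).
What went well
- Monitoring and alerting fired promptly
- Root cause was located quickly
What went poorly
- The cache strategy did not account for Black Friday traffic
- No cache stampede protection
Action items
- Add randomized TTLs to cache keys
- Implement a mutex around cache rebuilds
- Add a Black Friday capacity assessment
- Add a cache hit-rate alert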
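The hit-rate alert in the last action item could be computed from two snapshots of the `keyspace_hits` and `keyspace_misses` counters that Redis reports under INFO stats. The sketch below is an assumption about how that check might look; the function names are hypothetical and the 90% default mirrors the threshold proposed in the follow-ups. These counters are server-wide, so an alert scoped to catalog:item:* alone would need hit and miss counters emitted by the application instead.

```python
def hit_rate(prev, curr):
    """Hit rate over the interval between two INFO stats snapshots.

    Returns None when there was no traffic or the counters went backwards
    (for example after a Redis restart), so the caller can skip that sample.
    """
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    if hits < 0 or misses < 0 or hits + misses == 0:
        return None
    return hits / (hits + misses)


def should_alert(prev, curr, threshold=0.90):
    """True when the interval hit rate is known and below the threshold."""
    rate = hit_rate(prev, curr)
    return rate is not None and rate < threshold


# Hypothetical snapshots: a healthy interval, then a stampede-like interval.
before = {"keyspace_hits": 1000, "keyspace_misses": 20}
healthy = {"keyspace_hits": 1980, "keyspace_misses": 40}
stampede = {"keyspace_hits": 2030, "keyspace_misses": 990}
print(should_alert(before, healthy))
print(should_alert(healthy, stampede))
```

Using deltas between snapshots rather than the lifetime totals matters here: a long-running server's cumulative ratio would barely move during the first minutes of a stampede.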
Follow-ups
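- P0: Add a randomized TTL to all cache keys (base TTL ± 20%) (owner: service owner)
- P0: Implement a cache-rebuild mutex (for example Redis SET with NX and an expiry) to prevent stampedes (owner: service owner)
- P1: Add a cache hit-rate alert (fires below 90%) (owner: on-call SRE)
- P1: Run capacity tests and a warm-up drill before Black Friday (owner: platform team)
- P2: Review database connection pool size and rate-limiting strategy (owner: service owner)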
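The rebuild mutex from the P0 item can be sketched as below. This is an illustration under stated assumptions, not the service's implementation: `get_with_rebuild_lock` and the `lock:` key prefix are hypothetical, and `InMemoryStore` is a stand-in that mimics only the `get`, `set(..., nx=, ex=)`, and `delete` calls of a redis-py style client so the example runs without a server (it ignores `ex`). In production the compare-and-delete of the lock would normally be done atomically, for example in a Lua script.

```python
import time
import uuid


class InMemoryStore:
    """Stand-in for a Redis client, for running this sketch locally only."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, nx=False, ex=None):
        # nx=True means "only set if the key does not exist". ex is ignored here.
        if nx and key in self.data:
            return None
        self.data[key] = value
        return True

    def delete(self, key):
        return 1 if self.data.pop(key, None) is not None else 0


def get_with_rebuild_lock(store, key, load_from_db, ttl_seconds,
                          lock_ttl_seconds=10, retries=20, wait_seconds=0.05):
    """Return the cached value, letting one caller at a time rebuild on a miss."""
    value = store.get(key)
    if value is not None:
        return value
    lock_key = f"lock:{key}"
    token = uuid.uuid4().hex
    for _ in range(retries):
        if store.set(lock_key, token, nx=True, ex=lock_ttl_seconds):
            try:
                # Re-check: another caller may have filled the cache meanwhile.
                value = store.get(key)
                if value is None:
                    value = load_from_db(key)
                    store.set(key, value, ex=ttl_seconds)
                return value
            finally:
                # Release only a lock this caller still owns.
                if store.get(lock_key) == token:
                    store.delete(lock_key)
        # Someone else is rebuilding: wait briefly, then re-read the cache.
        time.sleep(wait_seconds)
        value = store.get(key)
        if value is not None:
            return value
    raise TimeoutError(f"cache rebuild for {key} did not finish in time")


store = InMemoryStore()
print(get_with_rebuild_lock(store, "catalog:item:42", lambda k: f"row for {k}", 25200))
```

The lock expiry bounds how long a crashed rebuilder can block others, and callers that lose the race poll the cache instead of querying the database, which is the behaviour that was missing at 09:00.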
Similar past incidents
lexical match (pg_trgm)
- 79%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 72%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 48%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 44%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 18%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted