[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svc
created: 6/3/2026, 12:06:38 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.
Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB
Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query
Logs:
- 08:59:58 UTC: last cache hit logged
- 09:00:00 UTC: first wave of "cache miss + DB query" log lines
- 09:00:02 UTC: DB connection pool hits ceiling, queries start queueing
On-call:
- 09:02: pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03: diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05: debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
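Synchronized cache expiry at the Black Friday traffic peak caused a cache stampede: all product detail page requests fell through to Postgres, driving CPU to 100%, p99 latency to 15 seconds, and intermittent 503 errors.
Severity reasoning: Page loads failed visibly for users (503 rate of 8%), p99 latency jumped from 80ms to 15s, and the core product browsing flow was affected, which translates directly into lost revenue on Black Friday.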
deepseek-chat·prompt v1·output: zh·11840ms·1835↑ / 1389↓ tok·$0.00202
Root cause hypotheses
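- [high] Synchronized cache key expiry caused the stampede: the pre-warmed catalog:item:* cache has TTL=7h, was written at 02:00, and expired all at once at 09:00, just as 12x traffic arrived.
Evidence: The Redis cache hit rate dropped from 98% to 5%, the logs show a wave of 'cache miss + DB query' lines starting at 09:00:00, and DB CPU went to 100% immediately.
- [high] DB connection pool exhaustion: the cache misses made every request issue a DB query at the same time; once the pool was full, queries queued and lock waits grew.
Evidence: The DB connection pool hit its ceiling, lock-wait metrics climbed, and the logs show queries queueing from 09:00:02.
- [medium] Slow-query amplification: the product detail page query is not optimized and becomes the bottleneck under high concurrency.
Evidence: DB CPU is at 100%, but no specific slow-query logs were provided; this needs further confirmation.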
Investigation checklist
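- Confirm the TTL distribution of the Redis cache keys
redis-cli -h <redis-host> -p 6379 TTL catalog:item:12345 # check the TTL of a representative key
Expected: returns -2 (the key no longer exists, i.e. it has expired) or a value close to 0, confirming the bulk expiry
- Inspect the currently active DB queries and lock waits
SELECT pid, state, wait_event_type, query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;
Expected: a large number of SELECT queries with wait_event_type 'Lock' or 'IO'
- Check DB connection pool utilization
SELECT count(*) FROM pg_stat_activity; -- compare against max_connections
Expected: close to max_connections (e.g. 500), with many idle in transaction or active connections
- Confirm the Redis cache miss-rate trend
redis-cli INFO stats | grep -E 'keyspace_(hits|misses)' # compute the hit rate
Expected: the ratio of keyspace_misses to keyspace_hits is far above normal. Both counters are cumulative and server-wide, so take two samples and compare the deltas.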
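Because keyspace_hits and keyspace_misses are cumulative counters, the miss rate for the incident window has to be computed from the difference between two samples. The sketch below shows one way to do that arithmetic; the function names and the sample numbers are illustrative assumptions, not values taken from this incident.

```python
def parse_info_stats(text):
    """Parse 'key:value' lines from `redis-cli INFO stats` output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.isdigit():
            stats[key] = int(value)
    return stats


def interval_miss_rate(before, after):
    """Miss rate over the interval between two INFO samples (server-wide, not per key pattern)."""
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    total = hits + misses
    return misses / total if total else 0.0


# Hypothetical samples taken a few seconds apart (made-up numbers).
sample_before = "# Stats\nkeyspace_hits:1000\nkeyspace_misses:20\n"
sample_after = "# Stats\nkeyspace_hits:1050\nkeyspace_misses:970\n"

rate = interval_miss_rate(parse_info_stats(sample_before), parse_info_stats(sample_after))
print(f"miss rate over interval: {rate:.0%}")
```

Note that these counters cover the whole Redis instance, so a per-pattern figure such as the one for catalog:item:* would have to come from application-side metrics.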
Mitigation plan
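Enable rate limiting immediately: throttle product detail page requests at the catalog-svc entry point so the DB is not overwhelmed.
Risk: Some user requests are rejected, but a full outage is avoided.
Rollback: Remove the rate-limit rule or raise the threshold back to its normal level.
Warm the cache manually: use a script to query popular products in batches and repopulate the Redis cache with a randomized TTL (e.g. 6-8 hours).
Risk: The warm-up queries themselves add DB load, so run them only after rate limiting has taken effect.
Rollback: Stop the warm-up script and let the cache entries expire naturally.
Add DB read replicas: temporarily add Postgres read-only replicas and route query traffic to them.
Risk: Replication lag can cause inconsistent reads; confirm that the application can tolerate this.
Rollback: Remove the added replicas and return to the single-primary configuration.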
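The rate-limiting step above could be implemented in many ways; the following is a minimal token-bucket sketch, assuming the limiter runs in-process in front of the product detail handler. The class name, the rate and burst values, and the injectable clock are all illustrative assumptions, and the sketch is not thread-safe as written.

```python
import time


class TokenBucket:
    """Allow up to `rate_per_s` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)  # start full so normal traffic is not rejected at startup
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be shed (e.g. with a 503)."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate_per_s=2, burst=2, clock=lambda: fake_now[0])
decisions = [bucket.allow() for _ in range(3)]  # third request exceeds the burst
fake_now[0] = 0.5                               # half a second later one token has refilled
decisions.append(bucket.allow())
print(decisions)
```

Rejecting early at the service entry point keeps the shed requests from ever reaching the DB connection pool, which is the resource that saturated here.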
Customer impact
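During Black Friday, users visiting product detail pages saw high latency (p99 of 15 seconds) or 503 errors; about 8% of requests failed, hurting the shopping experience and conversion rate.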
Postmortem draft
Postmortem: catalog-svc cache stampede
Summary
At 09:00 UTC on Black Friday, synchronized cache expiry overloaded the DB; p99 latency reached 15s and the 503 rate reached 8%.
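Timeline
- 02:00 UTC: cache pre-warmed, TTL=7h
- 09:00 UTC: cache expires, traffic at 12x normal
- 09:02 UTC: alerts fire
- 09:03 UTC: diagnosis confirms the cache stampede
- 09:05 UTC: mitigation options debated
- 09:10 UTC: rate limiting applied
- 09:15 UTC: cache warmed manually
- 09:30 UTC: service recovered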
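Impact
- Users: page loads failed or timed out
- Revenue: estimated loss of X USD
Root cause
Cache keys expired at the same moment, with no stampede protection in place (such as a mutex lock or randomized TTLs).
What went well
- Monitoring and alerting fired promptly
- The team diagnosed the problem quickly
What went poorly
- The cache design did not account for expiry storms
- There was no automatic rate limiting
Action items
- [ ] Add a random offset to cache TTLs
- [ ] Implement a mutex lock for cache rebuilds
- [ ] Add automatic rate limiting and degradation
- [ ] Load-test the cache failure scenarios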
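The first action item, a random TTL offset, can be sketched as below. The 7-hour base TTL comes from the incident context and the ±30% spread from the follow-up list; the function name and the uniform distribution are assumptions made for illustration.

```python
import random


def jittered_ttl(base_s, spread=0.3, rng=random):
    """Return base_s scaled by a uniform random factor in [1 - spread, 1 + spread]."""
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    factor = rng.uniform(1 - spread, 1 + spread)
    return max(1, int(base_s * factor))  # never return a zero or negative TTL


BASE_TTL_S = 7 * 3600  # the 7h TTL used by the nightly pre-warm job

# Seeded generator so the example is repeatable; production code would use the default.
rng = random.Random(0)
ttls = [jittered_ttl(BASE_TTL_S, rng=rng) for _ in range(1000)]
print(min(ttls), max(ttls))
```

With a ±30% spread on a 7-hour base, expiries are spread over roughly 4.2 hours (about 4.9h to 9.1h after warm-up) instead of landing on a single second.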
Follow-ups
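- P0 Add a random TTL offset (±30%) to all cache keys to avoid synchronized expiry (owner: service owner)
- P0 Implement a cache-rebuild mutex lock (e.g. SETNX) to prevent concurrent DB queries for the same key (owner: service owner)
- P1 Add automatic rate limiting and degradation that trigger when the cache hit rate falls below a threshold (owner: platform team)
- P1 Optimize the product detail page query, adding indexes or summary tables (owner: service owner)
- P2 Run a chaos engineering drill for the cache stampede scenario (owner: SRE team)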
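The cache-rebuild lock in the second P0 item might look like the sketch below. It assumes a client whose get, set(..., nx=True, ex=...) and delete behave like redis-py's with decode_responses=True; FakeRedis is a hypothetical in-memory stand-in (it ignores expiry) so the example runs without a server, and the lock: key prefix, timeouts, and retry counts are illustrative.

```python
import time
import uuid


class FakeRedis:
    """In-memory stand-in for the few client calls used below; expiry (ex) is ignored."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.data:
            return None  # mirrors redis-py, which returns None when NX blocks the write
        self.data[key] = value
        return True

    def delete(self, key):
        return 1 if self.data.pop(key, None) is not None else 0


def get_with_rebuild_lock(client, key, load_from_db, ttl_s=25200,
                          lock_ttl_s=10, retries=20, retry_delay_s=0.05):
    """Return the cached value, letting only one caller rebuild it on a miss."""
    value = client.get(key)
    if value is not None:
        return value
    lock_key = f"lock:{key}"
    token = uuid.uuid4().hex
    for _ in range(retries):
        # SET with NX and EX: only one caller wins; the expiry covers a crashed lock holder.
        if client.set(lock_key, token, nx=True, ex=lock_ttl_s):
            try:
                value = client.get(key)  # re-check in case another caller just filled it
                if value is None:
                    value = load_from_db(key)
                    client.set(key, value, ex=ttl_s)
                return value
            finally:
                # Release only our own lock. This check-then-delete is not atomic;
                # a production version would do it in a server-side script.
                if client.get(lock_key) == token:
                    client.delete(lock_key)
        time.sleep(retry_delay_s)
        value = client.get(key)
        if value is not None:
            return value
    raise TimeoutError(f"cache rebuild for {key} did not finish in time")


db_calls = []

def load_item(key):
    db_calls.append(key)  # stands in for the real Postgres query
    return "item-payload"

cache = FakeRedis()
first = get_with_rebuild_lock(cache, "catalog:item:1", load_item)
second = get_with_rebuild_lock(cache, "catalog:item:1", load_item)
print(first, second, len(db_calls))
```

Callers that lose the race poll the cache instead of querying the DB, so a miss on a hot key costs one DB query rather than one per concurrent request.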
Similar past incidents
lexical match (pg_trgm)
- 65%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 60%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 50%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 45%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 18%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts