[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svc
created: 6/2/2026, 11:41:16 PM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.
Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB
Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC, so all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query
Logs:
- 08:59:58 UTC: last cache hit logged
- 09:00:00 UTC: first wave of "cache miss + DB query" log lines
- 09:00:02 UTC: DB connection pool hits ceiling, queries start queueing
On-call:
- 09:02: pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03: diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05: debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
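At 09:00 UTC, during the Black Friday traffic peak, catalog-svc suffered a cache stampede: all pre-warmed cache keys expired at the same moment, driving DB CPU to 100%, p99 latency to 15s, and the 503 rate to 8% of requests.
Severity reasoning: User-visible page-load failures (503) and extreme latency (p99 15s) hit the core product detail page and cause direct revenue loss during Black Friday, so this is a SEV1.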
deepseek-chat·prompt v1·output: zh·10874ms·1835↑ / 1581↓ tok·$0.00223
Root cause hypotheses
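- high: Cache keys all expired at 09:00 UTC while Black Friday traffic was at 12x normal, causing a cache stampede
Evidence: The Redis cache miss rate jumped from 2% to 95%, all missed keys match 'catalog:item:*', and the timing matches the pre-warm TTL of 7h (02:00 + 7h = 09:00).
- high: The DB connection pool was exhausted, and queued queries led to lock waits and 100% CPU
Evidence: Logs show the connection pool hitting its ceiling at 09:00:02 UTC with queries starting to queue; Postgres CPU is at 100% and lock waits are climbing.
- medium: The Black Friday marketing email drove a traffic burst beyond system capacity
Evidence: Traffic was 12x normal at the same moment the cache expired, and the system had no rate limiting or degradation in place.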
Investigation checklist
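- Confirm the scope of the stampede: check the TTL distribution of catalog cache keys in Redis
redis-cli --scan --pattern 'catalog:item:*' | head -20 | xargs -L1 redis-cli ttl
Expected: Few keys are returned, because expired keys no longer exist and SCAN does not list them. Keys that do appear were repopulated after 09:00 and report a positive remaining TTL. A TTL of -1 means the key has no expiry, and -2 means the key disappeared between the scan and the TTL call.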
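- Check the current number of active DB connections and their wait events
SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; SELECT pid, wait_event_type, wait_event, state, query FROM pg_stat_activity WHERE state = 'active' AND wait_event IS NOT NULL LIMIT 10;
Expected: The active connection count is close to the pool ceiling, and many active queries report a wait_event_type of 'Lock' or 'LWLock'. ('ClientRead' only means a session is waiting on the client and is not a sign of contention.)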
- Check the longest-running DB queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 5;
Expected: Most results are queries of the form 'SELECT * FROM items WHERE id = ...' with a duration above 10 seconds
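- Confirm the Redis cache hit-rate metrics
kubectl exec -n production deploy/redis-metrics -- redis-cli info stats | grep -E 'keyspace_(hits|misses)'
Expected: Both counters are cumulative and instance-wide, so take two samples a few seconds apart. Over that interval the increase in keyspace_misses far exceeds the increase in keyspace_hits, giving a miss rate of about 95%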
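Because keyspace_hits and keyspace_misses are cumulative counters, the miss rate for the incident window has to be computed from the difference between two samples. The sketch below shows that arithmetic in Python. The function names and the sample counter values are hypothetical and chosen only for illustration; the input is assumed to be the text printed by `redis-cli info stats`.

```python
def parse_info_stats(text):
    """Extract the two keyspace counters from `redis-cli info stats` output."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in ("keyspace_hits", "keyspace_misses"):
            counters[key] = int(value)
    return counters


def miss_rate(before, after):
    """Miss rate over the interval between two samples (0.0 if no lookups)."""
    a, b = parse_info_stats(before), parse_info_stats(after)
    hits = b["keyspace_hits"] - a["keyspace_hits"]
    misses = b["keyspace_misses"] - a["keyspace_misses"]
    total = hits + misses
    return misses / total if total else 0.0


# Hypothetical samples taken a few seconds apart (values are made up).
SAMPLE_BEFORE = "# Stats\r\nkeyspace_hits:1000000\r\nkeyspace_misses:20000\r\n"
SAMPLE_AFTER = "# Stats\r\nkeyspace_hits:1000500\r\nkeyspace_misses:29500\r\n"

print(f"{miss_rate(SAMPLE_BEFORE, SAMPLE_AFTER):.0%}")  # prints 95%
```

With these made-up numbers the cumulative ratio would still look healthy (about 3% misses), which is why the interval-based rate is the one to compare against the 95% figure.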
Mitigation plan
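Immediately pre-warm the Redis cache for popular items by hand, using batched queries and randomized TTLs (for example 5 to 8 hours)
Risk: Manual pre-warming adds DB load and may briefly make the problem worse. Rate-limit or shed part of the traffic first.
Rollback: Stop the pre-warm script, then let the cache expire naturally or roll back the code change
Enable rate limiting in catalog-svc: limit the /items/{id} endpoint per user or per IP and reject excess requests with 429
Risk: Some users will receive 429s, which is still better than 503s. Make sure the limits do not penalize legitimate users.
Rollback: Remove the rate-limit configuration to restore the original traffic handling
Temporarily increase the DB connection pool size (for example from 50 to 100) and restart catalog-svc to apply the new configuration
Risk: More connections may worsen DB CPU contention, since Postgres CPU is already at 100%; monitor closely.
Rollback: Restore the original pool size and restart the service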
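The rate-limiting step in the plan above can be illustrated with a token bucket. This is a minimal single-process sketch, not catalog-svc's actual implementation: the class name, the `handle_item_request` helper, and the rate and capacity values are all assumptions. A per-user or per-IP limiter would keep one bucket per client key, and a multi-threaded server would need a lock around `allow`.

```python
import time


class TokenBucket:
    """Admit `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full so the first burst is admitted
        self._now = now             # injectable clock, useful for testing
        self._last = now()

    def allow(self):
        """Refill based on elapsed time, then try to spend one token."""
        current = self._now()
        elapsed = current - self._last
        self._last = current
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_item_request(bucket):
    """Return the HTTP status to send: 200 if admitted, 429 if over the limit."""
    return 200 if bucket.allow() else 429
```

Rejecting with 429 before the request reaches the cache or the DB is what keeps the excess traffic from adding to the Postgres queue.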
Customer impact
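During Black Friday, about 8% of product detail page requests returned 503 errors, and p99 latency on the remaining requests reached 15 seconds, severely hurting the shopping experience and conversion rate. Recovery is expected to be gradual once cache pre-warming and rate limiting take effect.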
Postmortem draft
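Postmortem
Summary
During the Black Friday traffic peak at 09:00 UTC, a cache stampede in catalog-svc overloaded the DB, with p99 latency at 15s and 8% of requests returning 503.
Timeline
- 08:59:58 UTC: Last cache hit
- 09:00:00 UTC: Cache keys expire in bulk, DB queries surge
- 09:00:02 UTC: DB connection pool exhausted
- 09:02 UTC: Alerts fire
- 09:03 UTC: Diagnosis confirms cache stampede
- 09:05 UTC: On-call debates mitigation options (extend TTL, warm cache manually, or shed load)
Impact
- Users: Product detail pages unavailable or extremely slow
- Revenue: Lower conversion rate during Black Friday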
- Duration: About 15 minutes (estimate; the raw incident context does not record a recovery time)
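Root cause
- Uniform 7h cache TTL, so all keys expired at the same time
- 12x Black Friday traffic exceeded system capacity
- No cache stampede protection (such as per-key mutex locks or randomized TTLs)
What went well
- Alerts fired promptly
- Root cause was identified quickly
What went poorly
- The caching strategy did not account for traffic peaks
- No rate limiting or degradation mechanism
Action items
- Implement randomized TTLs
- Add rate limiting and degradation
- Load test before Black Friday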
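The randomized TTL action item can be sketched as a small helper. The 7h base comes from the nightly pre-warm described in the incident context; the jitter width of one hour and the function name are assumptions for illustration.

```python
import random

BASE_TTL_SECONDS = 7 * 3600   # 7h TTL used by the nightly pre-warm
JITTER_SECONDS = 1 * 3600     # assumed jitter width of +/- 1h


def jittered_ttl(base=BASE_TTL_SECONDS, jitter=JITTER_SECONDS, rng=random):
    """Return a TTL in seconds drawn uniformly from [base - jitter, base + jitter]."""
    return base + rng.randint(-jitter, jitter)
```

The returned value would be passed as the expiry when each key is written, for example as the EX argument of the Redis SET command. With a 02:00 UTC pre-warm and these values, expiries spread across 08:00 to 10:00 UTC rather than all landing on 09:00:00.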
Follow-ups
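- P0: Add a randomized TTL to cache keys (for example a base TTL of 7h ± 1h) so they do not all expire together (owner: service owner)
- P0: Implement cache stampede protection in catalog-svc: use a mutex or distributed lock to prevent concurrent DB queries for the same key (owner: service owner)
- P1: Add rate limiting and degradation policies for critical endpoints, enabled during Black Friday (owner: platform team)
- P1: Add a cache hit-rate alert that fires automatically when the miss rate exceeds a threshold (owner: on-call SRE)
- P2: Run a full end-to-end load test before Black Friday, including a simulated cache-expiry scenario (owner: service owner)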
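The mutex-based stampede protection in the P0 follow-up can be illustrated with a per-key lock, so that concurrent misses for the same key share a single DB query. This is an in-process sketch under stated assumptions: `SingleFlightCache` and its `loader` callback are hypothetical names, a plain dict stands in for Redis, and TTL handling is omitted. Across several catalog-svc replicas a distributed lock would be needed instead, for example one built on Redis SET with the NX and PX options.

```python
import threading


class SingleFlightCache:
    """Cache where concurrent misses for one key trigger only one loader call."""

    def __init__(self, loader):
        self._loader = loader          # stands in for the DB query
        self._values = {}              # stands in for Redis
        self._locks = {}               # one lock per cache key
        self._guard = threading.Lock() # protects the _locks dict itself

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        value = self._values.get(key)
        if value is not None:
            return value               # fast path: cache hit, no locking
        with self._lock_for(key):
            # Re-check: another thread may have filled the key while we waited.
            value = self._values.get(key)
            if value is None:
                value = self._loader(key)
                self._values[key] = value
            return value
```

The re-check inside the lock is the important step: threads that waited on the lock find the value already cached and return without touching the DB.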
Similar past incidents
lexical match (pg_trgm)
- 70%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 69%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 51%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 48%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 18%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.