[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svc
created: 6/3/2026, 12:06:58 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.
Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB
Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query
Logs:
- 08:59:58 UTC: last cache hit logged
- 09:00:00 UTC: first wave of "cache miss + DB query" log lines
- 09:00:02 UTC: DB connection pool hits ceiling, queries start queueing
On-call:
- 09:02: pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03: diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05: debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
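During the Black Friday traffic peak at 09:00 UTC, a cache stampede in catalog-svc drove database CPU to 100%, pushed p99 latency to 15 s, and caused 8% of requests to return 503. All product detail page requests fell through to PostgreSQL, exhausting the connection pool and driving up lock waits.
Severity reasoning: User-visible severe degradation (p99 of 15 s against an 80 ms baseline) with 8% of requests failing. The core product browsing flow is affected during the high-revenue Black Friday window, which meets the SEV1 definition.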
deepseek-chat·prompt v1·output: zh·14977ms·1835↑ / 1842↓ tok·$0.00252
Root cause hypotheses
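- high: The cache keys all expired together at 09:00 UTC, so every request fell through to the database at the same moment
Evidence: The Redis cache hit rate dropped from 98% to 5%. Logs show the last cache hit at 08:59:58 and a wave of 'cache miss + DB query' lines starting at 09:00:00
- high: The Black Friday marketing email drove a 12x traffic spike that exceeded the database connection pool capacity
Evidence: Traffic at 09:00 UTC was 12x normal, the DB connection pool hit its ceiling at 09:00:02, and queries started queueing
- medium: Lock waits in the database are compounding the slowdown while CPU sits at 100%
Evidence: Postgres CPU is at 100% and lock waits keep climbing, which suggests queries are blocking one another
- high: The pre-warm TTL (7 hours) was chosen without regard to the Black Friday traffic pattern
Evidence: The cache is pre-warmed at 02:00 UTC, so a 7h TTL expires at exactly 09:00, coinciding with the traffic peak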
Investigation checklist
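- Confirm the current number of database connections and active queries
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
Expected: the active connection count is close to max_connections (e.g. 200), with a large number of queries in the 'active' state
- Check for long-running queries that are blocking others
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;
Expected: several queries running for more than 10 seconds, all of them product detail SQL
- Confirm the expiry-time distribution of the Redis cache keys
redis-cli -h <redis-host> -p 6379 --scan --pattern 'catalog:item:*' | head -n 1000 | xargs -I {} redis-cli -h <redis-host> -p 6379 TTL {} | sort -n | uniq -c | sort -rn
Expected: few keys from the 02:00 warm-up remain, and the keys that do exist share nearly the same TTL, which indicates synchronized expiry. TTL returns -2 for a key that has already expired, not 0
- Check catalog-svc pod resource usage
kubectl -n production top pods -l app=catalog-svc
Expected: pod CPU and memory usage close to their limits. This command does not report request rate or error rate; read those from the service's metrics
- Check whether the database CPU usage is driven by slow queries
SELECT query, calls, total_exec_time / calls AS avg_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
Expected: the product detail query (e.g. SELECT * FROM items WHERE id = ?) has an average execution time far above baseline. On PostgreSQL 12 and earlier the column is named total_time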
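The raw TTL replies from the Redis check above are easier to read once grouped. The following is a minimal sketch, assuming the replies have already been collected into a list of integers; the function name `summarize_ttls` and the 60-second bucket width are illustrative choices, not part of the original tooling.

```python
from collections import Counter


def summarize_ttls(ttls, bucket_seconds=60):
    """Group raw Redis TTL replies into labelled buckets.

    Redis TTL returns -2 for a key that does not exist (for example,
    one that has already expired) and -1 for a key with no expiry.
    Any other value is the remaining lifetime in seconds.
    """
    summary = Counter()
    for ttl in ttls:
        if ttl == -2:
            summary["missing"] += 1
        elif ttl == -1:
            summary["no-expiry"] += 1
        else:
            start = (ttl // bucket_seconds) * bucket_seconds
            summary[f"{start}-{start + bucket_seconds - 1}s"] += 1
    return dict(summary)
```

A distribution dominated by a single bucket, or by "missing", is consistent with the synchronized-expiry hypothesis; a wide spread of buckets would argue against it.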
Mitigation plan
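Immediately enable rate limiting on catalog-svc, capping the request rate at 2x normal traffic to protect the database
Risk: some users will receive 429 rate-limit errors, but this keeps the database from collapsing entirely
Rollback: disable rate limiting: kubectl annotate ingress catalog-svc-svc nginx.ingress.kubernetes.io/limit-rps-
Warm the cache manually: run a script that queries the most popular product IDs in batches and writes the results to Redis with a TTL of about 2 hours, randomized per key so that expiry times are staggered
Risk: the warm-up queries add database load of their own, so concurrency must be limited (e.g. 10 queries per batch, 100ms between batches)
Rollback: stop the warm-up script. Delete the keys it wrote only if they hold bad data, because removing them sends that traffic back to the database: redis-cli -h <redis-host> --scan --pattern 'catalog:item:*' | xargs redis-cli -h <redis-host> UNLINK
Temporarily increase the database connection pool size (e.g. from 200 to 400) and restart catalog-svc to apply the new configuration
Risk: more connections add memory pressure on the database and, with Postgres CPU already at 100%, may increase contention rather than throughput; the new pool size must also stay below max_connections. Monitor closely
Rollback: restore the pool size and restart: kubectl set env deployment/catalog-svc DB_POOL_SIZE=200 && kubectl rollout restart deployment/catalog-svc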
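The throttled warm-up described in the plan could look roughly like the sketch below. It is not the team's actual script: `load_item` and `store_item` are hypothetical callables standing in for the database read and the Redis write, and the batch size, pause, base TTL, and jitter are the illustrative values from the plan (10 per batch, 100ms, 2 hours) plus an assumed ±20% spread.

```python
import random
import time


def warm_cache(item_ids, load_item, store_item, batch_size=10,
               pause_seconds=0.1, base_ttl=7200, jitter=0.2,
               sleep=time.sleep, rng=random):
    """Load items in small batches and cache each with a jittered TTL.

    load_item(item_id) -> value         (hypothetical DB read)
    store_item(key, value, ttl_seconds) (hypothetical Redis write)
    Returns the number of items written.
    """
    warmed = 0
    for start in range(0, len(item_ids), batch_size):
        for item_id in item_ids[start:start + batch_size]:
            value = load_item(item_id)
            # Randomize the TTL so the warmed keys do not expire together.
            ttl = int(base_ttl * rng.uniform(1 - jitter, 1 + jitter))
            store_item(f"catalog:item:{item_id}", value, ttl)
            warmed += 1
        if start + batch_size < len(item_ids):
            sleep(pause_seconds)  # pause between batches to spare the database
    return warmed
```

Passing `sleep` and `rng` as parameters keeps the function easy to exercise without real delays. Ordering `item_ids` by popularity means the keys that absorb the most traffic are restored first.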
Customer impact
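During Black Friday, users opening product detail pages experienced severe latency (p99 of 15 s against an 80 ms baseline), and about 8% of requests failed outright with 503. Users could not browse products normally, which is likely to cause cart abandonment and lost revenue. Recovery is expected within 15 minutes of the cache warm-up completing.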
Postmortem draft
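Postmortem: catalog-svc cache stampede caused service degradation on Black Friday
Summary
- Date: 2025-01-27
- Impact window: 09:00 - 09:30 UTC
- Severity: SEV1
- Owner: [service owner]
Timeline
- 08:59:58 UTC: last cache hit
- 09:00:00 UTC: cache stampede begins, DB load spikes
- 09:00:02 UTC: DB connection pool hits its ceiling, queries start queueing
- 09:02 UTC: alerts fire (DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 UTC: cache stampede confirmed from the Redis miss-rate metric
- 09:05 UTC: mitigation options debated
- 09:10 UTC: rate limiting enabled
- 09:15 UTC: manual cache warm-up started
- 09:30 UTC: service recovered
Impact
- p99 latency: 15s (baseline 80ms)
- Error rate: 8% 503
- Affected users: all Black Friday visitors requesting product detail pages
Root cause
- The cache pre-warm used a fixed 7-hour TTL, so every key expired together at 09:00
- Black Friday traffic surged 12x, and once requests fell through the cache the DB could not absorb the load
- No stampede protection was in place (e.g. a mutex per key, or staggered expiry)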
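The per-key mutex idea named in the root cause can be illustrated with an in-process singleflight: concurrent misses on the same key share one database load instead of each issuing their own. This is a sketch under stated assumptions, not catalog-svc code. The class names are hypothetical, it uses only the Python standard library, and it coordinates threads within a single process; deduplicating across pods would need a distributed lock instead.

```python
import threading


class _Call:
    """State shared by every caller waiting on one in-flight load."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Collapse concurrent loads of the same key into a single call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> in-flight _Call

    def do(self, key, loader):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            try:
                call.result = loader()  # only the leader hits the database
            except BaseException as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()  # wake the followers
        else:
            call.done.wait()  # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.result
```

On a cache miss the handler would call something like `flights.do(cache_key, lambda: query_db(item_id))`, where `query_db` is a placeholder for the existing lookup. The leader's result (or exception) is handed to every waiting caller, and the entry is removed afterwards so that later misses trigger a fresh load.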
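What went well
- Alerts fired promptly and on-call responded quickly
- Rate limiting kept the DB from collapsing entirely
What went poorly
- The cache TTL policy did not account for traffic peaks
- There was no automated protection against cache stampedes
- No load test was run before Black Friday
Action items
- [ ] Randomize cache expiry times (TTL ± 20%)
- [ ] Add stampede protection: singleflight or a distributed lock
- [ ] Run a full end-to-end load test before Black Friday
- [ ] Add an alert on cache hit rate
- [ ] Define a cache warm-up strategy that staggers expiry times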
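The "TTL ± 20%" action item amounts to scaling the base TTL by a random factor. A minimal sketch, assuming a uniform distribution is acceptable; the function name `jittered_ttl` and its parameters are illustrative.

```python
import random


def jittered_ttl(base_seconds, spread=0.20, rng=random):
    """Return base_seconds scaled by a random factor in [1 - spread, 1 + spread]."""
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    return max(1, round(base_seconds * rng.uniform(1 - spread, 1 + spread)))
```

With the current 7-hour base (25200 s), this yields TTLs between 20160 s and 30240 s, so keys warmed at 02:00 UTC would expire between roughly 07:36 and 10:24 UTC rather than all at 09:00. Jitter spreads the expiries but does not move them out of the morning peak, so it complements the warm-up schedule review rather than replacing it.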
Follow-ups
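- P0: Randomize cache key TTLs so keys do not expire together (owner: service owner)
- P0: Introduce the singleflight pattern to prevent cache stampedes (owner: service owner)
- P1: Run a full end-to-end load test before Black Friday that simulates a cache stampede (owner: platform team)
- P1: Add a cache hit-rate alert that fires below 90% (owner: on-call SRE)
- P2: Review the TTL settings of all cache pre-warm jobs to make sure expiry times are staggered (owner: service owner)
- P2: Write a cache stampede runbook covering automatic rate limiting and the warm-up script (owner: on-call SRE)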
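The hit-rate alert in the follow-ups reduces to a ratio check. The sketch below assumes the inputs are hit and miss counts for one evaluation window, for instance the deltas of Redis's cumulative `keyspace_hits` and `keyspace_misses` counters between two samples; the function names and the handling of an empty window are illustrative choices.

```python
def hit_rate(hits, misses):
    """Return the hit ratio for a window, or None if there was no traffic."""
    total = hits + misses
    return None if total == 0 else hits / total


def should_alert(hits, misses, threshold=0.90):
    """Fire when the hit ratio for the window falls below the threshold."""
    rate = hit_rate(hits, misses)
    return rate is not None and rate < threshold
```

At the 98% baseline, `should_alert(98, 2)` stays quiet, while the 5% hit rate seen during this incident, `should_alert(5, 95)`, fires.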
Similar past incidents
lexical match (pg_trgm)
- 73%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 69%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 48%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 45%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 18%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts