
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning

service: catalog-svc · created: 6/3/2026, 12:06:58 AM

Raw incident context

Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.

Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB

Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query

Logs:
- 08:59:58 UTC — last cache hit logged
- 09:00:00 UTC — first wave of "cache miss + DB query" log lines
- 09:00:02 UTC — DB connection pool hits ceiling, queries start queueing

On-call:
- 09:02 — pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03 — diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05 — debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load

Summary

SEV1

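During the Black Friday traffic peak at 09:00 UTC, a cache stampede in catalog-svc drove database CPU to 100%, p99 latency to 15 s, and 8% of requests to 503. All product detail page requests fell through to PostgreSQL, exhausting the connection pool while lock waits climbed.

Severity reasoning: Severe user-visible degradation (p99 of 15 s against an 80 ms baseline), 8% of requests failing, impact on the core product-browsing path, and timing within the high-revenue Black Friday window together meet the SEV1 definition.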

deepseek-chat·prompt v1·output: zh·14977ms·1835↑ / 1842↓ tok·$0.00252

Root cause hypotheses

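  • high · Cache keys all expired together at 09:00 UTC, so every request fell through to the database at once

    Evidence: Redis cache hit rate dropped from 98% to 5% (miss rate from 2% to 95%); logs show the last cache hit at 08:59:58 and a wave of 'cache miss + DB query' lines starting at 09:00:00

  • high · The Black Friday marketing email drove a 12x traffic surge that exceeded database connection pool capacity

    Evidence: Traffic at 09:00 UTC was 12x normal; the DB connection pool hit its ceiling at 09:00:02 and queries began queueing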

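  • medium · Database lock waits compounded the slowdown while Postgres CPU was pinned at 100%

    Evidence: Postgres CPU at 100% with lock waits still climbing, which suggests queries were blocking one another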

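  • high · The pre-warm TTL (7 hours) was poorly chosen and did not account for the Black Friday traffic pattern

    Evidence: The cache is pre-warmed at 02:00 UTC with TTL 7h, so it expires at exactly 09:00, coinciding with the traffic peak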

Investigation checklist

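  1. Confirm the current database connection count and active queries
    SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

    Expected: Active connection count close to max_connections (e.g. 200), with many queries in the 'active' state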

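  2. Check for long-running queries that may be blocking others
    SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;

    Expected: Several queries running longer than 10 seconds, with product detail SQL as the query text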

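  3. Confirm the expiry-time distribution of the Redis cache keys
    redis-cli -h <redis-host> -p 6379 --scan --pattern 'catalog:item:*' | xargs -I {} redis-cli -h <redis-host> -p 6379 TTL {} | sort | uniq -c | sort -rn

    Expected: Few keys remain under the pattern, and those that were repopulated share nearly identical TTLs, which points to synchronized expiry. TTL returns -2 for a key that has already expired, not 0.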

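  4. Check catalog-svc pod resource usage (request rate and error rate come from the service's own metrics, not from this command)
    kubectl -n production top pods -l app=catalog-svc

    Expected: Pod CPU and memory usage close to their limits, consistent with an abnormally high request rate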

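  5. Check whether database CPU usage is driven by slow queries
    SELECT query, calls, total_time / calls AS avg_time, rows FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;

    Expected: Product detail queries (e.g. SELECT * FROM items WHERE id = ?) show an average execution time far above baseline. On PostgreSQL 13 and later the column is total_exec_time rather than total_time.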
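
For step 3, the raw TTL replies are easier to read once they are grouped into buckets. The following is a minimal sketch of one way to do that, not part of the original checklist. The function name `ttl_histogram` and the sample values are hypothetical, and it assumes the TTL replies have already been collected as integers.

```python
from collections import Counter
from typing import Dict, Iterable


def ttl_histogram(ttls: Iterable[int], bucket_seconds: int = 60) -> Dict[str, int]:
    """Group Redis TTL replies into fixed-width buckets.

    Redis TTL replies -2 for a key that does not exist and -1 for a key
    without an expiry; both are counted under their own labels.
    """
    counts: Counter = Counter()
    for ttl in ttls:
        if ttl == -2:
            counts["missing"] += 1
        elif ttl == -1:
            counts["no-expiry"] += 1
        else:
            start = (ttl // bucket_seconds) * bucket_seconds
            counts[f"{start}-{start + bucket_seconds - 1}s"] += 1
    return dict(counts)


# Hypothetical sample: four keys repopulated at almost the same moment,
# two keys already gone, one key on a different schedule.
sample = [7195, 7198, 7199, 7190, -2, -2, 3600]
for label, count in sorted(ttl_histogram(sample).items()):
    print(label, count)
```

A single dominant bucket suggests the keys were written together and will expire together again, which is the pattern the stampede hypothesis predicts.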

Mitigation plan

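  • Immediately enable rate limiting for catalog-svc, capping the request rate at 2x normal traffic to protect the database

    Risk: Some users will receive 429 rate-limit errors, but this keeps the database from collapsing entirely

    Rollback: Disable rate limiting: kubectl annotate ingress catalog-svc-svc nginx.ingress.kubernetes.io/limit-rps-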

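  • Warm the cache manually: run a script that queries popular product IDs in batches and writes the results to Redis with a 2-hour TTL (staggering the expiry times)

    Risk: The warm-up queries add database load themselves, so concurrency must be throttled (e.g. 10 queries per batch, 100 ms apart)

    Rollback: Stop the warm-up script and delete the cache keys it wrote: redis-cli -h <redis-host> --scan --pattern 'catalog:item:*' | xargs redis-cli -h <redis-host> DEL (this pattern also removes keys repopulated by live traffic)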

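  • Temporarily increase the database connection pool size (e.g. from 200 to 400) and restart catalog-svc to apply the new configuration

    Risk: Too many connections can put memory pressure on the database and must be monitored; with Postgres CPU already at 100%, extra connections may also add contention, and the pool must stay within max_connections

    Rollback: Restore the pool size and restart: kubectl set env deployment/catalog-svc DB_POOL_SIZE=200 && kubectl rollout restart deployment/catalog-svc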
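
The throttled warm-up described in the second mitigation could look roughly like the sketch below. It is an illustration under stated assumptions, not the team's actual script: `warm_cache`, `fetch`, `store`, and `stagger_s` are hypothetical names, the database and Redis calls are injected as plain callables, and queries inside a batch run one at a time, which is more conservative than 10 concurrent queries. Each batch gets a slightly longer TTL than the previous one so the warmed keys do not expire together.

```python
import time
from typing import Any, Callable, Iterable


def warm_cache(
    item_ids: Iterable[int],
    fetch: Callable[[int], Any],              # hypothetical: loads one item from the DB
    store: Callable[[str, Any, int], None],   # hypothetical: writes key, value, TTL to Redis
    batch_size: int = 10,
    pause_s: float = 0.1,
    base_ttl_s: int = 7200,
    stagger_s: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Warm the cache in small batches and return the number of keys written."""
    ids = list(item_ids)
    written = 0
    for batch_no, start in enumerate(range(0, len(ids), batch_size)):
        # Each batch expires a little later than the previous one.
        ttl_s = base_ttl_s + batch_no * stagger_s
        for item_id in ids[start:start + batch_size]:
            store(f"catalog:item:{item_id}", fetch(item_id), ttl_s)
            written += 1
        if start + batch_size < len(ids):
            sleep(pause_s)  # give the database room between batches
    return written


if __name__ == "__main__":
    cache = {}
    warmed = warm_cache(
        range(1, 26),
        fetch=lambda item_id: {"id": item_id},
        store=lambda key, value, ttl_s: cache.__setitem__(key, (value, ttl_s)),
    )
    print(warmed, "keys warmed")
```

Injecting `sleep` keeps the pacing testable without real delays; in production the batch size and pause would be tuned against observed database headroom.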

Customer impact

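During Black Friday, users visiting product detail pages experienced severe latency (p99 of 15 s), and about 8% of requests failed outright with 503. Users could not browse products normally, which may lead to cart abandonment and lost revenue. Recovery is expected within 15 minutes of cache warm-up completing.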

Postmortem draft

Postmortem: catalog-svc cache stampede caused Black Friday service degradation

Summary

  • Date: [incident date] (Black Friday)
  • Impact window: 09:00 - 09:30 UTC
  • Severity: SEV1
  • Owner: [service owner]

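Timeline

  • 08:59:58 UTC: Last cache hit
  • 09:00:00 UTC: Cache stampede begins, DB load spikes
  • 09:02 UTC: Alerts fire
  • 09:03 UTC: Cache stampede diagnosed from the Redis miss-rate metric
  • 09:05 UTC: Mitigation options debated
  • 09:10 UTC: Rate limiting enabled
  • 09:15 UTC: Manual cache warm-up started
  • 09:30 UTC: Service recovered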

Impact

  • p99 latency: 15s (baseline 80ms)
  • Error rate: 8% 503
  • Affected users: all Black Friday visitors

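Root cause

  • The cache pre-warm used a fixed 7-hour TTL, so all keys expired together at 09:00
  • Black Friday traffic surged 12x, and the DB could not absorb the load once requests fell through the cache
  • No cache stampede protection was in place (e.g. mutex locks, staggered expiry)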

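What went well

  • Alerts fired promptly and on-call responded quickly
  • Rate limiting kept the DB from collapsing entirely

What went poorly

  • The cache TTL strategy did not account for traffic peaks
  • No automated protection against cache stampedes
  • No load test was run before Black Friday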

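Action items

  • [ ] Randomize cache expiry times (TTL ± 20%)
  • [ ] Introduce cache stampede protection: singleflight or a distributed lock
  • [ ] Run a full end-to-end load test before Black Friday
  • [ ] Add a cache hit-rate alert
  • [ ] Define a cache pre-warm strategy that staggers expiry times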
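
The first action item (TTL ± 20%) can be sketched in a few lines. This is an illustrative example rather than the service's implementation; `jittered_ttl` and its parameters are hypothetical names, and integer percentages are used to avoid floating-point rounding at the range edges.

```python
import random
from typing import Callable


def jittered_ttl(
    base_ttl_s: int,
    spread_pct: int = 20,
    randint: Callable[[int, int], int] = random.randint,
) -> int:
    """Return base_ttl_s shifted by a random offset within +/- spread_pct percent."""
    if not 0 <= spread_pct < 100:
        raise ValueError("spread_pct must be in [0, 100)")
    delta = base_ttl_s * spread_pct // 100
    return base_ttl_s + randint(-delta, delta)


if __name__ == "__main__":
    base = 7 * 3600  # the 7-hour TTL from the nightly pre-warm
    print([jittered_ttl(base) for _ in range(5)])
```

With the 02:00 UTC pre-warm and a 7-hour base TTL, a ±20% spread distributes expiry between 07:36 and 10:24 UTC instead of concentrating it at 09:00. Some keys will still expire during the peak, so jitter reduces the size of the miss wave but does not replace per-key request coalescing.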

Follow-ups

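  • P0 · Randomize cache key TTLs to avoid simultaneous expiry · service owner
  • P0 · Introduce the singleflight pattern to prevent cache stampedes · service owner
  • P1 · Run a full end-to-end load test before Black Friday that simulates a cache stampede · platform team
  • P1 · Add a cache hit-rate alert (fires below 90%) · on-call SRE
  • P2 · Review the TTL settings of all pre-warm jobs and make sure expiry times are staggered · service owner
  • P2 · Write a cache stampede runbook covering automatic rate limiting and warm-up scripts · on-call SRE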
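
The singleflight follow-up can be illustrated with a small in-process sketch. It is an assumption-laden example, not the Go `singleflight` package and not catalog-svc code: the `SingleFlight` class and its `do` method are hypothetical, and it only coalesces callers inside one process, so coordinating across pods would still need something like a distributed lock.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """State shared by the leader and the followers of one in-flight load."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.value: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Coalesce concurrent loads of the same key into a single call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = self._calls[key] = _Call()
        if not leader:
            call.done.wait()  # followers wait for the leader's result
        else:
            try:
                call.value = fn()  # only the leader reaches the database
            except BaseException as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()
        if call.error is not None:
            raise call.error
        return call.value
```

On a cache miss, a handler would wrap its database load as `flight.do("catalog:item:42", load_item)`, so that concurrent misses for the same key share one query and one result. Errors raised by the leader are re-raised in every waiting caller, and the key is cleared afterwards so the next miss triggers a fresh load.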