[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
service: catalog-svc
created: 6/3/2026, 12:08:50 AM
Raw incident context
Time: 09:00 UTC, Black Friday. catalog-svc latency exploded at exactly 09:00:00 UTC.
Symptoms:
- catalog-svc p99: 15s (baseline 80ms)
- 503 rate: 8% (intermittent during DB overload)
- Postgres CPU: 100% sustained, lock waits climbing
- Redis CPU: 25% (looks healthy)
- Redis cache miss rate for keys matching "catalog:item:*": 95% (baseline 2%)
- All product detail page requests are flooding through to DB
Background:
- We pre-warm the homepage catalog cache every night at 02:00 UTC with TTL=7h
- 02:00 UTC + 7h = 09:00 UTC ← all keys expired simultaneously
- Black Friday traffic ramp: 12x normal at 09:00 UTC (marketing email blast)
- No per-key locking; every cache miss triggers a fresh DB query
Logs:
- 08:59:58 UTC: last cache hit logged
- 09:00:00 UTC: first wave of "cache miss + DB query" log lines
- 09:00:02 UTC: DB connection pool hits ceiling, queries start queueing
On-call:
- 09:02: pager (multiple alerts: DBHighCPU, CatalogSvcLatencyHigh, CheckoutErrorRate)
- 09:03: diagnosis: cache stampede confirmed by Redis miss-rate metric
- 09:05: debating: extend TTL (won't help in flight), warm cache manually (DB is the bottleneck), or shed load
Summary
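During the Black Friday traffic peak at 09:00 UTC, a cache stampede in catalog-svc drove PostgreSQL CPU to 100% and p99 latency to 15s, with intermittent 503 errors. The cache keys all shared one TTL and expired together at 09:00, so every request fell through to the database, exhausting the connection pool and causing lock waits.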
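Severity reasoning: User-visible outage: p99 latency is 15s (baseline 80ms), the 503 error rate is 8% (above the 1% threshold) and has lasted more than 5 minutes, and the revenue path (product detail pages) is disrupted. This meets the SEV1 definition.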
deepseek-chat·prompt v2·output: zh·11330ms·2216↑ / 1480↓ tok·$0.00223
Root cause hypotheses
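- high: The cache TTL is uniformly set to 7 hours, so all catalog:item:* keys expired together at 09:00 and triggered a cache stampede
Evidence: The Redis miss rate for catalog:item:* jumped from the 2% baseline to 95%; the last cache hit was logged at 08:59:58, and the first wave of cache-miss log lines began at 09:00:00
- medium: Black Friday traffic was 12x normal and the database connection pool was too small to absorb the burst of queries
Evidence: At 09:00:02 the database connection pool hit its ceiling, queries began queueing, and CPU stayed at 100%
- low: Database queries are inefficient and lack indexes suited to high-concurrency load
Evidence: A single cache-miss query may involve complex JOINs or full table scans, but there is no direct evidence for this yet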
Investigation checklist
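- Confirm the scope of the stampede: check the TTL distribution of catalog:item:* keys in Redis
redis-cli -h <redis-host> -p 6379 --scan --pattern 'catalog:item:*' | head -n 1000 | xargs -I {} redis-cli -h <redis-host> -p 6379 TTL {} | sort | uniq -c | sort -rn | head -20
Expected: few or no keys from the 02:00 pre-warm remain (an expired key is deleted, so TTL reports -2 for it and never 0), and keys refilled since 09:00 report TTLs clustered close together, consistent with a shared expiry time
- Check the database's currently active queries and wait events
SELECT pid, query_start, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;
Expected: many queries in the 'active' state with wait_event_type 'LWLock' or 'Lock', and query text showing product detail lookups
- Check database connection usage
SELECT count(*) AS used_connections, setting AS max_connections FROM pg_stat_activity, pg_settings WHERE name = 'max_connections' GROUP BY setting;
Expected: used_connections close to max_connections, confirming connection exhaustion
- Check the Redis cache miss rate trend
kubectl exec -n prod deploy/redis-metrics -- redis-cli -h <redis-host> -p 6379 INFO stats | grep -E 'keyspace_misses|keyspace_hits'
Expected: both counters are cumulative since Redis started and cover the whole instance, so take two samples a few seconds apart; the increase in keyspace_misses should far exceed the increase in keyspace_hits, confirming a miss rate above 90%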
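Because keyspace_hits and keyspace_misses only ever increase, the miss rate for the incident window has to come from the difference between two samples. The following is a minimal sketch of that arithmetic; the function names and the sample counter values are made up for illustration and are not part of any existing tooling.

```python
def parse_info_stats(text):
    """Extract keyspace_hits / keyspace_misses from `redis-cli INFO stats` output."""
    wanted = {"keyspace_hits", "keyspace_misses"}
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in wanted:
            stats[key] = int(value)
    return stats


def miss_rate(before, after):
    """Miss rate over the interval between two INFO stats samples."""
    a, b = parse_info_stats(before), parse_info_stats(after)
    hits = b["keyspace_hits"] - a["keyspace_hits"]
    misses = b["keyspace_misses"] - a["keyspace_misses"]
    total = hits + misses
    return misses / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counter values, a few seconds apart.
    sample_1 = "keyspace_hits:1000000\r\nkeyspace_misses:20000\r\n"
    sample_2 = "keyspace_hits:1000050\r\nkeyspace_misses:20950\r\n"
    print(f"{miss_rate(sample_1, sample_2):.0%}")  # 950 misses out of 1000 lookups
```

Note that this is an instance-wide figure; it matches the per-pattern catalog:item:* metric only if those keys dominate traffic on the instance.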
Mitigation plan
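Immediately enable rate limiting in catalog-svc on the product detail API to protect the database
Risk: some user requests are rejected, but this prevents a complete database collapse; the threshold must be set carefully to avoid rejecting more traffic than necessary
Rollback: remove the rate-limit configuration or raise the threshold
Manually warm the cache: run a script that batch-queries popular products and writes them to Redis, taking care not to add further load on the database
Risk: warm-up queries may push the database over the edge; rate-limit first and then warm, or read from a read replica
Rollback: stop the warm-up script
Urgently scale out database read replicas and route read traffic to them
Risk: scaling takes time and the replicas may come under the same pressure; verify the replica configuration before routing traffic
Rollback: scale the replicas back in or remove the replica routing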
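To make the rate-limiting step concrete, here is a minimal token-bucket sketch. It is an illustration only: the incident context does not say what limiter catalog-svc actually ships with, and the class name, parameters, and injected clock are all assumptions.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start full
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller sheds the request instead of querying the database
```

A request handler would call allow() before any database query on a cache miss and return an error response immediately when it returns False. Applying the limit only to the miss path, rather than to all requests, keeps cache hits unaffected; the rollback is then simply raising rate and capacity.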
Customer impact
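Users opening product detail pages see high latency (15 seconds or more) or intermittent 503 errors. This affects all users arriving via the homepage and the marketing email. An estimated 80% or more of Black Friday traffic is affected. No ETA at this time.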
Postmortem draft
Summary
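During the Black Friday traffic peak at 09:00 UTC, a cache stampede in catalog-svc drove PostgreSQL CPU to 100% and p99 latency to 15s, with intermittent 503 errors. The cache keys all shared one TTL and expired together at 09:00, so every request fell through to the database, exhausting the connection pool and causing lock waits.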
Timeline (UTC)
[FILL IN]
Impact
[FILL IN]
Root Cause
The cache TTL is uniformly set to 7 hours, so all catalog:item:* keys expired together at 09:00 and triggered a cache stampede.
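The arithmetic behind the root cause is 02:00 UTC + 7h = 09:00 UTC for every key. A hedged sketch of the usual countermeasure, a random offset on the TTL, follows; the +/-10% spread and the function name are assumptions chosen for illustration, not values from the incident.

```python
import random

BASE_TTL_SECONDS = 7 * 3600   # the 7h TTL from the incident context
JITTER_FRACTION = 0.10        # assumed spread of +/-10%; tune as needed


def jittered_ttl(base=BASE_TTL_SECONDS, fraction=JITTER_FRACTION, rng=random):
    """Return the base TTL shifted by a uniform random offset within +/-fraction."""
    spread = int(base * fraction)
    return base + rng.randint(-spread, spread)


if __name__ == "__main__":
    ttl = jittered_ttl()
    print(ttl, "seconds")  # somewhere between 22680 (6h18m) and 27720 (7h42m)
```

With these assumed values, keys written at 02:00 UTC would expire across 08:18 to 09:42 UTC instead of all at 09:00:00. If the pre-warm job uses the redis-py client, the value could be passed as the ex argument of set(), for example r.set(key, value, ex=jittered_ttl()).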
Detection
[FILL IN]
Response
[FILL IN]
What Went Well
[FILL IN]
What Went Poorly
[FILL IN]
Action Items
[FILL IN]
Follow-ups
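- P0: Add a random TTL offset to cache keys so they do not expire simultaneously (owner: catalog-svc team)
- P0: Implement a distributed lock or request-coalescing mechanism (e.g. single-flight) for cache misses (owner: platform team)
- P1: Add monitoring and alerting for the database connection pool, with a threshold on connection usage (owner: SRE team)
- P1: Run a cache stampede drill before Black Friday to validate the rate-limiting and cache-warming procedures (owner: SRE team)
- P2: Evaluate product detail query performance and add any necessary indexes (owner: catalog-svc team)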
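The single-flight follow-up can be sketched as below. This is an in-process illustration under stated assumptions: the SingleFlight class and its do() method are hypothetical names, and it only coalesces callers within one process. Coalescing across catalog-svc replicas would additionally need a shared lock, for example one built on Redis SET with the NX and EX options.

```python
import threading
from concurrent.futures import Future


class SingleFlight:
    """Coalesce concurrent calls for the same key into a single execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Future shared by all waiting callers

    def do(self, key, fn):
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut
        if not leader:
            # Followers wait for the leader's result (or its exception).
            return fut.result()
        try:
            fut.set_result(fn())
        except BaseException as exc:
            fut.set_exception(exc)
        finally:
            with self._lock:
                del self._inflight[key]
        return fut.result()
```

On a cache miss, the handler would wrap the database load as sf.do(cache_key, load_from_db), so that many simultaneous misses for one product result in one query rather than one query per request. Calls that arrive after the leader finishes start a new flight, which is why the loader should also write the result back to Redis.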
Similar past incidents
lexical match (pg_trgm)
- 77%
[Eval][v1][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 77%
[Eval][v2][zh] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 50%
[Eval][v2][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 46%
[Eval][v1][en] Cache stampede after Redis key expiry on Black Friday morning
DB CPU 100%, p99 latency 15s, intermittent 503s, Redis CPU normal but cache miss rate at 95%
- 19%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.