[Eval][v1][zh] Order service OOM crashloop following v3.7 deploy
service: order-svc · created: 6/2/2026, 11:52:00 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC: order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12: first OOMKilled pod
- 09:14: pager
- 09:16: service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
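Since v3.7 was deployed at 04:00 UTC, order-svc memory has grown linearly, triggering OOMKilled roughly every 20 minutes and putting pods into a restart loop. Currently 3 pods are OOMKilled and 9 are running; p99 latency is elevated, about 3% of requests time out, and checkout success rate has dropped from 99.7% to 96.8%. The root cause is the newly added in-process request cache (Map), which has no eviction policy and therefore leaks memory.
Severity reasoning: The checkout flow is affected and success rate is down about 3%, but the service remains partially available. This is not a full outage and there is no data loss, so it is rated SEV2.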
deepseek-chat·prompt v1·output: zh·10056ms·1786↑ / 1460↓ tok·$0.00209
Root cause hypotheses
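- high: The in-process request cache (Map) added in v3.7 has no eviction policy, so memory grows until OOM
Evidence: The heap profile shows about 340Mi held by a Map keyed on request ID that is never cleared; memory grew linearly from 180Mi to 500Mi before the crash.
- low: The JSON parser switch (to fast-json-stringify) extends object lifetimes and indirectly adds memory pressure
Evidence: The release also switched the JSON parser, but the heap profile is dominated by the Map, and a parser by itself does not usually cause linear growth.
- low: An external dependency (e.g. the database) returns larger result sets, filling the cache faster
Evidence: No logs or metrics support this; the growth pattern correlates strongly with the deploy time rather than with a change in traffic.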
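The linear-growth evidence behind the first hypothesis can be sanity-checked with quick arithmetic. The sketch below extrapolates the two reported memory readings to the 512Mi limit. It assumes growth stays linear and that the pod is killed at the limit; the type and function names are illustrative only.

```typescript
// Linear extrapolation of the memory trend reported in the incident context.
// The two readings and the limit come from the report; the linear model is an assumption.
interface MemSample {
  hoursUtc: number; // time of day in hours, UTC
  memMi: number;    // observed pod memory in Mi
}

function growthRateMiPerHour(a: MemSample, b: MemSample): number {
  return (b.memMi - a.memMi) / (b.hoursUtc - a.hoursUtc);
}

function hoursUntilLimit(start: MemSample, rateMiPerHour: number, limitMi: number): number {
  return (limitMi - start.memMi) / rateMiPerHour;
}

function formatUtc(hours: number): string {
  const totalMinutes = Math.round(hours * 60);
  const hh = Math.floor(totalMinutes / 60) % 24;
  const mm = totalMinutes % 60;
  const pad = (n: number): string => (n < 10 ? "0" : "") + n;
  return `${pad(hh)}:${pad(mm)} UTC`;
}

const atDeploy: MemSample = { hoursUtc: 4, memMi: 180 };
const beforeCrash: MemSample = { hoursUtc: 9, memMi: 500 };

const rate = growthRateMiPerHour(atDeploy, beforeCrash);      // 64 Mi/h
const hoursToOom = hoursUntilLimit(atDeploy, rate, 512);      // 5.1875 h
const projectedOom = formatUtc(atDeploy.hoursUtc + hoursToOom); // "09:11 UTC"
console.log(rate, hoursToOom, projectedOom);
```

The projected time of 09:11 UTC sits within a minute of the first OOMKilled pod at 09:12, which is consistent with a steady leak that began at the deploy.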
Investigation checklist
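- Confirm the current pod memory trend
kubectl top pod -l app=order-svc --sort-by=memory
Expected: pods approaching 512Mi, with usage rising across repeated runs
- Check the termination reason of recently OOM-killed pods
kubectl get pods -l app=order-svc -o json | jq '.items[].status.containerStatuses[].lastState.terminated.reason'
Expected: prints "OOMKilled" for the affected pods (pods in CrashLoopBackOff are not in the Failed phase, so no phase filter is applied)
- Review the v3.7 code change and confirm the cache implementation
git diff v3.6..v3.7 -- order-svc/ | grep -A 20 'Map'
Expected: shows the newly added Map cache with no eviction or TTL logic
- Check whether the HPA has reached its ceiling
kubectl describe hpa order-svc-hpa | grep -E 'Min|Max|Replicas'
Expected: current replicas at 12, equal to the configured maximum
- Capture a heap profile to confirm the cache Map dominates, before attempting any manual clear
kubectl exec <pod> -- curl -s 'localhost:8080/debug/pprof/heap?debug=1' | head -100
Expected: Map objects dominate the heap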
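The jq filter in the second checklist step reads lastState.terminated.reason from each container status. A minimal sketch of the same filter in TypeScript, applied to the parsed output of kubectl get pods -o json, is shown here. Only the fields it touches are typed, and the pod names in the sample are made up for illustration.

```typescript
// Partial shape of a Kubernetes PodList; only the fields used below are declared.
interface ContainerStatus {
  restartCount: number;
  lastState?: { terminated?: { reason?: string } };
}
interface Pod {
  metadata: { name: string };
  status: { containerStatuses?: ContainerStatus[] };
}
interface PodList {
  items: Pod[];
}

// Returns one entry per container whose previous termination reason was OOMKilled.
function oomKilledPods(list: PodList): { pod: string; restarts: number }[] {
  const hits: { pod: string; restarts: number }[] = [];
  for (const p of list.items) {
    for (const cs of p.status.containerStatuses ?? []) {
      if (cs.lastState?.terminated?.reason === "OOMKilled") {
        hits.push({ pod: p.metadata.name, restarts: cs.restartCount });
      }
    }
  }
  return hits;
}

// Hypothetical sample data, not taken from the incident.
const podList: PodList = {
  items: [
    {
      metadata: { name: "order-svc-a" },
      status: {
        containerStatuses: [
          { restartCount: 4, lastState: { terminated: { reason: "OOMKilled" } } },
        ],
      },
    },
    {
      metadata: { name: "order-svc-b" },
      status: { containerStatuses: [{ restartCount: 0, lastState: {} }] },
    },
  ],
};

const oomHits = oomKilledPods(podList);
console.log(oomHits);
```

Including restartCount alongside the reason makes it easier to see which pods are looping rather than having been killed once.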
Mitigation plan
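Immediately roll back order-svc to v3.6 (the last stable version)
Risk: Brief connection interruptions are possible during the rollback, but the risk is lower than continued OOM kills
Rollback: Redeploy v3.7 (only after the cache issue is fixed)
If rollback is not feasible, temporarily raise the pod memory limit to 1Gi and restart all pods
Risk: Only delays OOM and does not address the root cause; may affect cluster resource allocation
Rollback: Restore the memory limit to 512Mi
Until the code fix lands, disable the cache via configuration (e.g. an environment variable such as CACHE_ENABLED=false)
Risk: May increase downstream request latency, but avoids OOM
Rollback: Re-enable the cache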
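The third mitigation depends on the service exposing a switch for the cache, which the report mentions only as an example. A hedged sketch of what such a switch might look like follows. CACHE_ENABLED, cacheEnabled, and makeLookup are hypothetical names, and the enabled branch deliberately mirrors the unbounded v3.7 behavior to show what the flag bypasses.

```typescript
// Hypothetical kill switch; the report names CACHE_ENABLED only as an example.
type Env = Record<string, string | undefined>;

function cacheEnabled(env: Env): boolean {
  // Default to enabled so behavior is unchanged unless the flag is set explicitly.
  const raw = (env["CACHE_ENABLED"] ?? "true").trim().toLowerCase();
  return raw !== "false" && raw !== "0";
}

function makeLookup<V>(env: Env, load: (requestId: string) => V): (requestId: string) => V {
  if (!cacheEnabled(env)) {
    return load; // bypass: nothing is retained between requests
  }
  // Unbounded Map, as described for v3.7; shown only to illustrate the bypass.
  const store = new Map<string, V>();
  return (requestId: string): V => {
    const hit = store.get(requestId);
    if (hit !== undefined) return hit;
    const value = load(requestId);
    store.set(requestId, value);
    return value;
  };
}

let loadCount = 0;
const loadOrder = (id: string): string => {
  loadCount += 1;
  return `order:${id}`;
};

const cachedLookup = makeLookup({ CACHE_ENABLED: "true" }, loadOrder);
cachedLookup("r1");
cachedLookup("r1"); // served from the Map, so loadOrder ran once

const bypassLookup = makeLookup({ CACHE_ENABLED: "false" }, loadOrder);
bypassLookup("r1");
bypassLookup("r1"); // loadOrder ran twice more

console.log(loadCount); // 3
```

In a Node.js service the env argument would typically be process.env. With the flag off, every request goes to the loader, which is the latency cost noted in the risk line.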
Customer impact
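About 3% of checkout requests time out or fail, p99 latency is elevated, and some users may see error pages. The service still handles most traffic, but continued degradation could lead to a full outage.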
Postmortem draft
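Postmortem: order-svc OOMKilled loop
Summary
[2-3 sentence overview]
Timeline
- 04:00 UTC: v3.7 deployed
- 09:12 UTC: first OOMKilled pod
- 09:14 UTC: alert fired
- 09:16 UTC: investigation started
- [rollback time]
Impact
- Checkout success rate dropped from 99.7% to 96.8%
- 3 pods repeatedly OOMKilled
- p99 latency elevated
Root cause
The in-process request cache (Map) introduced in v3.7 has no eviction policy, which leaks memory.
What went well
- The memory growth trend was identified quickly
- The rollback was timely
What went poorly
- Code review did not catch the missing eviction policy
- No long-running stress test was run in staging
Action items
- [ ] Add a TTL and a max-entry limit to the cache
- [ ] Add memory usage alerting
- [ ] Update the deploy process to require performance testing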
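The first action item, a TTL plus a max-entry limit, could look roughly like the sketch below. This is not the service's actual code: BoundedTtlCache and its parameters are illustrative, and it relies on Map preserving insertion order to evict the oldest entry first.

```typescript
// Sketch of a bounded cache: entries expire after ttlMs, and the oldest
// insertion is evicted once maxEntries is exceeded.
class BoundedTtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private readonly maxEntries: number,
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {}

  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired entries are dropped lazily on read
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: V): void {
    // Delete first so a re-inserted key moves to the end of insertion order.
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next();
      if (oldest.done) break;
      this.entries.delete(oldest.value);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

let fakeClockMs = 0;
const requestCache = new BoundedTtlCache<string>(2, 1000, () => fakeClockMs);
requestCache.set("req-1", "a");
requestCache.set("req-2", "b");
requestCache.set("req-3", "c"); // exceeds maxEntries, so req-1 is evicted

const evictedValue = requestCache.get("req-1"); // undefined
const keptValue = requestCache.get("req-3");    // "c"
fakeClockMs = 1500;                              // past the 1000 ms TTL
const expiredValue = requestCache.get("req-3"); // undefined
console.log(evictedValue, keptValue, expiredValue, requestCache.size);
```

Expired entries are only removed when read, so the max-entry limit is what actually bounds memory here; the TTL keeps stale values from being served.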
Follow-ups
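- P0: Add a TTL and a max-entry limit to the in-process cache, and fix v3.7 (owner: service owner)
- P1: Add a memory usage alert (>80% of limit) (owner: on-call SRE)
- P1: Update the deploy pipeline to include memory stress testing (owner: platform team)
- P2: Audit all code that uses a Map as a cache and ensure each has an eviction policy (owner: service owner)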
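To gauge how much warning the proposed 80% alert would have given in this incident, the threshold can be combined with the observed growth rate. The sketch below uses the 512Mi limit and a rate of 64 Mi/h, derived from the reported 180Mi to 500Mi growth over five hours, and assumes growth stays linear. The function names are illustrative.

```typescript
// Lead time an alert at a fraction of the memory limit would give under linear growth.
function alertThresholdMi(limitMi: number, fraction: number): number {
  return limitMi * fraction;
}

function leadTimeMinutes(limitMi: number, fraction: number, rateMiPerHour: number): number {
  const headroomMi = limitMi - alertThresholdMi(limitMi, fraction);
  return (headroomMi / rateMiPerHour) * 60;
}

const thresholdMi = alertThresholdMi(512, 0.8);   // 409.6 Mi
const leadMinutes = leadTimeMinutes(512, 0.8, 64); // about 96 minutes
console.log(thresholdMi, Math.round(leadMinutes));
```

Under those assumptions the alert would have fired roughly an hour and a half before the first OOM kill, enough time to roll back before any customer impact.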
Similar past incidents
lexical match (pg_trgm)
- 35%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 34%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 18%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 17%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 17%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts