[Eval][v2][zh] Order service OOM crashloop following v3.7 deploy
service: order-svc
created: 6/2/2026, 11:37:15 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC - order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12 - first OOMKilled pod
- 09:14 - pager
- 09:16 - service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
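The memory trend above is enough for a rough leak-rate estimate. The sketch below is illustrative only: the function names are hypothetical, and it assumes growth stays linear from the 180Mi reading at 04:00 UTC.

```javascript
// Average growth rate between two memory readings, in Mi per hour.
function growthRateMiPerHour(startMi, endMi, hours) {
  return (endMi - startMi) / hours;
}

// Hours until a pod at currentMi reaches limitMi at a constant growth rate.
function hoursUntilLimit(currentMi, limitMi, rateMiPerHour) {
  if (rateMiPerHour <= 0) return Infinity; // no growth, the limit is never reached
  return Math.max(0, (limitMi - currentMi) / rateMiPerHour);
}

// Readings from the raw context: 180Mi at 04:00 UTC, 500Mi at 09:00 UTC.
const rate = growthRateMiPerHour(180, 500, 5); // 64 Mi/hour
// Time from the 04:00 reading until the 512Mi limit is hit.
const hoursToOom = hoursUntilLimit(180, 512, rate); // 5.1875 hours
console.log({ rate, hoursToOom });
```

Under that assumption the limit is reached about 5 hours 11 minutes after 04:00, around 09:11 UTC, which lines up with the first OOMKilled pod at 09:12.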
Summary
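As of 09:15 UTC, order-svc pods are being OOMKilled repeatedly, restarting roughly every 20 minutes; p99 latency is elevated and about 3% of requests time out. The root cause is the Map-based in-process request cache introduced in the v3.7 deploy: it has no eviction policy, so memory grows until the 512Mi limit is exhausted.
Severity reasoning: User-visible failure: the error rate rose from 0.3% to 3.2% (above 1% for more than 5 minutes), p99 latency is elevated, and checkout success rate dropped, which affects the revenue path. This meets the SEV1 definition.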
deepseek-chat·prompt v2·output: zh·10252ms·2167↑ / 1527↓ tok·$0.00226
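The error-rate figures in the severity reasoning follow from the checkout success rates in the raw context (99.7% and 96.8%). A minimal sketch of that arithmetic, with hypothetical function names; the "above 1% for more than 5 minutes" rule is taken from the severity reasoning and is not a verified policy definition:

```javascript
// Error rate in percent, rounded to one decimal place to avoid floating-point noise.
function errorRatePct(successRatePct) {
  return Math.round((100 - successRatePct) * 10) / 10;
}

// Assumed rule: SEV1 when the error rate stays above 1% for more than 5 minutes.
function meetsSev1Threshold(errorPct, sustainedMinutes) {
  return errorPct > 1 && sustainedMinutes > 5;
}

const before = errorRatePct(99.7); // 0.3
const during = errorRatePct(96.8); // 3.2
console.log({ before, during });
```

How long the dip has lasted still has to be read off the dashboards; the raw context only records the 96.8% value at 09:16.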
Root cause hypotheses
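- high: The Map cache introduced in v3.7 has no eviction policy, so request IDs accumulate and memory leaks
Evidence: the heap snapshot shows about 340Mi held by a Map keyed on request ID that is never cleared; the start of the memory growth matches the deploy time
- medium: The fast-json-stringify JSON parser leaks memory
Evidence: switching parsers could introduce object references that are never released, but the heap snapshot shows no abnormal growth in parser-related objects
- low: Kubernetes node memory pressure is causing pod evictions
Evidence: node memory usage is normal and only order-svc pods are OOMKilled, so this is not a node-level problem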
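The leading hypothesis can be illustrated with a small simulation. This is a sketch, not the order-svc code: it assumes every request carries a unique request ID, in which case a Map keyed on that ID gains one entry per request and never serves a hit.

```javascript
// Simulates a cache keyed on request ID with no eviction.
// Assumption: request IDs are unique per request.
function simulateUnboundedCache(requestCount) {
  const cache = new Map();
  let hits = 0;
  for (let i = 0; i < requestCount; i++) {
    const requestId = `req-${i}`; // hypothetical ID format
    if (cache.has(requestId)) {
      hits++;
      continue;
    }
    // Nothing ever deletes this entry, so the Map only grows.
    cache.set(requestId, { receivedAt: i });
  }
  return { entries: cache.size, hits };
}

console.log(simulateUnboundedCache(1000));
```

If the real keys are in fact unique, the cache yields no benefit at all, which would make removing it a reasonable alternative to bounding it.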
Investigation checklist
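- Confirm the pod memory trend and the OOM timing
kubectl top pod -n prod -l app=order-svc --sort-by=memory | head -15
Expected: pods with memory usage close to 512Mi. Restart counts are not part of this output; read them from kubectl get pods -n prod -l app=order-svc
- Review what changed in the v3.7 deploy
kubectl rollout history deployment/order-svc -n prod --revision=<revision-number>
Expected: the pod template for that revision shows the v3.7 image. --revision takes the Deployment's integer revision, not the application version, so run the command without it first to list revisions. Cross-check the changelog for the JSON parser switch and the new cache
- Capture a heap snapshot from a pod that is approaching the limit and analyze the leak
kubectl exec -n prod -c order-svc <pod-name> -- kill -USR2 1 && kubectl cp -n prod -c order-svc <pod-name>:<snapshot-file> ./order-svc.heapsnapshot
Expected: the snapshot shows a Map holding most of the heap, keyed on requestId. This assumes the service is a Node.js process running as PID 1 and started with --heapsnapshot-signal=SIGUSR2; jcmd only works against a JVM
- Inspect the cache implementation (GitHub)
grep -n 'Map' ./order-svc/src/cache.js
Expected: a new Map() with no delete or clear calls
- Check whether the JSON parser is leaking
kubectl logs -n prod -l app=order-svc --since=1h --tail=-1 | grep -i 'fast-json' | head -10
Expected: no related error logs; the parser behaves normally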
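For the heap snapshot step, a Node.js service can also expose its own memory counters and write a snapshot on demand. The sketch below assumes order-svc is a Node.js process; the function names, the /tmp directory, and the file name pattern are hypothetical. `process.memoryUsage()` and `v8.writeHeapSnapshot()` are standard Node.js APIs.

```javascript
// Convert bytes to mebibytes, rounded to one decimal place.
function toMi(bytes) {
  return Math.round((bytes / (1024 * 1024)) * 10) / 10;
}

// Current memory counters of this process, in Mi.
function memoryReport() {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  return { rssMi: toMi(rss), heapUsedMi: toMi(heapUsed), heapTotalMi: toMi(heapTotal) };
}

// Write a heap snapshot whenever the process receives SIGUSR2.
// Taking a snapshot blocks the event loop and needs extra memory,
// so it is best triggered on a single pod rather than fleet-wide.
function installHeapSnapshotSignal(dir = '/tmp') {
  process.on('SIGUSR2', async () => {
    // Dynamic import works from both CommonJS and ES modules.
    const v8 = await import('node:v8');
    const file = v8.writeHeapSnapshot(`${dir}/order-svc-${Date.now()}.heapsnapshot`);
    console.log(`heap snapshot written to ${file}`);
  });
}

console.log(memoryReport());
```

Calling `installHeapSnapshotSignal()` once at startup would let `kill -USR2 <pid>` inside the container produce a file that can be copied out and opened in Chrome DevTools.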
Mitigation plan
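Roll back order-svc to v3.6 (the last stable version) immediately
Risk: connections may be briefly interrupted during the rollback; no risk of data loss
Rollback: redeploy v3.7 or a later fixed version
If rollback is not feasible, temporarily raise the pod memory limit to 1Gi and restart all pods
Risk: treats the symptom only, since memory will keep growing; may put pressure on node resources
Rollback: restore the memory limit to 512Mi and roll back the code
Add LRU eviction or a TTL to the cache code and cap the Map size
Risk: the new code needs testing and may introduce bugs
Rollback: roll back to the version without the cache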
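The third mitigation (LRU eviction, a TTL, and a size cap) could look roughly like the sketch below. It is not the order-svc implementation: the class name, option names, and defaults are all hypothetical. It relies on the fact that a JavaScript Map iterates in insertion order, so the first key is always the least recently used one.

```javascript
// A Map-backed cache with a size cap (LRU eviction) and a per-entry TTL.
class BoundedCache {
  constructor({ maxEntries = 1000, ttlMs = 60000, now = Date.now } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful in tests
    this.entries = new Map(); // insertion order doubles as recency order
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key); // refresh position if the key already exists
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  get size() {
    return this.entries.size;
  }
}
```

Expired entries are only removed when they are read or pushed out by newer ones, so the size cap is what actually bounds memory; the TTL bounds staleness. A maintained library such as lru-cache is a common alternative to hand-rolling this.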
Customer impact
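About 3% of checkout requests time out or fail, and p99 latency is elevated, which degrades the user experience. Roughly 3% of active users are estimated to be affected. Recovery is expected within 10 minutes of the rollback.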
Postmortem draft
Summary
After the v3.7 deploy, a memory leak caused order-svc pods to be OOMKilled repeatedly, reducing checkout success rate.
Timeline (UTC)
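- 04:00 - v3.7 rolling deploy started (100% complete by 04:08)
- 09:12 - First pod OOMKilled
- 09:14 - Pager fired
- 09:16 - Impact confirmed, investigation started
- [FILL IN] - Rollback complete
- [FILL IN] - Service recovered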
Impact
- Error rate rose from 0.3% to 3.2%
- p99 latency elevated
- About 3% of requests timed out
Root Cause
The in-process request cache (a Map) introduced in v3.7 had no eviction policy, so memory grew continuously until the pods were OOMKilled.
Detection
Kubernetes pod OOMKilled events triggered an alert, which then paged on-call.
Response
- 09:16 Investigation started
- Root cause confirmed as the cache leak
- Rolled back to v3.6
What Went Well
- Alerting was timely
- The heap snapshot made it quick to pinpoint the cause
What Went Poorly
- Code review did not catch that the cache had no eviction policy
- There was no alert on memory usage (only on OOM)
Action Items
- [FILL IN] Add a cache size limit and TTL
- [FILL IN] Add a memory usage alert
- [FILL IN] Improve the code review checklist
Follow-ups
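- P0: Fix the cache code by adding LRU eviction and a TTL (service owner)
- P1: Add a pod memory usage alert (above 80% of the limit) (on-call SRE)
- P1: Review the v3.7 code changes to make sure there are no other leaks (service owner)
- P2: Update the deploy process so that performance tests cover memory stability (platform team)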
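For the memory alert follow-up, 80% of the 512Mi limit is 409.6Mi. A minimal sketch of the check, with hypothetical function names and a hypothetical pod-sample shape (in practice this would more likely be a rule in the monitoring system than application code):

```javascript
// Alert threshold in Mi for a given memory limit and fraction of that limit.
function alertThresholdMi(limitMi, fraction = 0.8) {
  return limitMi * fraction;
}

// Names of pods whose memory usage exceeds the threshold.
// `pods` is an assumed shape: [{ name: string, memoryMi: number }].
function podsOverThreshold(pods, limitMi, fraction = 0.8) {
  const threshold = alertThresholdMi(limitMi, fraction);
  return pods.filter((pod) => pod.memoryMi > threshold).map((pod) => pod.name);
}

const sample = [
  { name: 'order-svc-a', memoryMi: 500 },
  { name: 'order-svc-b', memoryMi: 210 },
];
console.log(podsOverThreshold(sample, 512));
```

With the trend from the raw context (180Mi at 04:00 UTC rising to 500Mi at 09:00 UTC, about 64Mi per hour), usage would have crossed 409.6Mi roughly 3.6 hours after the deploy, near 07:35 UTC. Assuming the growth was linear throughout, that is more than an hour and a half of warning before the first OOMKilled pod at 09:12.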
Similar past incidents
lexical match (pg_trgm)
- 37% [Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 35% [Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 19% [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 19% [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 18% [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts