[Eval][v1][zh] Order service OOM crashloop following v3.7 deploy
service: order-svc
created: 6/2/2026, 11:36:46 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC: order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12: first OOMKilled pod
- 09:14: pager
- 09:16: service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
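Since v3.7 was deployed at 04:00 UTC, order-svc memory has grown linearly, with pods being OOMKilled roughly every 20 minutes and stuck in a restart loop. Currently 3 pods are OOMKilled and 9 are running; p99 latency is elevated, about 3% of requests time out, and the checkout success rate has dropped from 99.7% to 96.8%. The root cause is the newly added in-process request cache (Map-based), which has no eviction policy and therefore leaks memory.
Severity reasoning: The checkout flow is affected for users (success rate down about 2.9 percentage points), but the service remains partially available. This is not a full outage and no data has been lost. It is a significant user-visible degradation, which meets the SEV2 definition.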
deepseek-chat·prompt v1·output: zh·9509ms·1786↑ / 1400↓ tok·$0.00202
Root cause hypotheses
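- high: The in-process request cache (Map) introduced in v3.7 has no eviction policy, so memory grows without bound
Evidence: The heap snapshot shows about 340Mi held by a Map keyed on request ID that is never cleared; memory has grown linearly since the 04:00 deploy.
- low: Memory leak in the fast-json-stringify library
Evidence: The changelog mentions the JSON parser switch, but the heap snapshot is dominated by the Map, not by serialization buffers.
- low: Memory pressure on the Kubernetes nodes is causing pods to be evicted
Evidence: Pods exit with reason OOMKilled, not Evicted; no node memory anomalies have been reported.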
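The linear-growth evidence behind the first hypothesis can be sanity-checked with a quick projection. The sketch below uses only the figures from the raw incident context and assumes growth is a straight line through the two reported points (180Mi at 04:00 UTC, 500Mi at 09:00 UTC); the helper name toClock is illustrative.

```typescript
// Figures from the raw incident context; the straight-line model is an assumption.
const startMi = 180;     // memory at 04:00 UTC (deploy time)
const endMi = 500;       // memory at 09:00 UTC
const elapsedHours = 5;  // 04:00 -> 09:00
const limitMi = 512;     // pod memory limit

// Growth rate under a straight-line fit through the two reported points.
const growthMiPerHour = (endMi - startMi) / elapsedHours;

// Hours after the deploy at which a pod would reach its memory limit.
const hoursToLimit = (limitMi - startMi) / growthMiPerHour;

// Convert "start hour + offset" into an HH:MM clock string (hypothetical helper).
function toClock(startHour: number, hoursLater: number): string {
  const totalMinutes = Math.round((startHour + hoursLater) * 60);
  const hh = Math.floor(totalMinutes / 60) % 24;
  const mm = totalMinutes % 60;
  return (hh < 10 ? "0" : "") + hh + ":" + (mm < 10 ? "0" : "") + mm;
}

const projectedOom = toClock(4, hoursToLimit);
console.log(growthMiPerHour + " Mi/h, limit reached around " + projectedOom + " UTC");
```

Under these assumptions the projection lands within about a minute of the first OOMKilled pod at 09:12, which is consistent with a leak that started at the v3.7 deploy.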
Investigation checklist
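- Confirm the size of the cache Map
kubectl exec <pod-name> -- sh -c 'curl -s "localhost:8080/debug/pprof/heap?debug=1" | grep -A5 "Map"'
Expected: the Map entry count keeps growing with no upper bound
- Check whether the cache is ever cleared
kubectl logs <pod-name> --tail=100 | grep -i 'cache\|evict\|clear'
Expected: no log lines related to cache cleanup
- Verify the relationship between memory growth and request rate
kubectl exec <pod-name> -- sh -c 'curl -s localhost:8080/metrics | grep -E "http_requests_total|process_resident_memory_bytes"'
Expected: memory grows linearly with request count, with no sign of being reclaimed
- Check whether the HPA has reached its ceiling
kubectl describe hpa order-svc-hpa
Expected: current replicas 12/12, already at the maximum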
Mitigation plan
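Roll back order-svc to v3.6 (the last stable version)
Risk: Brief connection interruptions (about 30 seconds) are possible during the rollback, but this is safer than the ongoing OOM kills
Rollback: Redeploy the v3.7 image to restore the current state
Temporarily raise the pod memory limit to 1Gi to reduce OOM frequency
Risk: This may mask the problem, and node resources may be insufficient; confirm the nodes have enough memory first
Rollback: Restore the memory limit to 512Mi
Add a TTL eviction policy to the cache in code (for example, 5 minutes)
Risk: Requires rebuilding and redeploying the image, which takes longer; prefer the temporary mitigations first
Rollback: Roll back to the version without the cache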
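The code-level mitigation (a TTL on the cache, plus the entry cap listed among the action items) might look roughly like the sketch below. It is not the order-svc implementation: the class name BoundedTtlCache, the 10000-entry default, and the injectable clock are all assumptions made for illustration. It relies on the fact that a JavaScript Map iterates in insertion order, so the first key is the oldest.

```typescript
// Hypothetical sketch: names and defaults are illustrative assumptions,
// not taken from the order-svc codebase.
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

class BoundedTtlCache<V> {
  private entries: Map<string, CacheEntry<V>> = new Map();
  private ttlMs: number;
  private maxEntries: number;
  private now: () => number;

  constructor(
    ttlMs: number = 5 * 60 * 1000,        // 5 minutes, as suggested in the plan
    maxEntries: number = 10000,           // assumed cap; size it from real traffic
    now: () => number = () => Date.now(), // injectable clock, useful in tests
  ) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily drop an expired entry on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Delete first so a re-inserted key moves to the end of insertion order.
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // Evict the oldest-inserted entries until the cap is respected.
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  // Remove all expired entries and return how many were dropped.
  sweep(): number {
    const cutoff = this.now();
    let removed = 0;
    this.entries.forEach((entry, key) => {
      if (entry.expiresAt <= cutoff) {
        this.entries.delete(key);
        removed++;
      }
    });
    return removed;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

With a cap in place, memory is bounded by maxEntries times the typical entry size even if sweep() is never called; a periodic sweep (for example via setInterval) only releases expired entries sooner.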
Customer impact
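About 3% of checkout requests time out or fail, p99 latency is elevated, and some users may see error pages. The service is available overall, but the experience is degraded. Recovery is expected within 10 minutes of the rollback.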
Postmortem draft
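Postmortem
Summary
After the order-svc v3.7 deploy, a memory leak caused frequent pod OOM kills and reduced the checkout success rate.
Timeline
- 04:00 UTC v3.7 deployed
- 09:12 First OOM
- 09:14 Alert fired
- 09:16 Investigation started
- 09:30 Rolled back to v3.6, service recovered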
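Impact
- Checkout success rate dropped from 99.7% to 96.8%
- 3 pods repeatedly OOMKilled, with higher load on the remaining 9
- Impact lasted about 18 minutes
Root cause
The in-process request cache (Map) added in v3.7 has no eviction policy, so memory grew without bound until the pods were OOMKilled.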
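What went well
- The alert fired promptly
- The heap snapshot pinpointed the problem quickly
What went poorly
- Code review did not catch the missing eviction policy on the cache
- No memory stress testing was performed
Action items
- [ ] Add a TTL and a maximum entry count to the cache
- [ ] Add a memory usage alert (>80% of limit)
- [ ] Add memory leak detection to CI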
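To gauge how much warning the proposed 80% alert would have given, the calculation below estimates the lead time between the alert firing and a pod hitting its limit. The 512Mi limit comes from the incident context; the 64 Mi/h growth rate is derived from the reported trend (180Mi to 500Mi over 5 hours) and assumes growth stays linear. The function name is illustrative.

```typescript
// Minutes between crossing the alert threshold and reaching the memory limit,
// assuming memory keeps growing at a constant rate (an assumption).
function alertLeadTimeMinutes(
  limitMi: number,
  alertFraction: number,
  growthMiPerHour: number,
): number {
  const headroomMi = limitMi * (1 - alertFraction); // memory left once the alert fires
  return Math.round((headroomMi / growthMiPerHour) * 60);
}

// 512Mi limit, alert at 80% of limit, 64 Mi/h derived from (500 - 180) / 5.
const leadMinutes = alertLeadTimeMinutes(512, 0.8, 64);
console.log("alert would fire about " + leadMinutes + " minutes before the limit");
```

At this growth rate the alert would have fired about an hour and a half before the first OOM kill, instead of paging two minutes after it.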
Follow-ups
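- P0: Add a TTL and a maximum entry count to the in-process cache (owner: service owner)
- P1: Add a memory utilization alert that fires above 80% of the limit (owner: on-call SRE)
- P1: Add memory leak detection to the CI pipeline, e.g. pprof comparison (owner: platform team)
- P2: Review the other v3.7 changes to make sure there are no similar issues (owner: service owner)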
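For the CI leak-detection follow-up, one lightweight alternative to comparing heap profiles is to sample process memory during a load test and fail the build when the fitted growth slope is too steep. The sketch below is that swapped-in technique, not the pprof comparison named above; the MemorySample shape, function names, and the 0.5 Mi/min threshold are all hypothetical.

```typescript
// Hypothetical sample shape: memory readings taken during a CI load test.
interface MemorySample {
  minutes: number; // time since the start of the run
  rssMi: number;   // resident memory in Mi
}

// Ordinary least-squares slope of memory over time, in Mi per minute.
function slopeMiPerMinute(samples: MemorySample[]): number {
  const n = samples.length;
  if (n < 2) throw new Error("need at least two samples");
  let sumT = 0;
  let sumM = 0;
  for (const s of samples) {
    sumT += s.minutes;
    sumM += s.rssMi;
  }
  const meanT = sumT / n;
  const meanM = sumM / n;
  let num = 0;
  let den = 0;
  for (const s of samples) {
    num += (s.minutes - meanT) * (s.rssMi - meanM);
    den += (s.minutes - meanT) * (s.minutes - meanT);
  }
  if (den === 0) throw new Error("samples need distinct timestamps");
  return num / den;
}

// Flag a run whose memory grows faster than the allowed slope.
function looksLikeLeak(samples: MemorySample[], maxSlopeMiPerMinute: number): boolean {
  return slopeMiPerMinute(samples) > maxSlopeMiPerMinute;
}

// Samples shaped like this incident (about 64 Mi/h) versus a flat baseline.
const leaking: MemorySample[] = [
  { minutes: 0, rssMi: 180 },
  { minutes: 60, rssMi: 244 },
  { minutes: 120, rssMi: 308 },
];
const stable: MemorySample[] = [
  { minutes: 0, rssMi: 200 },
  { minutes: 60, rssMi: 200 },
  { minutes: 120, rssMi: 200 },
];
console.log(looksLikeLeak(leaking, 0.5), looksLikeLeak(stable, 0.5));
```

A slope check cannot say which object is leaking, so it complements heap profiling rather than replacing it; the threshold would need tuning against normal warm-up growth.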
Similar past incidents
lexical match (pg_trgm)
- 35%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 34%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 18%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 17%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 17%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts