[Eval][v2][zh] Order service OOM crashloop following v3.7 deploy
service: order-svc
created: 6/2/2026, 11:53:51 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC - order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12 - first OOMKilled pod
- 09:14 - pager
- 09:16 - service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
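Since 09:12 UTC, order-svc Pods have been stuck in an OOMKilled restart loop, with elevated p99 latency and roughly 3% of requests timing out. The root cause is a memory leak introduced by the v3.7 deploy: the Map-based request cache has no eviction policy, so memory grows linearly until the Pod is OOM-killed. Of the 12 Pods, 3 have crashed and the remaining 9 are close to their memory limit; the HPA has already reached its maximum replica count.
Severity reasoning: User-visible failure: the error rate rose from 0.3% to 3.2% (above 1% for more than 5 minutes) and affects the checkout flow (a revenue path), which meets the SEV1 criteria.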
deepseek-chat·prompt v2·output: zh·21938ms·2167↑ / 1631↓ tok·$0.00238
Root cause hypotheses
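- high: The Map-based request cache added in v3.7 has no eviction policy, causing a memory leak
Evidence: The heap profile shows ~340Mi held by a Map keyed on request ID that is never cleared; memory grew linearly from 180Mi to 500Mi after the deploy.
- medium: The switch to fast-json-stringify introduced an allocation problem that prevents objects from being reclaimed
Evidence: The changelog mentions the parser switch, but the heap profile shows no anomalous parser-related objects; the growth pattern is consistent with the cache leak.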
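- low: An external dependency (such as a database connection pool) is not being closed properly, leaking memory
Evidence: The logs show no connection errors, and the captured heap profile attributes the retained memory to the request-ID Map rather than to connection objects, which does not fit a connection pool leak.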
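The linear-growth evidence behind the top hypothesis can be checked with simple arithmetic. The sketch below is illustrative only: it takes the two memory samples and the 512Mi limit from the raw incident context and assumes the growth stays linear between them. The type and function names are hypothetical.

```typescript
// Figures come from the raw incident context; the linear model is an assumption.
interface MemorySample {
  /** Minutes since midnight UTC. */
  minuteOfDay: number;
  /** Memory usage in Mi. */
  memoryMi: number;
}

/** Growth rate in Mi per hour between two samples. */
function growthRateMiPerHour(start: MemorySample, end: MemorySample): number {
  const hours = (end.minuteOfDay - start.minuteOfDay) / 60;
  return (end.memoryMi - start.memoryMi) / hours;
}

/** Minute of day at which a linear trend starting at `start` reaches `limitMi`. */
function projectedLimitMinute(start: MemorySample, rateMiPerHour: number, limitMi: number): number {
  const hoursToLimit = (limitMi - start.memoryMi) / rateMiPerHour;
  return start.minuteOfDay + hoursToLimit * 60;
}

/** Formats a minute of day as HH:MM, truncating fractional minutes. */
function formatMinute(minuteOfDay: number): string {
  const h = Math.floor(minuteOfDay / 60);
  const m = Math.floor(minuteOfDay % 60);
  return (h < 10 ? "0" : "") + h + ":" + (m < 10 ? "0" : "") + m;
}

const atDeploy: MemorySample = { minuteOfDay: 4 * 60, memoryMi: 180 };
const beforeCrash: MemorySample = { minuteOfDay: 9 * 60, memoryMi: 500 };

const leakRate = growthRateMiPerHour(atDeploy, beforeCrash);
const limitMinute = projectedLimitMinute(atDeploy, leakRate, 512);

// Prints: 64 Mi/h, 512Mi limit reached around 09:11 UTC
console.log(leakRate + " Mi/h, 512Mi limit reached around " + formatMinute(limitMinute) + " UTC");
```

Under these assumptions the projected time of hitting the limit (about 09:11 UTC) lines up with the first OOMKilled pod at 09:12, which is consistent with a leak that started at the 04:00 deploy.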
Investigation checklist
- Confirm the memory trend of the Pods that are still running
kubectl top pod -n prod -l app=order-svc --sort-by=memory | head -15
Expected: memory usage is close to the 512Mi limit and still growing
- Inspect the v3.7 deployment details and confirm the cache implementation
kubectl get deployment order-svc -n prod -o jsonpath='{.spec.template.spec.containers[0].image}' && kubectl rollout history deployment order-svc -n prod --revision=$(kubectl rollout history deployment order-svc -n prod | grep v3.7 | awk '{print $1}')
Expected: the image tag is v3.7. The revision lookup relies on the rollout's CHANGE-CAUSE mentioning v3.7; the changelog entry to confirm is 'in-process request cache (Map-based, no eviction)'.
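- Capture a heap dump from a running Pod and analyze the Map size
POD=$(kubectl get pod -n prod -l app=order-svc --field-selector=status.phase=Running -o jsonpath='{.items[0].metadata.name}') && kubectl exec -n prod "$POD" -- sh -c 'jcmd 1 GC.heap_dump /tmp/heap.hprof' && kubectl cp -n prod "$POD:/tmp/heap.hprof" ./heap.hprof
Expected: the heap dump shows java.util.HashMap instances keyed on request ID holding most of the memory. Note that jcmd and .hprof apply only to a JVM service; fast-json-stringify is a Node.js library, so if order-svc runs on Node.js, capture a V8 heap snapshot instead and expect a large Map.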
- Check the HPA configuration and current replica count
kubectl get hpa order-svc -n prod -o yaml
Expected: maxReplicas=12 and current replicas=12, so the HPA is at its ceiling
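- Check whether another service is driving a request surge against order-svc
kubectl logs -n prod -l app=order-svc --since=30m --tail=-1 | grep -c 'incoming request'
Expected: the request rate shows no significant increase over baseline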
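The first checklist step can be partly automated. The sketch below parses the text output of `kubectl top pod` and flags Pods above a chosen fraction of the memory limit. The sample output, the Pod names, and the function names are hypothetical; the 512Mi limit comes from the incident context, and the 80% fraction is borrowed from the alerting follow-up. Only values reported in Mi are handled.

```typescript
interface PodMemory {
  name: string;
  memoryMi: number;
}

/** Parses `kubectl top pod` output (columns NAME, CPU(cores), MEMORY(bytes)). */
function parseTopOutput(output: string): PodMemory[] {
  const pods: PodMemory[] = [];
  for (const line of output.split("\n")) {
    const cols = line.trim().split(/\s+/);
    // Skip blank lines and the header row.
    if (cols.length < 3 || cols[0] === "NAME") continue;
    const match = /^(\d+)Mi$/.exec(cols[2]);
    if (match === null) continue; // value not reported in Mi
    pods.push({ name: cols[0], memoryMi: parseInt(match[1], 10) });
  }
  return pods;
}

/** Returns the Pods whose usage is at or above `fraction` of `limitMi`. */
function podsNearLimit(pods: PodMemory[], limitMi: number, fraction: number): PodMemory[] {
  return pods.filter((pod) => pod.memoryMi >= limitMi * fraction);
}

// Hypothetical sample output; real Pod names and values will differ.
const sampleOutput = [
  "NAME              CPU(cores)   MEMORY(bytes)",
  "order-svc-aaaaa   120m         498Mi",
  "order-svc-bbbbb   95m          471Mi",
  "order-svc-ccccc   80m          212Mi",
].join("\n");

const flagged = podsNearLimit(parseTopOutput(sampleOutput), 512, 0.8);

// Prints: order-svc-aaaaa, order-svc-bbbbb
console.log(flagged.map((pod) => pod.name).join(", "));
```

With a 512Mi limit the 80% threshold is 409.6Mi, so the two Pods above it are flagged and the recently restarted one is not.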
Mitigation plan
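Roll order-svc back to v3.6 (the last stable version) immediately
Risk: Connections may drop briefly during the rollback (about 30 seconds), but there is no risk of data loss
Rollback: Redeploy v3.7 (no further action is needed if the rollback resolves the problem)
If a rollback is not possible, temporarily raise the Pod memory limit to 1Gi and restart all Pods
Risk: A higher memory limit may mask the underlying problem, and the nodes may not have enough capacity; restarting every Pod at once would make all of them unavailable at the same time
Rollback: Restore the memory limit to 512Mi and retry the version rollback
After the rollback, remove the cache code or add an eviction policy
Risk: Code changes must go through testing; a direct hotfix may introduce new problems
Rollback: Return to the post-rollback v3.6 state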
Customer impact
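Some users are seeing timeouts or errors when placing orders; the checkout success rate has dropped from 99.7% to 96.8%, affecting roughly 3% of requests. Recovery is expected within 5 minutes of the rollback.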
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 04:00 - order-svc v3.7 rolling deploy started (100% complete by 04:08)
- 09:12 - First Pod OOMKilled
- 09:14 - Pager fired
- 09:16 - Error rate confirmed at 3.2%
- [FILL IN] - Rollback started
- [FILL IN] - Service recovered
Impact
Checkout success rate dropped from 99.7% to 96.8%, affecting roughly 3% of requests. 3 Pods crashed and the remaining 9 were close to their memory limit.
Root Cause
The Map-based request cache introduced in v3.7 has no eviction policy, so memory grew linearly until the Pods were OOM-killed.
Detection
Monitoring alerts fired (Pod OOMKilled and elevated error rate).
Response
[FILL IN]
What Went Well
[FILL IN]
What Went Poorly
[FILL IN]
Action Items
[FILL IN]
Follow-ups
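- P0: Fix the request cache by adding LRU-based eviction or using weak references (owner: order-svc service owner)
- P1: Add a memory usage alert with an 80% threshold for early warning (owner: platform team)
- P1: Add memory leak detection to the deploy pipeline, for example by monitoring memory growth during integration tests (owner: CI/CD team)
- P2: Review the memory usage of fast-json-stringify to confirm it does not leak (owner: order-svc service owner)
- P2: Update the HPA configuration to allow autoscaling on memory utilization (owner: platform team)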
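As a starting point for the P0 item, the sketch below shows one way to bound a Map-based cache with LRU eviction. It is not the actual order-svc code: it assumes the service runs on Node.js (suggested by fast-json-stringify, not confirmed by the incident data), and the class name, the capacity, and the request IDs are all hypothetical. It relies on the fact that a JavaScript Map iterates its keys in insertion order.

```typescript
/** Minimal LRU cache: a Map whose oldest-inserted key is the least recently used. */
class LruCache<K, V> {
  private readonly entries: Map<K, V> = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    if (maxEntries < 1) throw new Error("maxEntries must be at least 1");
    this.maxEntries = maxEntries;
  }

  /** Returns the cached value and marks the key as most recently used. */
  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so the key moves to the end of the iteration order.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  /** Stores a value, evicting the least recently used entry when over capacity. */
  set(key: K, value: V): void {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value as K;
      this.entries.delete(oldestKey);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Hypothetical usage with a tiny capacity to make eviction visible.
const requestCache = new LruCache<string, string>(2);
requestCache.set("req-1", "response-1");
requestCache.set("req-2", "response-2");
requestCache.get("req-1"); // req-1 is now the most recently used
requestCache.set("req-3", "response-3"); // evicts req-2

// Prints: 2 false
console.log(requestCache.size, requestCache.get("req-2") !== undefined);
```

A fixed entry cap keeps memory bounded regardless of request volume. If entries should also expire by age, or if entry sizes vary widely, a time-based or size-based limit may be a better fit; that trade-off is left to the service owner.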
Similar past incidents
lexical match (pg_trgm)
- 35%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 33%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 18%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 18%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 17%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts