[Eval][v2][zh] Order service OOM crashloop following v3.7 deploy
service: order-svc
created: 6/2/2026, 11:53:09 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC - order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12 - first OOMKilled pod
- 09:14 - pager
- 09:16 - service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
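As of 09:15 UTC, order-svc pods are in an OOMKilled restart loop (first kill at 09:12), p99 latency is elevated, and about 3% of requests are timing out. The root cause is the in-process request cache introduced by the v3.7 deploy (Map-based, no eviction policy), which makes memory grow linearly until the pod is OOM-killed.
Severity reasoning: the user-facing error rate is above 1% (up from 0.3% to 3.2%) and has persisted for more than 5 minutes, which meets the SEV1 criteria.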
deepseek-chat·prompt v2·output: zh·9895ms·2167↑ / 1485↓ tok·$0.00222
Root cause hypotheses
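- high: The Map cache introduced in v3.7 has no eviction policy, so memory grows continuously until OOM
Evidence: The heap snapshot shows about 340Mi held by a Map keyed on request ID that is never cleared; the deploy time matches the start of the memory growth.
- medium: The 512Mi memory limit is too small to absorb the cache overhead under normal traffic
Evidence: Memory grew linearly from 180Mi to 500Mi and then crashed; the limit is 512Mi, so usage was close to the ceiling.
- low: After the HPA reached max replicas, per-pod load increased and added memory pressure
Evidence: The HPA is at maxReplicas=12 and cannot scale out further, so the remaining pods take on more requests.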
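The timing argument behind the high-confidence hypothesis can be checked with simple arithmetic. The sketch below uses only the figures from the raw incident context (180Mi at 04:00, 500Mi at 09:00, 512Mi limit) and assumes growth stays linear; the function names are illustrative, not part of any existing tooling.

```python
def growth_rate_mi_per_hour(start_mi: float, end_mi: float, hours: float) -> float:
    """Average memory growth in Mi per hour over the observed window."""
    return (end_mi - start_mi) / hours


def hours_until_limit(start_mi: float, rate_mi_per_hour: float, limit_mi: float) -> float:
    """Hours for a pod starting at start_mi to reach limit_mi at a constant rate."""
    return (limit_mi - start_mi) / rate_mi_per_hour


# Figures from the incident context: 180Mi at 04:00 UTC, 500Mi at 09:00 UTC.
rate = growth_rate_mi_per_hour(180, 500, 5)       # 64.0 Mi/hour
to_512mi = hours_until_limit(180, rate, 512)      # 5.1875 hours
to_1gi = hours_until_limit(180, rate, 1024)       # 13.1875 hours

print(f"growth rate: {rate} Mi/h")
print(f"time to 512Mi limit: {to_512mi} h")
print(f"time to 1Gi limit: {to_1gi} h")
```

Under these assumptions a pod deployed at 04:00 reaches 512Mi about 5 hours 11 minutes later, which lines up with the first OOMKilled pod at 09:12. The same rate suggests a 1Gi limit would only delay the kill to roughly 13 hours after a restart, so a higher limit buys time but does not remove the leak. Real growth depends on traffic, so these numbers are estimates.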
Investigation checklist
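- Confirm the number of entries in the cache Map and how fast it grows
kubectl exec -n prod -c order-svc <pod-name> -- sh -c 'curl -s localhost:8080/debug/vars | jq ".cache_size"'
Expected: returns the current cache entry count, which should be far above 0 and keep growing across repeated calls (assumes the service exposes this debug endpoint)
- Review the v3.7 deploy change details to confirm the cache implementation
kubectl rollout history deployment/order-svc -n prod --revision=2 | grep -A5 "fast-json-stringify"
Expected: if the change-cause annotation carries the changelog, shows 'in-process request cache (Map-based, no eviction)'; substitute the actual revision number of the v3.7 rollout
- Capture a heap snapshot from a running pod that is approaching the limit to confirm the memory distribution
kubectl exec -n prod -c order-svc <pod-name> -- sh -c 'kill -USR2 1'
Expected: assuming the service is a Node.js process running as PID 1 and started with --heapsnapshot-signal=SIGUSR2, a .heapsnapshot file is written to the working directory; it should show the Map keyed on request ID dominating retained memory
- Check the HPA configuration and current replica count
kubectl get hpa order-svc-hpa -n prod -o yaml | grep -E "maxReplicas|currentReplicas"
Expected: maxReplicas: 12, currentReplicas: 12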
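To complement the checks above, the sketch below lists which pods were last terminated by the OOM killer, based on the `status.containerStatuses[].lastState.terminated.reason` field of the Kubernetes pod object. The pod names and the sample document are made up for illustration; in practice the input would be the parsed output of `kubectl get pods -n prod -o json`.

```python
def oom_killed_pods(pod_list: dict) -> list[tuple[str, int]]:
    """Return (pod name, restart count) for containers whose last termination was OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # lastState holds the previous termination; it is empty for containers
            # that have never restarted.
            terminated = cs.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((name, cs.get("restartCount", 0)))
    return hits


# Hypothetical sample shaped like `kubectl get pods -o json` output.
sample = {
    "items": [
        {
            "metadata": {"name": "order-svc-aaa"},
            "status": {"containerStatuses": [{
                "name": "order-svc",
                "restartCount": 4,
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }]},
        },
        {
            "metadata": {"name": "order-svc-bbb"},
            "status": {"containerStatuses": [{
                "name": "order-svc",
                "restartCount": 0,
                "lastState": {},
            }]},
        },
    ]
}

print(oom_killed_pods(sample))  # [('order-svc-aaa', 4)]
```

To run this against a live cluster, the sample could be replaced with `json.load(sys.stdin)` and the script fed from `kubectl get pods -n prod -o json`, adding whatever label selector identifies order-svc pods in that cluster.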
Mitigation plan
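Temporarily raise the memory limit to 1Gi and restart all pods to clear the cache
Risk: May mask the underlying problem and increases the risk of resource contention; confirm the cluster has enough spare memory capacity
Rollback: Set the memory limit back to 512Mi and redeploy v3.6
Roll the deployment back to v3.6 (the last stable version)
Risk: Brief unavailability is possible during the rollback; confirm the v3.6 image is still available
Rollback: Redeploy v3.7 with the cache fix applied
Add TTL-based eviction or a maximum entry count to the cache implementation
Risk: The code change requires a new deploy and may introduce new bugs
Rollback: Roll back to the unmodified v3.7 or to v3.6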
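The third mitigation (TTL eviction or a maximum entry count) can be sketched as follows. This is a minimal illustration in Python, not the service's actual code, which appears to use a JavaScript `Map`; the class name, parameters, and the injectable `clock` are assumptions made for the example. It combines both bounds: entries expire after `ttl_seconds`, and the least recently used entry is dropped once `max_entries` is exceeded.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """Cache bounded by entry count (LRU eviction) and by age (TTL expiry)."""

    def __init__(self, max_entries: int, ttl_seconds: float, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value), oldest use first

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            # Expired entries are removed lazily, on access.
            del self._data[key]
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._data)


# Demo with a fake clock so expiry is deterministic.
now = [0.0]
cache = BoundedTTLCache(max_entries=2, ttl_seconds=10, clock=lambda: now[0])
cache.set("req-1", "a")
cache.set("req-2", "b")
cache.get("req-1")          # touch req-1 so req-2 becomes the eviction candidate
cache.set("req-3", "c")     # exceeds max_entries, evicts req-2
print(len(cache))           # 2
```

Because expiry here is lazy, memory is capped by `max_entries` rather than by the TTL alone; the entry cap is what prevents the unbounded growth seen in this incident. The cap should be sized from the typical entry size and the pod's memory limit.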
Customer impact
Some users hit timeouts or failures when placing orders; checkout success rate fell from 99.7% to 96.8%, affecting roughly 3% of requests.
Postmortem draft
Summary
After the v3.7 deploy, a memory leak in order-svc caused pods to be OOMKilled, affecting about 3% of requests.
Timeline (UTC)
- 04:00 - v3.7 rolling deploy started (100% complete by 04:08)
- 09:12 - First pod OOMKilled
- 09:14 - Pager fired
- 09:16 - Service degradation confirmed
- [FILL IN] - Mitigation executed
Impact
- Error rate rose from 0.3% to 3.2%
- p99 latency increased
- 3 pods in CrashLoopBackOff
Root Cause
The in-process request cache introduced in v3.7 (Map-based, no eviction policy) caused memory to grow linearly until pods were OOM-killed.
Detection
Pod OOMKilled events triggered the alert; monitoring showed memory usage rising steadily.
Response
- Raise the memory limit as a temporary mitigation
- Roll back to v3.6
- Fix the cache implementation
What Went Well
- Alerting was timely
- The heap dump provided clear evidence
What Went Poorly
- Code review did not catch that the cache had no eviction policy
- No alert on memory usage trend
Action Items
- [ ] Add TTL or LRU eviction to the cache
- [ ] Set up a memory usage trend alert
- [ ] Run memory load tests in staging
Follow-ups
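- P0: Add TTL or LRU eviction to the cache (owner: service owner)
- P1: Set up a memory usage trend alert, e.g. 80% of limit sustained for 5 minutes (owner: on-call SRE)
- P1: Run memory load tests in staging to verify the fix (owner: service owner)
- P2: Review the code change process and add memory-related checklist items (owner: platform team)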
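The P1 alert condition (80% of limit sustained for 5 minutes) can be expressed as a small check over time-ordered samples. The sketch below is illustrative only: the function name, the sample format of `(timestamp_seconds, memory_mi)` pairs, and the one-minute sampling interval are assumptions, and a production alert would normally be defined in the monitoring system rather than in application code.

```python
def sustained_above(samples, limit_mi, fraction=0.8, window_s=300):
    """Return True if memory stayed at or above fraction * limit_mi for at
    least window_s contiguous seconds.

    samples: iterable of (timestamp_seconds, memory_mi), sorted by timestamp.
    """
    threshold = fraction * limit_mi
    run_start = None
    for ts, mem in samples:
        if mem >= threshold:
            if run_start is None:
                run_start = ts  # first sample of a new above-threshold run
            if ts - run_start >= window_s:
                return True
        else:
            run_start = None  # a dip below the threshold resets the window
    return False


# Hypothetical one-minute samples against the 512Mi limit (threshold 409.6Mi).
rising = [(0, 400), (60, 410), (120, 420), (180, 430), (240, 440), (300, 450), (360, 460)]
dipping = [(0, 410), (60, 420), (120, 400), (180, 430), (240, 440), (300, 450)]

print(sustained_above(rising, limit_mi=512))   # True
print(sustained_above(dipping, limit_mi=512))  # False
```

At the growth rate observed in this incident (about 64Mi per hour from 180Mi), a pod would cross 80% of 512Mi roughly an hour and a half before reaching the limit, so an alert of this shape would likely have paged well ahead of the first OOMKilled pod.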
Similar past incidents
lexical match (pg_trgm)
- 38%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 35%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 19%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 19%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 18%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts