[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
service: order-svccreated: 5/25/2026, 10:21:02 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason. Kubernetes: - Pods: 12 total, 3 currently OOMKilled, 9 running - Memory limit: 512Mi, requests: 256Mi - Recent restarts/hour: 18 (up from baseline of 0–1) - HPA scaling has hit max replicas (12) Memory trend (last 24h): - Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash - Pre-deploy baseline: stable at 200Mi indefinitely Deploy history: - 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08) - Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)" Logs: - No FATAL/ERROR pattern; pods exit silently - Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared On-call: - 09:12 — first OOMKilled pod - 09:14 — pager - 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled every ~20 minutes due to a memory leak introduced in v3.7. The leak is caused by an in-process request cache (Map-based, no eviction) that grows unboundedly. 3 of 12 pods are in CrashLoopBackOff, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%. Immediate mitigation is to roll back the deployment or reduce memory pressure.
Severity reasoning: User-facing outage: error rate > 1% (3% requests timing out, checkout success rate 96.8% < 99%) for >5 min, revenue path broken (checkout failures), and risk of full outage if remaining pods also OOM.
deepseek-chat·prompt v2·output: en·11936ms
Root cause hypotheses
- highIn-process request cache (Map-based, no eviction) introduced in v3.7 causes unbounded memory growth.
Evidence: Heap profile from captured pod shows ~340Mi held by Map keyed on request ID, never cleared. Memory trend shows linear growth from 180Mi to 500Mi over 5 hours, then crash.
- mediumMemory limit (512Mi) is too low for the new JSON parser (fast-json-stringify) which may allocate more memory per request.
Evidence: Memory baseline was stable at 200Mi before v3.7. After deploy, memory grows to 500Mi before OOM. However, heap profile points to Map, not parser.
- lowA sudden traffic spike increased concurrent requests, causing the cache to grow faster.
Evidence: No traffic spike reported; HPA at max replicas suggests sustained load but not a spike. Memory growth is linear over hours, not stepwise.
Investigation checklist
- Check memory usage of running pods to confirm leak pattern.
kubectl top pod -n prod -l app=order-svc --sort-by=memory --no-headers | head -10Expected: Memory usage near 500Mi for pods running longest, lower for recently restarted.
- Inspect heap dump from a captured pod to identify largest allocations.
kubectl exec -n prod -c order-svc <pod-name> -- jcmd 1 GC.heap_dump /tmp/heap.hprof && kubectl cp -n prod <pod-name>:/tmp/heap.hprof ./heap.hprof && jhat heap.hprofExpected: Map entries (request ID keys) consuming ~340Mi.
- Check if the cache has any eviction or TTL logic in code.
grep -r 'evict\|TTL\|expire\|clear' /deploy/order-svc-v3.7/src/cache.jsExpected: No eviction logic found.
- Verify that the JSON parser change is not the primary cause by checking memory allocation per request.
kubectl exec -n prod -c order-svc <pod-name> -- curl -s localhost:8080/debug/vars | jq '.memstats.HeapAlloc'Expected: HeapAlloc grows over time, not per-request spike.
- Check if there is a memory leak in the JSON parser by comparing before/after deploy.
kubectl logs -n prod -l app=order-svc --since=6h | grep -i 'json\|parse' | tail -20Expected: No error logs; parser is not the source.
Mitigation plan
Roll back order-svc to v3.6 (previous stable version) to remove the memory leak.
Risk: Rollback may cause brief downtime during pod restart. Ensure traffic drains properly.
Rollback: Re-deploy v3.7 by running: kubectl rollout undo deployment/order-svc -n prod
If rollback is not immediately possible, increase memory limit to 1Gi to buy time.
Risk: Higher limit may mask the leak and cause node memory pressure. Monitor node memory.
Rollback: Revert memory limit to 512Mi: kubectl set resources deployment/order-svc -n prod --limits=memory=512Mi
Scale down HPA max replicas to 8 to reduce total memory consumption.
Risk: May increase latency under load; could cause overload on remaining pods.
Rollback: Set HPA max back to 12: kubectl autoscale deployment order-svc -n prod --max=12
Customer impact
Approximately 25% of users may experience timeouts or errors during checkout. Checkout success rate dropped from 99.7% to 96.8%. Affected users see 'Something went wrong' or slow page loads. No data loss expected.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 04:00 - order-svc v3.7 deployed
- 09:12 - First OOMKilled pod
- 09:14 - Pager triggered
- 09:16 - Incident declared, checkout success rate at 96.8%
- [FILL IN] - Rollback initiated
- [FILL IN] - All pods stable
Impact
- 3 of 12 pods in CrashLoopBackOff
- p99 latency degraded
- Checkout success rate dropped from 99.7% to 96.8%
- ~3% requests timing out
Root Cause
Memory leak in in-process request cache (Map-based, no eviction) introduced in v3.7. Cache grows unboundedly until OOM.
Detection
Pager alert for OOMKilled pods. No prior warning because memory growth was gradual.
Response
- Rolled back to v3.6
- Increased memory limit temporarily
- Monitored memory usage
What Went Well
- Quick rollback decision
- Heap dump analysis identified root cause
What Went Poorly
- No memory limit monitoring before deploy
- Cache was not reviewed for eviction
- No canary deployment
Action Items
- [FILL IN]
Follow-ups
- P1Add memory usage alert for order-svc pods at 80% of limit.— on-call SRE
- P1Implement eviction policy (TTL or LRU) for in-process cache.— service owner
- P2Add canary deployment for future releases.— platform team
- P2Review all in-process caches across services for unbounded growth.— service owner
- P2Add heap dump collection on OOM to automate root cause analysis.— platform team
Similar past incidents
lexical match (pg_trgm)
- 63%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 32%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 30%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 29%
[datadog] checkout p99 8s
checkout p99 8s