[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
service: order-svccreated: 6/2/2026, 11:51:31 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason. Kubernetes: - Pods: 12 total, 3 currently OOMKilled, 9 running - Memory limit: 512Mi, requests: 256Mi - Recent restarts/hour: 18 (up from baseline of 0–1) - HPA scaling has hit max replicas (12) Memory trend (last 24h): - Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash - Pre-deploy baseline: stable at 200Mi indefinitely Deploy history: - 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08) - Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)" Logs: - No FATAL/ERROR pattern; pods exit silently - Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared On-call: - 09:12 — first OOMKilled pod - 09:14 — pager - 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled every ~20 minutes due to a memory leak introduced in v3.7. A new in-process request cache (Map-based, no eviction) is holding ~340Mi of heap, causing pods to hit the 512Mi limit and crash. 3 of 12 pods are in CrashLoopBackOff, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%.
Severity reasoning: User-facing degradation (3% request timeouts, checkout success rate drop) but not a full outage; revenue impact likely but limited. Scope: subset of users hitting degraded pods. Reversibility: rollback or feature flag disable can stop the leak.
deepseek-chat·prompt v1·output: en·10499ms·1375↑ / 1454↓ tok·$0.00197
Root cause hypotheses
- highIn-process request cache (Map-based, no eviction) introduced in v3.7 grows unbounded, consuming all heap memory.
Evidence: Heap profile from captured pod shows ~340Mi held by Map keyed on request ID, never cleared. Memory trend shows linear growth from 180Mi to 500Mi over 5 hours post-deploy.
- mediumJSON parser switch to fast-json-stringify causes memory fragmentation or retention of large strings.
Evidence: Changelog mentions parser switch; however heap profile points to Map, not string buffers. Likely secondary factor.
- lowNode-level memory pressure or cgroup misconfiguration causing early OOM.
Evidence: Memory limit is 512Mi, pods reach 500Mi then crash. Other pods on same node not affected. Unlikely.
Investigation checklist
- Confirm the memory leak source by inspecting heap dump from a running pod.
kubectl exec -n <namespace> <pod-name> -- jcmd 1 GC.heap_dump /tmp/heap.hprof && kubectl cp <namespace>/<pod-name>:/tmp/heap.hprof ./heap.hprofExpected: Heap dump shows Map entries with request IDs as keys, no eviction logic.
- Check if the cache is configurable via feature flag or environment variable.
kubectl get deployment order-svc -o yaml | grep -i -E 'feature|flag|cache|evict'Expected: No feature flag found, or flag exists but is enabled.
- Verify memory usage trend across all pods to confirm leak is uniform.
kubectl top pod -n <namespace> --sort-by=memory | grep order-svcExpected: All pods show memory usage increasing over time, similar slope.
- Check if rolling back to v3.6 resolves the issue in a canary pod.
kubectl set image deployment/order-svc-canary order-svc=order-svc:v3.6Expected: Canary pod memory stabilizes at ~200Mi.
Mitigation plan
Disable the in-process request cache via feature flag if available, or roll back to v3.6.
Risk: Rollback may cause brief connection draining; feature flag disable may increase latency for uncached requests.
Rollback: Re-enable cache flag or redeploy v3.7.
Increase memory limit to 1Gi temporarily to buy time for root cause fix.
Risk: Higher memory usage may cause node pressure; ensure cluster capacity.
Rollback: Revert limit to 512Mi after fix.
Add a TTL-based eviction to the cache and redeploy as v3.7.1.
Risk: New code may introduce bugs; test in staging first.
Rollback: Revert to v3.6.
Customer impact
Approximately 3% of checkout requests are timing out, and p99 latency is degraded. Users may see intermittent errors during checkout. No data loss. ETA for full mitigation: 1 hour for rollback, 4 hours for patched release.
Postmortem draft
Postmortem: order-svc OOMKilled
Summary: order-svc v3.7 introduced a memory leak via an unbounded in-process request cache, causing pods to OOM every ~20 minutes.
Timeline:
- 04:00 UTC: v3.7 deployed
- 09:12: First OOMKilled pod
- 09:14: Pager triggered
- 09:16: Incident declared
- [Mitigation time]: Rollback to v3.6 / cache disabled
Impact: 3% request timeouts, checkout success rate 96.8% (down from 99.7%), 3 pods in CrashLoopBackOff.
Root Cause: Map-based request cache without eviction grew unbounded, consuming heap until OOM.
What Went Well: Heap profile captured quickly; memory trend visible.
What Went Poorly: No feature flag for cache; no memory limit alert before crash; code review missed missing eviction.
Action Items:
- [P0] Add TTL eviction to cache
- [P1] Add memory usage alert at 80% of limit
- [P1] Require feature flags for new in-memory caches
- [P2] Review code for similar patterns
Follow-ups
- P0Add TTL eviction to request cache and release v3.7.1— service owner
- P1Set up memory usage alert at 80% of limit for order-svc— on-call SRE
- P1Require feature flags for any new in-memory cache in code review— platform team
- P2Review all services for similar unbounded caches— service owner
Similar past incidents
lexical match (pg_trgm)
- 83%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 74%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 31%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 30%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts