[Eval][v3][en] Order service OOM crashloop following v3.7 deploy
service: order-svc
created: 6/2/2026, 11:37:29 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC - order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12 - first OOMKilled pod
- 09:14 - pager
- 09:16 - service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled because of a memory leak introduced in v3.7. A new in-process request cache (Map-based, no eviction) holds ~340Mi of heap in the one pod that was profiled, which pushes pods past the 512Mi memory limit. 3 of 12 pods are currently OOMKilled and in CrashLoopBackOff, restarts have risen to 18 per hour (baseline 0–1), HPA is already at max replicas, and checkout success rate has dropped from 99.7% to 96.8%.
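Severity reasoning: User-facing impact on the revenue path: checkout success rate dropped from 99.7% to 96.8%, so roughly 3% of checkouts are failing (SEV1 threshold: error rate >1% for >5 min). The remaining pods still serve most traffic, so this is a partial degradation, not a full outage. This meets SEV1 criteria.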
deepseek-chat·prompt v3·output: en·10290ms·2007↑ / 1580↓ tok·$0.00228
Root cause hypotheses
- high: Memory leak from in-process request cache introduced in v3.7 (Map-based, no eviction).
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crashes. Changelog confirms cache added at 04:00 UTC.
- medium: Increased traffic causing more cache entries, exacerbating leak.
Evidence: Memory growth correlates with time since deploy, not traffic spikes. No traffic surge noted. Likely secondary factor.
- low: Memory limit too low for new cache overhead.
Evidence: Pre-deploy baseline was 200Mi stable. New cache adds ~340Mi, exceeding 512Mi limit. However, cache should have eviction; leak is the primary issue.
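The linear-growth evidence can be sanity-checked with a little arithmetic. The sketch below uses only the figures from the memory trend (180Mi at 04:00 UTC, 500Mi at 09:00 UTC, 512Mi limit) and assumes growth stays linear up to the limit; the helper name toClock is illustrative.

```typescript
// Figures taken from the memory trend in the raw incident context.
const startMi = 180;     // usage at 04:00 UTC (deploy time)
const endMi = 500;       // usage at 09:00 UTC
const elapsedHours = 5;  // 04:00 -> 09:00
const limitMi = 512;     // container memory limit

// Linear growth rate in Mi per hour.
const growthMiPerHour = (endMi - startMi) / elapsedHours;

// Hours after deploy at which a linearly growing pod reaches the limit.
const hoursToLimit = (limitMi - startMi) / growthMiPerHour;

// Convert "startHour + hoursLater" into an HH:MM clock string (illustrative helper).
function toClock(startHour: number, hoursLater: number): string {
  const totalMinutes = Math.round((startHour + hoursLater) * 60);
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return pad(Math.floor(totalMinutes / 60) % 24) + ":" + pad(totalMinutes % 60);
}

console.log(growthMiPerHour);          // 64
console.log(toClock(4, hoursToLimit)); // 09:11
```

Under these assumptions the projected time to hit the limit (about 09:11 UTC) lines up with the first OOMKilled pod at 09:12, which supports the high-confidence hypothesis over the traffic-driven one.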
Investigation checklist
- Confirm memory leak by checking heap dump from a running pod.
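POD=$(kubectl get pod -n prod -l app=order-svc -o jsonpath='{.items[0].metadata.name}') && kubectl exec -n prod "$POD" -- jcmd 1 GC.heap_dump /tmp/heap.hprof && kubectl cp "prod/$POD:/tmp/heap.hprof" ./heap.hprof
Note: kubectl exec does not accept a label selector, so the pod name is resolved first. jcmd applies only if order-svc runs on a JVM; the changelog (fast-json-stringify) and /app/app.js point to Node.js, in which case capture a V8 heap snapshot instead (for example with node's --heapsnapshot-signal option, if the process was started with it).
Expected: Heap dump shows Map with many entries and no eviction logic.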
- Check memory usage trend for all pods.
kubectl top pod -n prod -l app=order-svc --sort-by=memory
Expected: Running pods show memory usage near 500Mi, increasing over time.
- Verify cache implementation in v3.7 source code.
kubectl exec -n prod deploy/order-svc -- cat /app/app.js | grep -A 20 'cache'
Expected: Map-based cache without eviction or size limit.
- Check if rolling back to v3.6 resolves the issue.
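kubectl rollout history deployment/order-svc -n prod && kubectl rollout undo deployment/order-svc -n prod
Note: without --to-revision, undo returns to the previous revision. Pass --to-revision only after confirming in the history output which revision is v3.6.
Expected: Pods stop OOMKilling, memory stabilizes at ~200Mi.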
Mitigation plan
Rollback order-svc to v3.6 (previous stable version).
Risk: Brief period of mixed versions during rollout; potential cache invalidation issues. Low risk.
Rollback: Re-apply v3.7 (this reintroduces the leak): kubectl rollout undo deployment/order-svc -n prod --to-revision=<v3.7 revision shown by kubectl rollout history>
If rollback is not possible, increase the memory limit to 1Gi temporarily.
Risk: May mask leak; could cause node memory pressure. Higher blast radius.
Rollback: Revert limit to 512Mi: kubectl set resources deployment/order-svc -n prod --limits=memory=512Mi
Add eviction policy to cache (TTL or LRU) and redeploy as v3.8.
Risk: Requires code change and deployment; time to implement.
Rollback: Rollback to v3.6 if v3.8 fails.
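As a rough illustration of what "TTL or LRU" eviction could look like for a Map-based cache, here is a minimal sketch. It is not the order-svc implementation: the class name BoundedCache, its parameters (maxEntries, ttlMs) and the injectable clock are all hypothetical, and real limits would need to be sized against the 512Mi pod limit.

```typescript
// Hypothetical bounded cache: LRU eviction by entry count plus a per-entry TTL.
class BoundedCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock so expiry can be tested deterministically
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it instead of holding it forever
      return undefined;
    }
    // Re-insert to mark as most recently used (Map preserves insertion order).
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage with a fake clock: capacity 2, TTL 1000 ms.
let fakeNow = 0;
const requestCache = new BoundedCache<string>(2, 1000, () => fakeNow);
requestCache.set("req-1", "a");
requestCache.set("req-2", "b");
requestCache.get("req-1");               // touch req-1; req-2 is now least recently used
requestCache.set("req-3", "c");          // over capacity: evicts req-2
console.log(requestCache.get("req-2"));  // undefined
console.log(requestCache.size);          // 2
fakeNow = 1000;                          // TTL elapsed
console.log(requestCache.get("req-1"));  // undefined (expired)
```

The entry-count cap bounds memory regardless of traffic, while the TTL clears entries that are never read again. Since the leaked Map is keyed on request ID, it is also worth checking whether entries are ever re-read at all before investing in a cache.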
Customer impact
Approximately 3% of checkout requests are failing (success rate 96.8% against a 99.7% baseline). Users may see errors during payment processing. User-visible impact began around 09:12 UTC with the first OOMKilled pod and is ongoing.
Postmortem draft
Summary
order-svc experienced OOMKilled crashes due to a memory leak from an unbounded in-process request cache introduced in v3.7. Checkout success rate dropped from 99.7% to 96.8%.
Timeline (UTC)
- 04:00 - v3.7 deployed with new cache
- 09:12 - First OOMKilled pod
- 09:14 - Pager triggered
- 09:16 - Checkout success rate at 96.8% (baseline 99.7%); remaining pods still serving most traffic
- [FILL IN] - Rollback initiated
- [FILL IN] - Service stabilized
Impact
- 3 of 12 pods in CrashLoopBackOff
- Checkout success rate: 96.8% (vs 99.7% baseline)
- p99 latency: [FILL IN] (no latency figures in the raw incident context)
Root Cause
Unbounded Map-based request cache without eviction, causing memory to grow linearly until OOM.
Detection
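Detected through OOMKilled pod events, which paged on-call at 09:14, two minutes after the first OOMKilled pod. No alert fired on the five hours of linear memory growth (180Mi to 500Mi) that preceded the first crash.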
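One way the missed linear growth could have been caught is a slope check over recent memory samples. The sketch below fits a least-squares line and flags sustained growth; the function names, the 20 Mi/h threshold and the hourly sample points are assumptions for illustration (only the 180Mi and 500Mi endpoints and the 200Mi baseline come from the incident data, the intermediate points are interpolated).

```typescript
interface MemorySample {
  tHours: number;  // hours since the first sample
  usageMi: number; // container memory usage in Mi
}

// Least-squares slope of usage over time, in Mi per hour.
function slopeMiPerHour(samples: MemorySample[]): number {
  const n = samples.length;
  if (n < 2) return 0;
  let sumT = 0;
  let sumU = 0;
  samples.forEach((s) => {
    sumT += s.tHours;
    sumU += s.usageMi;
  });
  const meanT = sumT / n;
  const meanU = sumU / n;
  let num = 0;
  let den = 0;
  samples.forEach((s) => {
    const dt = s.tHours - meanT;
    num += dt * (s.usageMi - meanU);
    den += dt * dt;
  });
  return den === 0 ? 0 : num / den;
}

// Flags a pod whose memory grows faster than the (hypothetical) threshold.
function isSustainedGrowth(samples: MemorySample[], thresholdMiPerHour: number): boolean {
  return slopeMiPerHour(samples) > thresholdMiPerHour;
}

// Post-deploy: linear interpolation between 180Mi (04:00) and 500Mi (09:00).
const postDeploy: MemorySample[] = [180, 244, 308, 372, 436, 500].map(
  (usageMi, i) => ({ tHours: i, usageMi }),
);
// Pre-deploy: stable at 200Mi.
const preDeploy: MemorySample[] = [200, 200, 200, 200, 200, 200].map(
  (usageMi, i) => ({ tHours: i, usageMi }),
);

console.log(slopeMiPerHour(postDeploy));        // 64
console.log(isSustainedGrowth(postDeploy, 20)); // true
console.log(isSustainedGrowth(preDeploy, 20));  // false
```

A slope-based check complements a fixed usage threshold: it can fire while usage is still well below the limit, at the cost of needing a window long enough to ignore normal fluctuation.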
Response
- Rollback to v3.6
- Increased memory limit temporarily
What Went Well
- Quick detection via pod crash alerts
- Rollback was straightforward
What Went Poorly
- No memory leak detection in staging
- Cache added without review of memory constraints
Action Items
- [ ] Add eviction to cache (P0)
- [ ] Add memory leak detection tests (P1)
- [ ] Set up memory usage alerts at 80% limit (P1)
Follow-ups
- P0: Add eviction policy (TTL/LRU) to in-process cache in order-svc. (owner: service owner)
- P1: Add memory leak detection test to CI pipeline (heap dump comparison). (owner: platform team)
- P1: Set up alert for memory usage >80% of limit on order-svc pods. (owner: on-call SRE)
- P2: Review all in-process caches across services for unbounded growth. (owner: platform team)
- P2: Update deployment checklist to include memory profiling for new features. (owner: service owner)
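To give a feel for how much warning the proposed 80% alert would have provided, the sketch below computes the alert threshold and the lead time before the limit is reached. It assumes linear growth at 64 Mi/h, a rate derived from the reported trend (180Mi to 500Mi over five hours); the function names are illustrative.

```typescript
// Usage (Mi) at which an alert set at `fraction` of the memory limit fires.
function alertThresholdMi(limitMi: number, fraction: number): number {
  return limitMi * fraction;
}

// Whole minutes between the alert firing and the limit being reached,
// assuming memory keeps growing linearly at `growthMiPerHour`.
function leadTimeMinutes(limitMi: number, fraction: number, growthMiPerHour: number): number {
  const remainingMi = limitMi - alertThresholdMi(limitMi, fraction);
  return Math.round((remainingMi / growthMiPerHour) * 60);
}

console.log(alertThresholdMi(512, 0.8));    // 409.6
console.log(leadTimeMinutes(512, 0.8, 64)); // 96
```

With the 512Mi limit, the alert would fire at roughly 410Mi, about an hour and a half before the first OOMKilled pod under the assumed growth rate.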
Similar past incidents
lexical match (pg_trgm)
- 82%: [Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 75%: [Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 32%: [Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 30%: [Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts