[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
service: order-svccreated: 6/2/2026, 11:52:55 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason. Kubernetes: - Pods: 12 total, 3 currently OOMKilled, 9 running - Memory limit: 512Mi, requests: 256Mi - Recent restarts/hour: 18 (up from baseline of 0–1) - HPA scaling has hit max replicas (12) Memory trend (last 24h): - Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash - Pre-deploy baseline: stable at 200Mi indefinitely Deploy history: - 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08) - Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)" Logs: - No FATAL/ERROR pattern; pods exit silently - Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared On-call: - 09:12 — first OOMKilled pod - 09:14 — pager - 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled every ~20 minutes due to a memory leak introduced in v3.7. The leak is caused by an unbounded in-process request cache (Map-based, no eviction) that grows linearly with request volume. 3 of 12 pods are in CrashLoopBackOff, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%.
Severity reasoning: User-facing outage: checkout success rate >1% (3.2% error rate) for >5 minutes, revenue path broken (checkout failures), and error rate exceeds 1% threshold. Meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·10284ms·1756↑ / 1524↓ tok·$0.00215
Root cause hypotheses
- highUnbounded in-process request cache introduced in v3.7 causes memory leak
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over 5 hours, then OOM. Changelog confirms 'Map-based, no eviction'.
- mediumMemory limit too low for normal traffic after v3.7 changes
Evidence: Memory limit is 512Mi, but pre-deploy baseline was 200Mi. New JSON parser may increase per-request memory, but heap profile shows cache dominates.
- lowExternal dependency (e.g., database) causing request accumulation
Evidence: No slow queries or connection pool exhaustion in logs. Cache is in-process, not external.
Investigation checklist
- Check memory usage trend per pod
kubectl top pod -n prod -l app=order-svc --sort-by=memory | head -15Expected: Memory usage near 500Mi for running pods, confirming leak
- Inspect heap dump for cache size
kubectl exec -n prod deploy/order-svc -- sh -c 'jcmd 1 GC.heap_dump /tmp/heap.hprof && jhat /tmp/heap.hprof'Expected: Map entries with request IDs, no eviction logic
- Check deploy rollout status
kubectl rollout status deploy/order-svc -n prod --revision=currentExpected: v3.7 deployed at 04:00 UTC
- Check HPA metrics
kubectl describe hpa order-svc -n prodExpected: Current replicas at max (12), memory utilization > 90%
- Check if cache is configurable via feature flag
kubectl exec -n prod deploy/order-svc -- env | grep CACHEExpected: No env var for cache eviction or size limit
Mitigation plan
Rollback order-svc to v3.6 (pre-cache version) to stop memory leak immediately
Risk: Brief traffic disruption during rollback; v3.6 may have different JSON parser but no cache. Safe rollback via Kubernetes rolling update.
Rollback: Re-deploy v3.7 if rollback causes issues (unlikely). Command: kubectl rollout undo deploy/order-svc -n prod
If rollback not possible, scale up replicas temporarily to reduce per-pod memory pressure
Risk: Only delays OOM; does not fix leak. May hit cluster resource limits.
Rollback: Scale back down after fix. Command: kubectl scale deploy/order-svc -n prod --replicas=20
Add memory-based HPA to auto-scale before OOM
Risk: May cause rapid scaling; still need fix.
Rollback: Remove HPA. Command: kubectl delete hpa order-svc -n prod
Customer impact
Approximately 25% of checkout requests are failing or timing out. Users may see 'order failed' errors or slow page loads. Estimated 3% of all traffic affected. No data loss expected.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 04:00: order-svc v3.7 deployed with unbounded cache
- 09:12: First OOMKilled pod
- 09:14: Pager triggered
- 09:16: Incident declared (SEV1)
- [FILL IN] Rollback initiated
- [FILL IN] Service restored
Impact
- Checkout success rate dropped from 99.7% to 96.8% for ~20 minutes
- 3 of 12 pods in CrashLoopBackOff
- p99 latency degraded
Root Cause
Memory leak from unbounded in-process request cache (Map-based, no eviction) introduced in v3.7.
Detection
Pager triggered by OOMKilled pod alerts. Success rate drop detected via monitoring.
Response
- Rollback to v3.6
- [FILL IN] Additional steps
What Went Well
- Quick detection via pod OOM alerts
- Heap dump analysis identified root cause
What Went Poorly
- No memory leak testing in staging
- No cache eviction or size limit
- HPA max replicas reached
Action Items
- [FILL IN] Add cache eviction (TTL or LRU)
- [FILL IN] Add memory limit alerts at 80%
- [FILL IN] Add load test with sustained traffic
Follow-ups
- P0Add TTL-based eviction to in-process cache— service owner
- P1Set memory limit alert at 80% (400Mi) for order-svc— on-call SRE
- P1Add load test with sustained traffic to staging— platform team
- P2Review code review process for memory-sensitive changes— service owner
- P2Increase HPA max replicas to 20— platform team
Similar past incidents
lexical match (pg_trgm)
- 85%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 65%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 31%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 28%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted