[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
service: order-svccreated: 6/2/2026, 11:51:02 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason. Kubernetes: - Pods: 12 total, 3 currently OOMKilled, 9 running - Memory limit: 512Mi, requests: 256Mi - Recent restarts/hour: 18 (up from baseline of 0–1) - HPA scaling has hit max replicas (12) Memory trend (last 24h): - Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash - Pre-deploy baseline: stable at 200Mi indefinitely Deploy history: - 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08) - Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)" Logs: - No FATAL/ERROR pattern; pods exit silently - Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared On-call: - 09:12 — first OOMKilled pod - 09:14 — pager - 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are OOM-killed every ~20 minutes due to a memory leak introduced in v3.7. A new in-process request cache (Map-based, no eviction) grows unboundedly, consuming all available memory. 3 of 12 pods are in CrashLoopBackOff, p99 latency degraded, and checkout success rate dropped from 99.7% to 96.8%.
Severity reasoning: Partial user-facing degradation (3% timeout rate, checkout success rate drop) with revenue impact potential. Full outage avoided by remaining pods, but risk of cascading failure if all pods OOM. Not SEV1 because service is still serving most traffic.
deepseek-chat·prompt v1·output: en·9783ms·1375↑ / 1415↓ tok·$0.00193
Root cause hypotheses
- highIn-process request cache (Map) introduced in v3.7 has no eviction policy, causing unbounded memory growth.
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over 5 hours post-deploy.
- lowfast-json-stringify library has a memory leak in certain payload patterns.
Evidence: Changelog mentions switching JSON parser; however, heap profile points to Map, not stringify buffers. Less likely.
- mediumIncreased traffic volume causing more concurrent requests, but cache still should be bounded.
Evidence: HPA at max replicas (12) suggests high load, but memory growth is linear and correlated with deploy time, not traffic spikes.
Investigation checklist
- Confirm memory leak source by inspecting heap dump from a running pod.
kubectl exec -n <namespace> <pod-name> -- jcmd 1 GC.heap_dump /tmp/heap.hprof && kubectl cp <namespace>/<pod-name>:/tmp/heap.hprof ./heap.hprofExpected: Heap dump shows Map entries with request IDs as keys, no eviction logic.
- Check if the cache is being cleared on response or has a TTL.
kubectl exec -n <namespace> <pod-name> -- grep -r 'cache' /app --include='*.js' | grep -i 'delete|clear|evict|ttl'Expected: No eviction or TTL logic found for the cache.
- Verify memory limit and current usage across all pods.
kubectl top pod -n <namespace> -l app=order-svc --sort-by=memoryExpected: Memory usage near limit (512Mi) for most pods, with some already OOM.
- Check if rolling back to previous version resolves the issue.
kubectl rollout undo deployment/order-svc -n <namespace>Expected: Pods restart with old image, memory stabilizes at ~200Mi.
Mitigation plan
Rollback order-svc to v3.6 (previous stable version) immediately.
Risk: Brief increase in latency during rollback; no data loss. Safe operation.
Rollback: Re-deploy v3.7 if rollback causes worse issues (unlikely).
If rollback not possible, increase memory limit to 1Gi temporarily and add a TTL-based eviction to the cache.
Risk: Higher resource usage; may still OOM if leak continues. Cache fix requires code change and redeploy.
Rollback: Revert memory limit to 512Mi after fix is deployed.
Customer impact
Approximately 3% of checkout requests are timing out, affecting a subset of users. Remaining pods serve most traffic. Checkout success rate dropped from 99.7% to 96.8%. No data loss reported.
Postmortem draft
Postmortem: order-svc OOMKilled due to unbounded cache
Summary
- Date: 2025-04-10
- Duration: 09:12 UTC – mitigation at 09:XX UTC
- Severity: SEV2
- Impact: 3% timeout rate, checkout success rate 96.8%
Timeline
- 04:00 UTC: Deploy v3.7 (fast-json-stringify + in-process cache)
- 09:12 UTC: First OOMKilled pod
- 09:14 UTC: Pager triggered
- 09:16 UTC: Incident declared
- 09:XX UTC: Rollback initiated
Impact
- Users: ~3% checkout failures
- Revenue: potential loss from failed checkouts
Root Cause
In-process request cache (Map) with no eviction policy introduced in v3.7 caused unbounded memory growth.
What Went Well
- Monitoring detected OOM quickly
- Rollback was straightforward
What Went Poorly
- Code review missed missing eviction logic
- No memory leak detection in staging
Action Items
- [ ] Add TTL/eviction to cache (P0)
- [ ] Add memory leak detection in CI (P1)
- [ ] Review code review checklist for resource management (P2)
Follow-ups
- P0Add eviction policy (TTL or LRU) to the in-process cache in order-svc.— service owner
- P1Set up memory usage alerts and heap dump on OOM for faster diagnosis.— on-call SRE
- P1Add load testing with memory profiling to staging deployment pipeline.— platform team
- P2Review code review checklist to include resource management (caches, connections).— service owner
Similar past incidents
lexical match (pg_trgm)
- 78%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 67%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 31%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 31%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 30%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts