[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
service: order-svccreated: 6/2/2026, 11:36:32 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason. Kubernetes: - Pods: 12 total, 3 currently OOMKilled, 9 running - Memory limit: 512Mi, requests: 256Mi - Recent restarts/hour: 18 (up from baseline of 0–1) - HPA scaling has hit max replicas (12) Memory trend (last 24h): - Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash - Pre-deploy baseline: stable at 200Mi indefinitely Deploy history: - 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08) - Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)" Logs: - No FATAL/ERROR pattern; pods exit silently - Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared On-call: - 09:12 — first OOMKilled pod - 09:14 — pager - 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are OOM-killed every ~20 minutes due to a memory leak introduced in v3.7. A new in-process request cache (Map-based, no eviction) holds ~340Mi of heap, causing pods to hit the 512Mi limit and crash. 3 of 12 pods are in CrashLoopBackOff; p99 latency is degraded and ~3% of requests time out.
Severity reasoning: Significant degradation for a subset of users (3% timeout, p99 latency degraded) but not a full outage; revenue impact possible but not confirmed. Scope is limited to order-svc; other services unaffected.
deepseek-chat·prompt v1·output: en·9938ms·1375↑ / 1474↓ tok·$0.00199
Root cause hypotheses
- highMemory leak from unbounded in-process request cache introduced in v3.7
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Changelog confirms 'added in-process request cache (Map-based, no eviction)'.
- mediumIncreased request rate causing cache to grow faster than expected
Evidence: Memory growth linear from 180Mi to 500Mi over 5 hours; pre-deploy baseline stable at 200Mi. Request rate may have increased, but cache has no eviction so it grows unbounded regardless.
- lowMemory fragmentation due to fast-json-stringify library
Evidence: Changelog mentions switching JSON parser to fast-json-stringify. However, heap profile points to Map, not JSON buffers.
Investigation checklist
- Confirm the cache is the primary memory consumer by examining heap dump from a running pod
kubectl exec -n <namespace> <pod-name> -- jcmd 1 GC.heap_dump /tmp/heap.hprof && kubectl cp <namespace>/<pod-name>:/tmp/heap.hprof ./heap.hprofExpected: Heap dump shows Map with ~340Mi retained; no other large objects.
- Check if the cache has any eviction or size limit in code
grep -r 'Map' --include='*.js' ./order-svc/ | grep -i 'cache'Expected: No eviction logic found; Map is used without size limit or TTL.
- Verify memory limit and request settings for pods
kubectl describe pod -n <namespace> -l app=order-svc | grep -A2 'Limits'Expected: Memory limit: 512Mi, request: 256Mi.
- Check if HPA is at max replicas and if scaling would help
kubectl get hpa -n <namespace> order-svcExpected: Current replicas: 12/12 (max). Scaling won't help because each pod leaks memory.
Mitigation plan
Rollback order-svc to v3.6 (pre-cache version) to stop memory leak immediately
Risk: Brief period of no cache; may increase latency slightly but avoids OOM. Rollback is safe as v3.6 was stable.
Rollback: Re-deploy v3.7 if rollback causes issues; but v3.6 was stable, so unlikely.
If rollback not possible, increase memory limit to 1Gi as temporary workaround
Risk: Higher resource usage; may cause node pressure. Only buy time until fix is deployed.
Rollback: Revert limit to 512Mi after cache fix is deployed.
Add eviction policy to cache (e.g., TTL or LRU) and deploy v3.8
Risk: Cache miss rate may increase; but prevents unbounded growth.
Rollback: Revert to v3.7 if cache eviction causes performance regression.
Customer impact
Approximately 3% of checkout requests are timing out; p99 latency is degraded. Users may experience slow or failed checkout. No data loss. ETA for full recovery: 30 minutes after rollback.
Postmortem draft
Postmortem: order-svc OOMKilled due to memory leak
Summary
- Date: YYYY-MM-DD
- Duration: 09:12 UTC – 09:42 UTC (30 min)
- Severity: SEV2
- Impact: 3% checkout timeout, p99 latency degraded
Timeline
- 04:00 UTC: Deploy v3.7 (cache added)
- 09:12 UTC: First OOMKilled pod
- 09:14 UTC: Pager triggered
- 09:16 UTC: Incident declared
- 09:20 UTC: Rollback initiated
- 09:42 UTC: All pods stable, metrics normal
Root Cause
Unbounded in-process request cache (Map) introduced in v3.7 without eviction. Cache grew until memory limit exceeded, causing OOM.
What Went Well
- Quick detection via pager
- Heap dump available
- Rollback was fast
What Went Poorly
- No code review caught missing eviction
- No memory limit testing before deploy
- HPA max replicas masked issue
Action Items
- [ ] Add TTL eviction to cache (P0)
- [ ] Add memory usage alert for order-svc (P1)
- [ ] Add load test with memory profiling to CI (P2)
Follow-ups
- P0Add TTL eviction to in-process cache and deploy fix— service owner
- P1Add memory usage alert (e.g., >80% limit) for order-svc— on-call SRE
- P2Add load test with memory profiling to CI pipeline— platform team
- P2Review code review process for missing eviction logic— service owner
Similar past incidents
lexical match (pg_trgm)
- 94%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 65%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 27%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 27%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 26%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.