[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
service: order-svccreated: 6/2/2026, 11:51:16 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason. Kubernetes: - Pods: 12 total, 3 currently OOMKilled, 9 running - Memory limit: 512Mi, requests: 256Mi - Recent restarts/hour: 18 (up from baseline of 0–1) - HPA scaling has hit max replicas (12) Memory trend (last 24h): - Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash - Pre-deploy baseline: stable at 200Mi indefinitely Deploy history: - 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08) - Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)" Logs: - No FATAL/ERROR pattern; pods exit silently - Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared On-call: - 09:12 — first OOMKilled pod - 09:14 — pager - 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled every ~20 minutes due to a memory leak introduced in v3.7. The leak is caused by an in-process request cache (Map-based, no eviction) that grows unboundedly, consuming ~340Mi per pod. 3 of 12 pods are in CrashLoopBackOff, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%.
Severity reasoning: User-facing degradation (3% request timeouts, checkout success rate drop) but not a full outage; revenue impact likely but limited. Scope: subset of users hitting affected pods. Reversibility: rollback or cache fix can restore service.
deepseek-chat·prompt v1·output: en·10984ms·1375↑ / 1550↓ tok·$0.00208
Root cause hypotheses
- highIn-process request cache (Map) introduced in v3.7 has no eviction policy, causing unbounded memory growth.
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID; memory grows linearly from 180Mi to 500Mi over 5 hours post-deploy.
- mediumfast-json-stringify library has a memory leak in certain payload patterns.
Evidence: Changelog mentions switching JSON parser; no direct evidence of leak, but possible. Memory growth correlates with deploy time.
- lowNode-level memory pressure or kernel bug causing OOM kills at lower thresholds.
Evidence: Memory limit is 512Mi; pods consistently crash at ~500Mi. No other workloads on node show similar behavior.
Investigation checklist
- Check memory usage per pod and identify pods with highest memory.
kubectl top pod -n <namespace> --selector=app=order-svc --sort-by=memoryExpected: Pods near 500Mi memory usage; running pods show >400Mi.
- Verify the in-process cache is the source of memory leak by inspecting heap dump from a running pod.
kubectl exec <pod> -- jcmd 1 GC.heap_dump /tmp/heap.hprof && kubectl cp <pod>:/tmp/heap.hprof ./heap.hprofExpected: Heap dump shows Map entries with request IDs, no eviction logic.
- Check if the cache has a size limit or TTL in code.
grep -r 'Map' /path/to/order-svc/src/ --include='*.js' | grep -i 'cache'Expected: No eviction or size limit found; Map is used without cleanup.
- Confirm the deploy time correlation with memory growth.
kubectl rollout history deployment/order-svcExpected: Revision 3.7 deployed at 04:00 UTC.
- Check if fast-json-stringify has known memory issues.
npm view fast-json-stringify versions --json | tail -5Expected: Version used is latest; no known CVEs but check GitHub issues.
Mitigation plan
Rollback order-svc to previous version v3.6 to immediately stop memory leak.
Risk: Rollback may cause brief downtime or incompatibility with downstream services; rolling update is safe.
Rollback: Re-deploy v3.7 using 'kubectl rollout undo deployment/order-svc'.
If rollback not possible, increase memory limit to 1Gi as temporary workaround.
Risk: Higher memory usage may cause node pressure; not a fix for leak.
Rollback: Revert memory limit to 512Mi after fix.
Scale up HPA max replicas to 15 to distribute load while pods restart.
Risk: More pods may increase database connection pool pressure.
Rollback: Revert HPA max replicas to 12 after stabilization.
Customer impact
Approximately 3% of checkout requests are timing out or failing. Users may see errors during payment processing. Service is partially degraded; no data loss expected.
Postmortem draft
Postmortem: order-svc OOMKilled due to memory leak
Summary
- Date: YYYY-MM-DD
- Duration: 09:12 UTC - resolved at XX:XX UTC
- Severity: SEV2
Timeline
- 04:00 UTC: Deploy v3.7 (in-process cache, JSON parser change)
- 09:12 UTC: First OOMKilled pod
- 09:14 UTC: Pager triggered
- 09:16 UTC: On-call acknowledges
- XX:XX UTC: Rollback initiated
- XX:XX UTC: Service stable
Impact
- 3% request timeouts
- Checkout success rate: 99.7% → 96.8%
- 3 pods in CrashLoopBackOff
Root Cause
In-process request cache (Map) with no eviction policy introduced in v3.7 caused unbounded memory growth.
What Went Well
- Monitoring detected OOMKilled quickly
- Heap dump analysis identified root cause
What Went Poorly
- Code review missed missing eviction logic
- No memory leak tests in CI
Action Items
- [ ] Add eviction policy to cache (TTL or LRU)
- [ ] Add memory leak detection tests
- [ ] Review code for similar patterns
- [ ] Add memory usage alerting
Blameless Conclusion
This was a process gap: no memory leak testing and insufficient code review for caching logic.
Follow-ups
- P0Add eviction policy (TTL or LRU) to in-process cache in order-svc.— service owner
- P1Add memory leak detection tests (e.g., heap snapshot comparison) to CI pipeline.— platform team
- P1Review all services for similar unbounded caches.— service owner
- P2Set up memory usage alerting at 80% of limit for all services.— on-call SRE
- P2Update deployment checklist to include memory profiling before rollout.— platform team
Similar past incidents
lexical match (pg_trgm)
- 85%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 68%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 31%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%
[datadog] checkout p99 8s
checkout p99 8s