[Eval][v3][en] Order service OOM crashloop following v3.7 deploy
service: order-svccreated: 6/2/2026, 11:54:19 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason. Kubernetes: - Pods: 12 total, 3 currently OOMKilled, 9 running - Memory limit: 512Mi, requests: 256Mi - Recent restarts/hour: 18 (up from baseline of 0–1) - HPA scaling has hit max replicas (12) Memory trend (last 24h): - Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash - Pre-deploy baseline: stable at 200Mi indefinitely Deploy history: - 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08) - Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)" Logs: - No FATAL/ERROR pattern; pods exit silently - Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared On-call: - 09:12 — first OOMKilled pod - 09:14 — pager - 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled every ~20 minutes due to a memory leak introduced in v3.7, which added an in-process request cache (Map-based, no eviction) that grows unboundedly. The leak causes pods to crash, leading to a restart loop and degraded p99 latency with ~3% request timeouts. The incident is ongoing with 3 of 12 pods currently down.
Severity reasoning: User-facing outage: error rate > 1% for >5 min (checkout success rate dropped from 99.7% to 96.8%, exceeding 1% threshold). Revenue path broken (checkout failures). SEV1 criteria met.
deepseek-chat·prompt v3·output: en·10271ms·2007↑ / 1506↓ tok·$0.00220
Root cause hypotheses
- highIn-process request cache (Map-based, no eviction) added in v3.7 leaks memory by never releasing entries.
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Changelog confirms 'added in-process request cache (Map-based, no eviction)'. Memory grows linearly from 180Mi to 500Mi over 5 hours.
- lowJSON parser switch to fast-json-stringify causes memory fragmentation or retention.
Evidence: Changelog mentions parser change. However, heap profile points to Map as primary consumer. No direct evidence of parser leak.
- lowExternal dependency (e.g., database connection pool) not releasing resources, causing memory growth.
Evidence: No evidence of connection pool issues in logs. Memory growth correlates with deployment time, not external factors.
Investigation checklist
- Confirm memory leak source by examining heap dump from a running pod.
kubectl exec -n prod order-svc-pod-name -- jcmd 1 GC.heap_dump /tmp/heap.hprof && kubectl cp prod/order-svc-pod-name:/tmp/heap.hprof ./heap.hprofExpected: Heap dump shows Map with many entries; no eviction logic.
- Check memory usage trend across all pods to confirm leak is widespread.
kubectl top pod -n prod -l app=order-svc --sort-by=memory | head -20Expected: All pods show memory usage >400Mi, growing over time.
- Verify that the cache is the only significant memory consumer by analyzing heap histogram.
kubectl exec -n prod order-svc-pod-name -- jcmd 1 GC.class_histogram | grep -i mapExpected: Map entries dominate memory usage.
- Check if there is a feature flag to disable the cache.
kubectl get configmap order-svc-config -n prod -o yaml | grep -i cacheExpected: Config may have a flag like 'cache.enabled: true'.
Mitigation plan
Immediately disable the in-process cache via feature flag or environment variable, then restart pods.
Risk: Temporarily increases latency due to cache miss, but prevents OOM. No data loss.
Rollback: Re-enable cache by reverting config change and restarting pods.
If no feature flag, roll back order-svc to v3.6 (previous version) using kubectl rollout undo.
Risk: Rollback may cause brief downtime during pod replacement. Safe operation.
Rollback: Re-apply v3.7 deployment if needed.
Increase memory limit to 1Gi as a temporary buffer to reduce crash frequency.
Risk: May mask the leak and increase resource usage. Could cause node pressure.
Rollback: Revert memory limit to 512Mi.
Customer impact
Approximately 3% of checkout requests are timing out, affecting users trying to place orders. The service is partially degraded; some users may see errors or delays. No data loss has occurred.
Postmortem draft
Postmortem: order-svc OOMKilled due to memory leak
Summary
On [DATE], order-svc v3.7 was deployed at 04:00 UTC, introducing an in-process request cache without eviction. This caused a memory leak leading to OOMKilled pods starting at 09:12 UTC. The incident was mitigated by [MITIGATION].
Timeline (UTC)
- 04:00 - v3.7 deployed
- 09:12 - First OOMKilled pod
- 09:14 - Pager triggered
- 09:16 - Checkout success rate drops to 96.8%
- [FILL IN] - Mitigation applied
- [FILL IN] - All pods stable
Impact
- 3% of checkout requests timed out
- 3 of 12 pods were OOMKilled at peak
- p99 latency degraded
Root Cause
A Map-based request cache added in v3.7 had no eviction policy, causing unbounded memory growth.
Detection
Alerted by OOMKilled pod events and success rate drop. No prior warning from memory trend.
Response
- [FILL IN] steps taken
What Went Well
- Quick detection via pager
- Heap dump analysis identified root cause
What Went Poorly
- No memory limit monitoring before crash
- Cache added without review of memory implications
Action Items
- Add eviction policy to cache (LRU or TTL)
- Set up memory usage alerts at 80% of limit
- Add load test with sustained traffic before deploying caching changes
Follow-ups
- P0Add eviction policy to in-process cache (LRU or TTL-based).— order-svc team
- P1Set up memory usage alert at 80% of pod memory limit.— platform team
- P1Add load test with sustained traffic to detect memory leaks before deployment.— QA team
- P2Review deployment process to include memory profiling for changes that add caching.— platform team
Similar past incidents
lexical match (pg_trgm)
- 69%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 69%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 30%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 28%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 27%
[datadog] checkout p99 8s
checkout p99 8s