[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
service: order-svc · created: 6/2/2026, 11:52:41 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC: order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12: first OOMKilled pod
- 09:14: pager
- 09:16: service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled in a restart loop due to a memory leak introduced in v3.7: an unbounded in-process request cache (Map) with no eviction. 3 of 12 pods are down, and checkout success rate has dropped from 99.7% to 96.8%.
Severity reasoning: User-facing impact on the revenue path: the checkout failure rate is 3.2% (success rate 96.8%), above the 1% error-rate threshold.
deepseek-chat·prompt v2·output: en·9227ms·1756↑ / 1383↓ tok·$0.00200
Root cause hypotheses
- high: Unbounded in-process request cache (Map) introduced in v3.7 causes memory exhaustion
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over 5 hours.
- low: JSON parser switch (fast-json-stringify) has a memory leak
Evidence: No direct evidence; memory growth pattern is linear and consistent with cache, not parser. Parser is unlikely to hold state across requests.
- medium: Increased traffic volume causing normal memory usage to exceed limits
Evidence: HPA is at max replicas (12) and per-pod memory is approaching the 512Mi limit. However, memory was stable at 200Mi before the deploy, and the incident context reports no change in traffic.
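The linear-growth evidence can be checked with a quick extrapolation. The following is a minimal sketch that uses only the figures from the incident context (180Mi at 04:00, 500Mi at 09:00, 512Mi limit); the function names are illustrative and not part of any existing tooling.

```python
def minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes since midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)


def project_limit_time(t0, mem0_mi, t1, mem1_mi, limit_mi):
    """Linearly extrapolate two (time, memory) samples to the memory limit.

    Returns (growth rate in Mi per hour, 'HH:MM' at which the limit is reached).
    """
    elapsed_min = minutes(t1) - minutes(t0)
    growth_mi = mem1_mi - mem0_mi
    rate_mi_per_hour = growth_mi * 60 / elapsed_min
    # Minutes since midnight at which the straight line crosses the limit.
    eta_min = int(minutes(t0) + (limit_mi - mem0_mi) * elapsed_min / growth_mi)
    return rate_mi_per_hour, f"{eta_min // 60:02d}:{eta_min % 60:02d}"


rate, eta = project_limit_time("04:00", 180, "09:00", 500, 512)
print(f"{rate:.0f} Mi/h, limit reached at about {eta} UTC")
# prints "64 Mi/h, limit reached at about 09:11 UTC"
```

The projected 09:11 lines up with the first OOMKilled pod at 09:12, which fits a per-pod leak that started at the deploy better than it fits a traffic spike.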
Investigation checklist
- Check memory usage per pod over time to confirm leak pattern
kubectl top pod -n prod -l app=order-svc --sort-by=memory --no-headers | head -12
Expected: Pods that have not restarted since the deploy sit near 500Mi; recently restarted pods start low and climb over time
- Inspect heap dump from a running pod to identify largest allocations
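kubectl exec -n prod deploy/order-svc -- sh -c 'curl -s localhost:6060/debug/pprof/heap?debug=1 | head -100'
Expected: Top allocation is a Map keyed on request ID whose entry count only grows
Note: /debug/pprof on port 6060 is a Go convention. The changelog (fast-json-stringify, Map) suggests order-svc is a Node.js service, so this endpoint may not exist; in that case capture a V8 heap snapshot instead.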
- Check deploy changelog to confirm cache introduction
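kubectl rollout history deployment/order-svc -n prod | grep -i 'cache\|eviction\|map'
Expected: The CHANGE-CAUSE entry for the latest revision mentions 'in-process request cache (Map-based, no eviction)', provided the changelog is recorded in the kubernetes.io/change-cause annotation. (--revision takes a revision number, not 'current'.)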
- Verify memory limit and request settings
kubectl get deployment order-svc -n prod -o jsonpath='{.spec.template.spec.containers[0].resources}'
Expected: limits: memory: 512Mi, requests: memory: 256Mi
- Check if HPA is at max replicas and if scaling would help
kubectl get hpa order-svc -n prod -o wide
Expected: REPLICAS equals MAXPODS (12). Adding replicas would not help, because each pod leaks independently.
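The first checklist item can be automated by parsing the output of `kubectl top pod --no-headers` (columns: NAME, CPU(cores), MEMORY(bytes)) and flagging pods close to the limit. This is a rough sketch: the 90% threshold, the helper names, and the sample pod names and values are all assumptions made for illustration.

```python
def parse_mem_mi(value: str) -> float:
    """Convert a kubectl memory quantity such as '498Mi' or '1Gi' to Mi."""
    units = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}
    for suffix, factor in units.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {value!r}")


def pods_near_limit(top_output: str, limit_mi: float, threshold: float = 0.9):
    """Return (pod, Mi) pairs whose memory is at or above threshold * limit."""
    flagged = []
    for line in top_output.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        name, mem = parts[0], parts[2]
        mem_mi = parse_mem_mi(mem)
        if mem_mi >= threshold * limit_mi:
            flagged.append((name, mem_mi))
    return flagged


# Hypothetical sample output; pod names and values are illustrative only.
sample = """\
order-svc-aaaaa   41m   498Mi
order-svc-bbbbb   38m   471Mi
order-svc-ccccc   12m   96Mi
"""
print(pods_near_limit(sample, limit_mi=512))
```

With a 512Mi limit the cutoff is 460.8Mi, so the first two sample pods are flagged and the recently restarted one is not.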
Mitigation plan
Roll back order-svc to the previous version (v3.6) to remove the leaky cache, for example with 'kubectl rollout undo deployment/order-svc -n prod'
Risk: Rollback may cause brief downtime during pod replacement. Ensure traffic drains gracefully.
Rollback: Re-deploy v3.7 by running 'kubectl rollout undo deployment/order-svc -n prod --to-revision=<v3.7 revision>'. Note that this reintroduces the leak.
Increase memory limit to 1Gi temporarily to buy time for investigation
Risk: Higher memory usage may cause node pressure; monitor node memory. This is a band-aid, not a fix.
Rollback: Revert memory limit to 512Mi by editing deployment: 'kubectl edit deployment order-svc -n prod'
Add a TTL-based eviction to the cache or disable it via feature flag if available
Risk: If feature flag exists, disabling may cause performance regression but prevents OOM. If not, code change required.
Rollback: Re-enable cache via feature flag or revert code change
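The TTL-based eviction mitigation could look roughly like the sketch below. It is written in Python for brevity, although the changelog suggests the service itself is JavaScript, and it is not the service's actual cache: the class name, the size bound, and the TTL are assumptions. It combines a per-entry TTL with a hard cap on entry count, so memory stays bounded even if entries are written once and never read again.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """Cache with a maximum size (LRU eviction) and a per-entry TTL."""

    def __init__(self, max_entries: int, ttl_seconds: float, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop it so memory is reclaimed
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)


cache = BoundedTTLCache(max_entries=2, ttl_seconds=30)
for request_id in ("req-1", "req-2", "req-3"):
    cache.set(request_id, {"status": "ok"})
print(len(cache), cache.get("req-1"))  # prints "2 None": req-1 was evicted
```

Expired entries are only removed when they are read, so the size cap is what actually bounds memory for a cache keyed on request ID, where most keys are never looked up twice.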
Customer impact
Approximately 3% of checkout requests are failing (success rate down from 99.7% to 96.8%) while 3 of 12 pods (25%) are down. Users may see 'Checkout failed' errors or long load times. No ETA yet.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
[FILL IN]
Impact
[FILL IN]
Root Cause
[FILL IN]
Detection
[FILL IN]
Response
[FILL IN]
What Went Well
[FILL IN]
What Went Poorly
[FILL IN]
Action Items
[FILL IN]
Follow-ups
- P1: Add memory leak detection test to CI/CD pipeline (e.g., load test with memory profiling) (owner: platform team)
- P0: Implement cache eviction policy (TTL or LRU) for in-process caches (owner: service owner)
- P1: Set up memory usage alerts with trend detection (linear growth over hours) (owner: on-call SRE)
- P2: Review all in-process caches across services for unbounded growth (owner: platform team)
- P1: Add heap dump collection on OOM to improve debugging (owner: platform team)
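The trend-detection alert in the follow-ups could be prototyped as a least-squares fit over recent memory samples, alerting when the fitted line would reach the limit within a chosen horizon. This is a sketch under stated assumptions: the 4-hour horizon and the function names are hypothetical, and the sample points are interpolated from the reported trend (180Mi at 04:00, growing 64Mi per hour) rather than taken from real metrics.

```python
def slope_mi_per_hour(samples):
    """Least-squares slope of (hours, Mi) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def should_alert(samples, limit_mi, horizon_hours):
    """Alert when the fitted trend would reach the limit within the horizon."""
    slope = slope_mi_per_hour(samples)
    if slope <= 0:
        return False  # flat or shrinking memory never reaches the limit
    _last_t, last_m = samples[-1]
    hours_left = (limit_mi - last_m) / slope
    return hours_left <= horizon_hours


# Illustrative samples: (hours since deploy, Mi).
leaking = [(0, 180), (1, 244), (2, 308)]
stable = [(0, 200), (1, 201), (2, 199)]
print(should_alert(leaking, limit_mi=512, horizon_hours=4))  # True
print(should_alert(stable, limit_mi=512, horizon_hours=4))   # False
```

On the leaking samples the fitted slope is 64Mi per hour, which leaves about 3.2 hours of headroom at the two-hour mark, so an alert of this kind would have fired roughly three hours before the first OOMKilled pod.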
Similar past incidents
lexical match (pg_trgm)
- 78%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 61%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 29%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 28%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted