[Eval][v3][en] Order service OOM crashloop following v3.7 deploy
service: order-svc
created: 6/2/2026, 11:54:05 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC - order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12 - first OOMKilled pod
- 09:14 - pager
- 09:16 - service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled every ~20 minutes due to a memory leak introduced in v3.7. A new in-process request cache (Map-based, no eviction) holds ~340Mi of heap, causing pods to hit the 512Mi memory limit and crash. 3 of 12 pods are in CrashLoopBackOff, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%.
Severity reasoning: User-facing outage: checkout success rate dropped from 99.7% to 96.8% (>1% error rate) for >5 minutes. Revenue path is broken (checkout failures). This meets SEV1 criteria.
deepseek-chat·prompt v3·output: en·11132ms·2007↑ / 1708↓ tok·$0.00242
Root cause hypotheses
- high: Memory leak from in-process request cache introduced in v3.7 (Map-based, no eviction).
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over 5 hours post-deploy, then crashes. Changelog confirms 'added in-process request cache (Map-based, no eviction)'.
- low: Increased request rate causing cache to grow faster than expected.
Evidence: No evidence of rate change; memory growth is steady and correlates with deploy time, not traffic spikes.
- low: JSON parser (fast-json-stringify) memory leak.
Evidence: Heap profile shows Map as dominant consumer, not JSON parser. No other memory growth pattern.
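The timing evidence for the leading hypothesis can be checked with simple arithmetic. The sketch below uses only the figures from the incident context and assumes growth stays linear after 09:00 UTC; the helper name `hoursUntilLimit` is illustrative, not part of any existing tooling.

```typescript
// Figures from the incident context. Linear growth is an assumption.
const startMi = 180;     // memory at deploy time (04:00 UTC)
const observedMi = 500;  // memory at 09:00 UTC
const elapsedHours = 5;  // 04:00 -> 09:00 UTC
const limitMi = 512;     // container memory limit

// Average growth rate in Mi per hour over the observed window.
const growthMiPerHour = (observedMi - startMi) / elapsedHours;

// Hours for a pod growing at ratePerHour to climb from fromMi to toMi.
function hoursUntilLimit(fromMi: number, toMi: number, ratePerHour: number): number {
  return (toMi - fromMi) / ratePerHour;
}

const minutesAfterDeploy = Math.round(
  hoursUntilLimit(startMi, limitMi, growthMiPerHour) * 60,
);

console.log(growthMiPerHour);    // 64
console.log(minutesAfterDeploy); // 311, i.e. about 09:11 UTC
```

A projected limit breach at roughly 09:11 UTC lines up with the first OOMKilled pod at 09:12, which is consistent with a leak that started at deploy time rather than with a traffic spike.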
Investigation checklist
- Confirm memory growth trend in Grafana for order-svc pods.
curl -sG 'http://grafana.example.com/api/datasources/proxy/1/api/v1/query_range' --data-urlencode 'query=avg(container_memory_working_set_bytes{namespace="prod",pod=~"order-svc.*"})' --data-urlencode 'start=2025-04-01T04:00:00Z' --data-urlencode 'end=2025-04-01T09:30:00Z' --data-urlencode 'step=60s' | jq '.data.result[0].values'
Expected: Linear increase in the per-pod average from ~180Mi to ~500Mi over 5 hours.
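- Capture a heap snapshot from a running pod to confirm Map size (assumes a Node.js runtime, which fast-json-stringify implies, running as PID 1 with --heapsnapshot-signal=SIGUSR2 and /app as the working directory; order-svc-xxx and SNAPSHOT_FILE are placeholders).
kubectl exec -n prod order-svc-xxx -- sh -c 'kill -USR2 1 && sleep 10 && ls /app/*.heapsnapshot'
kubectl cp prod/order-svc-xxx:/app/SNAPSHOT_FILE ./order-svc.heapsnapshot
Expected: Opening the snapshot in Chrome DevTools (Memory tab) shows a Map object holding ~340Mi, keyed by request ID.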
- Verify deploy timestamp and changelog for v3.7.
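kubectl rollout history deployment/order-svc -n prod | grep -F 'v3.7'
Expected: A revision whose CHANGE-CAUSE mentions v3.7 (present only if the change-cause annotation was set at deploy time). rollout history does not print timestamps or changelogs, so confirm the 04:00 UTC deploy time and the changelog entry ('switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)') in the deploy pipeline.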
- Check if cache eviction or TTL exists in code.
kubectl exec -n prod deploy/order-svc -- grep -r 'evict\|TTL\|expire' /app/src
Expected: No matches found, confirming no eviction logic.
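To judge whether the exported memory series is actually linear, not just higher at the end, a least-squares slope over the samples can help. This is a minimal sketch that assumes the samples arrive in the `[unixSeconds, "bytes"]` shape of a Prometheus range-query `values` array; `fitGrowthMiPerHour` is a hypothetical helper and the data below is synthetic, not taken from the incident.

```typescript
// One sample as found in a Prometheus range-query `values` array.
type Sample = [number, string];

const MI = 1024 * 1024;

// Ordinary least-squares slope of memory over time, in Mi per hour.
function fitGrowthMiPerHour(samples: Sample[]): number {
  const n = samples.length;
  if (n < 2) throw new Error("need at least two samples");
  // Convert to (hours, Mi) points.
  const points = samples.map(([t, v]) => ({ x: t / 3600, y: Number(v) / MI }));
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of points) {
    num += (p.x - meanX) * (p.y - meanY);
    den += (p.x - meanX) ** 2;
  }
  if (den === 0) throw new Error("samples must span more than one timestamp");
  return num / den;
}

// Synthetic hourly series: 180Mi growing by 64Mi per hour for 5 hours.
const synthetic: Sample[] = [0, 1, 2, 3, 4, 5].map(
  (h): Sample => [h * 3600, String((180 + 64 * h) * MI)],
);

console.log(fitGrowthMiPerHour(synthetic)); // 64 for this synthetic series
```

A slope near zero before 04:00 UTC and a steady positive slope afterwards would support the deploy-correlated leak over the traffic hypothesis.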
Mitigation plan
Rollback order-svc to v3.6 (previous stable version).
Risk: Brief traffic disruption during rollback; no data loss.
Rollback: Re-deploy v3.7 only if v3.6 itself causes issues (unlikely); doing so reintroduces the leak.
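If rollback is not possible, increase memory limit to 1Gi temporarily.
Risk: Masks the leak rather than fixing it. At the observed growth of ~64Mi/hour (180Mi to 500Mi in 5 hours), a 1Gi limit only delays OOM to roughly 13 hours after each pod start. Could also cause node memory pressure if multiple pods scale up.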
Rollback: Revert limit to 512Mi after fix is deployed.
Scale down HPA max replicas to 6 to reduce cluster load.
Risk: May increase latency under high traffic; could overwhelm remaining pods.
Rollback: Restore max replicas to 12.
Customer impact
Approximately 3% of checkout requests are failing or timing out. Users may see errors when placing orders. The issue started around 09:12 UTC and is ongoing.
Postmortem draft
Postmortem: order-svc OOMKilled due to memory leak
Summary
On 2025-04-01, order-svc v3.7 introduced a memory leak via an unbounded in-process request cache, causing pods to be OOMKilled every ~20 minutes. Checkout success rate dropped from 99.7% to 96.8%. Mitigated by rollback to v3.6.
Timeline (UTC)
- 04:00 - v3.7 deployed
- 09:12 - First OOMKilled pod
- 09:14 - Pager triggered
- 09:16 - Incident declared (SEV1)
- [FILL IN] - Rollback initiated
- [FILL IN] - All pods stable
Impact
- 3 of 12 pods in CrashLoopBackOff
- Checkout success rate: 96.8% (vs 99.7% baseline)
- p99 latency degraded
- Duration: [FILL IN] minutes
Root Cause
In-process request cache (Map) added in v3.7 without eviction or TTL. The cache grew unbounded, reaching ~340Mi of heap in the profiled pod; on top of the roughly 180–200Mi the service uses without it, this pushed pods past the 512Mi limit.
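For contrast with the unbounded Map, a bounded variant might look like the sketch below. It is an illustration under assumptions, not the order-svc code: the class name `BoundedCache`, the size cap, and the TTL values are all hypothetical, and the real cache's key and value types are not known from the incident data.

```typescript
// Hypothetical bounded cache: a size cap with LRU eviction plus a per-entry TTL.
class BoundedCache<K, V> {
  private readonly entries = new Map<K, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, mainly so expiry can be tested
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired entries are dropped on read
      return undefined;
    }
    // Map preserves insertion order, so re-inserting marks the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage with a fake clock: capacity 2, TTL 1000 ms.
let clock = 0;
const cache = new BoundedCache<string, string>(2, 1000, () => clock);
cache.set("req-1", "a");
cache.set("req-2", "b");
cache.get("req-1");      // touch req-1, so req-2 becomes least recently used
cache.set("req-3", "c"); // over capacity: evicts req-2
clock = 1500;            // past the TTL: remaining entries expire when read
```

The size cap is what bounds memory. The TTL only removes entries lazily on read, so the two are meant to be used together.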
Detection
- Pager from OOMKilled alerts
- Grafana memory trend showed linear growth
Response
- Rolled back to v3.6
- Increased memory limit temporarily
What Went Well
- Quick detection via OOMKilled alerts
- Heap dump analysis identified root cause rapidly
What Went Poorly
- Code review did not catch the missing eviction
- No load testing with realistic cache growth
Action Items
- [ ] Add TTL/eviction to cache (P0)
- [ ] Add memory leak detection in CI (P1)
- [ ] Review changelog for similar patterns (P2)
Follow-ups
- P0: Add TTL or LRU eviction to in-process request cache in order-svc. Owner: order-svc team
- P1: Add memory leak detection test (e.g., heap growth over time) to CI pipeline. Owner: platform team
- P1: Review all recent deployments for similar unbounded caches or missing eviction. Owner: service owners
- P2: Set up alert on memory growth rate for order-svc (e.g., >50Mi/hour sustained; this leak grew at roughly 64Mi/hour, below a 100Mi/hour threshold). Owner: on-call SRE
- P2: Update postmortem with exact timestamps and customer impact numbers. Owner: incident commander
Similar past incidents
lexical match (pg_trgm)
- 83% [Eval][v1][en] Order service OOM crashloop following v3.7 deploy
  Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 75% [Eval][v2][en] Order service OOM crashloop following v3.7 deploy
  Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 32% [Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
  Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 30% [Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
  Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29% [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts