[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
service: order-svc
created: 5/25/2026, 10:20:36 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC: order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12: first OOMKilled pod
- 09:14: pager
- 09:16: service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
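order-svc pods are being OOM-killed because of a memory leak introduced in v3.7. A new in-process request cache (Map-based, no eviction) grows without bound: a heap profile shows it holding ~340Mi, and pod memory climbs linearly from 180Mi toward the 512Mi limit in roughly five hours. 3 of 12 pods are currently OOMKilled and entering CrashLoopBackOff, restarts have risen to 18/hour, and checkout success rate has dipped from 99.7% to 96.8%. The incident is ongoing.
Severity reasoning: User-facing degradation (checkout success rate down to 96.8%) but not a full outage; revenue impact is possible but limited. Scope: the subset of checkout requests routed to failing pods. Because all pods were rolled within the same eight-minute window and HPA is already at max replicas (12), the remaining 9 pods are likely to approach the limit soon, so impact may worsen. Reversibility: a rollback or a fix can restore service.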
deepseek-chat·prompt v1·output: en·10555ms
Root cause hypotheses
- high: In-process request cache introduced in v3.7 has no eviction policy, causing unbounded memory growth.
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID; code change added Map-based cache without eviction.
- medium: Memory limit (512Mi) is too low for the new cache, but the cache is the primary driver.
Evidence: Pre-deploy baseline stable at 200Mi; after deploy, memory grows linearly to 500Mi and the pod is then OOM-killed.
- low: JSON parser switch (fast-json-stringify) causes memory fragmentation or leaks.
Evidence: No direct evidence; heap profile points to Map, not JSON buffers.
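The high-confidence hypothesis can be illustrated with a minimal sketch. It is a hypothetical reconstruction of the pattern the changelog describes, not the order-svc source; `requestCache` and `handleRequest` are illustrative names. Because request IDs are unique, the cache gains one entry per request and almost never serves a hit.

```javascript
// Hypothetical reconstruction of the leak pattern; names are illustrative
// and not taken from the order-svc source.
const requestCache = new Map(); // module-level, lives as long as the process

function handleRequest(requestId, payload) {
  // Request IDs are unique, so this lookup almost never hits.
  if (requestCache.has(requestId)) {
    return requestCache.get(requestId);
  }
  const response = { requestId, echoed: payload };
  requestCache.set(requestId, response); // entry is never deleted
  return response;
}

// Simulate 10,000 requests with unique IDs.
for (let i = 0; i < 10000; i++) {
  handleRequest(`req-${i}`, { item: i });
}

console.log(requestCache.size); // 10000: one retained entry per request served
```

If the real cache follows this shape, memory would grow with request volume, which would be consistent with the steady linear trend in the raw context.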
Investigation checklist
- Confirm the cache is the source of memory growth by inspecting heap dump from a running pod.
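kubectl exec -n <namespace> <pod-name> -- sh -c 'kill -USR2 1' && kubectl exec -n <namespace> <pod-name> -- sh -c 'ls -lh *.heapsnapshot'
Assumes a Node.js process running as PID 1 and started with --heapsnapshot-signal=SIGUSR2; the snapshot is written to the process's working directory. Copy it out with kubectl cp and open it in Chrome DevTools (Memory tab). Writing a snapshot can need about as much extra memory as the heap itself, so pick a pod with headroom.
Expected: Large Map object (keyed by request ID) consuming >300Mi.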
- Check if the cache has any eviction or size limit in code.
kubectl exec -n <namespace> <pod-name> -- cat /app/index.js | grep -E 'Map|evict|max|limit'
Expected: No eviction logic found; Map is unbounded.
- Verify memory limit and request settings for the deployment.
kubectl get deployment order-svc -n <namespace> -o json | jq '.spec.template.spec.containers[0].resources'
Expected: limits.memory: 512Mi, requests.memory: 256Mi
- Check if the cache is shared across requests (global variable) or per-request.
kubectl exec -n <namespace> <pod-name> -- cat /app/index.js | grep -E 'const cache|let cache|var cache'
Expected: Global Map variable, not scoped per request.
Mitigation plan
Rollback order-svc to v3.6 (previous stable version) to stop memory leak immediately.
Risk: Brief traffic disruption during rollback; no data loss as cache is in-memory only.
Rollback: If v3.6 causes issues, re-deploy v3.7 (accepting that the leak returns); otherwise ship v3.7 again only after the cache is fixed.
If rollback is not possible, increase the memory limit to 1Gi temporarily to buy time.
Risk: Higher resource usage; may cause node pressure. Does not fix leak.
Rollback: Revert memory limit to 512Mi after fix is deployed.
Disable the cache via feature flag or environment variable if available.
Risk: May increase latency if cache was reducing load; no data loss.
Rollback: Re-enable cache after fix.
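To gauge how much time the temporary limit increase would buy, the arithmetic can be sketched from the figures in the raw context. This assumes growth stays linear at the observed rate and that a freshly started pod begins near 180Mi; actual growth depends on per-pod traffic. The function names are illustrative.

```javascript
// Observed trend: 180Mi at 04:00 UTC to 500Mi at 09:00 UTC.
const growthMiPerHour = (500 - 180) / 5; // 64 Mi/h

// Hours until a pod starting at startMi reaches limitMi,
// assuming growth stays linear (an assumption, not a guarantee).
function hoursUntilLimit(startMi, limitMi, rateMiPerHour) {
  return (limitMi - startMi) / rateMiPerHour;
}

// Format fractional hours as "Xh Ym", rounded to the nearest minute.
function formatHours(hours) {
  const totalMinutes = Math.round(hours * 60);
  return `${Math.floor(totalMinutes / 60)}h ${totalMinutes % 60}m`;
}

console.log(formatHours(hoursUntilLimit(180, 512, growthMiPerHour)));  // 5h 11m
console.log(formatHours(hoursUntilLimit(180, 1024, growthMiPerHour))); // 13h 11m
```

Under these assumptions, the 512Mi estimate lands at about 09:11 UTC, close to the first OOMKilled pod at 09:12. Raising the limit to 1Gi would extend each pod's lifetime to roughly 13 hours after a restart, which delays the crash without removing the leak.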
Customer impact
Approximately 3% of checkout requests are failing (success rate down from 99.7% to 96.8%). Users may see errors during checkout. The service is partially degraded; the remaining pods handle most traffic.
Postmortem draft
Postmortem: order-svc OOMKilled due to unbounded cache
Summary
order-svc v3.7 introduced an in-process request cache without eviction, causing memory growth and OOM kills.
Timeline
- 04:00 UTC: v3.7 deployed
- 09:12 UTC: first OOMKilled pod
- 09:14 UTC: pager triggered
- 09:16 UTC: service still serving most traffic via remaining pods; checkout success rate down from 99.7% to 96.8%
- [Mitigation time]: rollback to v3.6
Impact
- 3 of 12 pods in CrashLoopBackOff
- Checkout success rate: 99.7% → 96.8%
- Restarts: 18/hour (baseline 0–1); HPA at max replicas (12)
Root Cause
Unbounded Map cache in order-svc v3.7 (commit [hash]). No eviction or size limit.
What Went Well
- Monitoring detected OOM quickly
- Heap dump captured before crash
What Went Poorly
- Code review did not catch the missing eviction policy
- No memory regression test before deploy
Action Items
- [ ] Add eviction policy (TTL or LRU) to cache
- [ ] Add memory usage alerts at 80% of the container limit
- [ ] Add load test with sustained traffic
- [ ] Enforce code review checklist for resource management
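As a starting point for the eviction action item, here is a minimal sketch of a size-bounded cache with LRU eviction and a TTL, built on the insertion order of a plain Map. `BoundedCache` and its parameters (`maxEntries`, `ttlMs`, `now`) are hypothetical names chosen for illustration; suitable limits would need to be derived from the service's real entry sizes and traffic.

```javascript
// Sketch of an LRU + TTL cache; not the order-svc implementation.
class BoundedCache {
  constructor({ maxEntries = 1000, ttlMs = 60000, now = Date.now } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful in tests
    this.entries = new Map(); // insertion order doubles as recency order
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired entries are dropped lazily on read
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  get size() {
    return this.entries.size;
  }
}

const cache = new BoundedCache({ maxEntries: 2, ttlMs: 60000 });
cache.set("req-1", "a");
cache.set("req-2", "b");
cache.get("req-1");              // touch req-1 so req-2 becomes the oldest
cache.set("req-3", "c");         // evicts req-2
console.log(cache.size);         // 2
console.log(cache.get("req-2")); // undefined
```

Expired entries are only removed when read, but `maxEntries` still caps the total size, so memory stays bounded even if nothing is ever read back.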
Follow-ups
- P0: Add eviction policy (TTL or LRU) to the request cache in order-svc. (owner: service owner)
- P1: Add memory usage alert at 80% of container limit. (owner: on-call SRE)
- P1: Add load test with sustained traffic to detect memory leaks before deploy. (owner: platform team)
- P2: Update code review checklist to include resource management (caches, connections). (owner: service owner)
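The sustained-traffic load test needs some pass/fail criterion for memory growth. One possible sketch is a least-squares slope over periodic memory samples, failing the run when the slope exceeds a threshold. The function names, the sample shape, and the 5 Mi/h threshold are assumptions for illustration; the hourly values below are interpolated from the linear trend described in the raw context (180Mi to 500Mi over five hours), not measured data.

```javascript
// Least-squares slope of memory (Mi) over time (hours).
function growthSlope(samples) {
  if (samples.length < 2) {
    throw new Error("need at least two samples");
  }
  const n = samples.length;
  const meanT = samples.reduce((sum, s) => sum + s.hours, 0) / n;
  const meanM = samples.reduce((sum, s) => sum + s.memMi, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const s of samples) {
    numerator += (s.hours - meanT) * (s.memMi - meanM);
    denominator += (s.hours - meanT) ** 2;
  }
  return numerator / denominator;
}

// Flag a run whose memory grows faster than maxSlopeMiPerHour.
function looksLikeLeak(samples, maxSlopeMiPerHour) {
  return growthSlope(samples) > maxSlopeMiPerHour;
}

// Interpolated from the incident trend; illustrative, not measured.
const v37Samples = [180, 244, 308, 372, 436, 500].map((memMi, hours) => ({ hours, memMi }));
// Pre-deploy baseline described as stable at 200Mi.
const baselineSamples = [200, 200, 200, 200, 200, 200].map((memMi, hours) => ({ hours, memMi }));

console.log(growthSlope(v37Samples));            // 64
console.log(looksLikeLeak(v37Samples, 5));       // true
console.log(looksLikeLeak(baselineSamples, 5));  // false
```

A slope check like this only helps if the test runs long enough for growth to rise above normal noise; a short smoke test would likely have passed v3.7.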
Similar past incidents
Lexical match (pg_trgm):
- 63%: [Eval][v2][en] Order service OOM crashloop following v3.7 deploy
  Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 29%: [Eval][v2][en] Regional 5xx spike after DNS TTL change
  us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 28%: [Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
  Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 28%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 26%: [datadog] checkout p99 8s