[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
service: order-svc
created: 6/2/2026, 11:52:28 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.
Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)
Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
Deploy history:
- 04:00 UTC: order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"
Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared
On-call:
- 09:12: first OOMKilled pod
- 09:14: pager
- 09:16: service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled every ~20 minutes due to a memory leak introduced in v3.7. The leak is caused by an in-process request cache (Map-based, no eviction) that grows unboundedly. 3 of 12 pods are in CrashLoopBackOff, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%.
Severity reasoning: User-facing outage on the revenue path: the checkout error rate is 3.2% (success rate 96.8%), above the 1% threshold, and has persisted for more than 5 minutes. This meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·10558ms·1756↑ / 1634↓ tok·$0.00227
Root cause hypotheses
- high: Memory leak from in-process request cache (Map-based, no eviction) introduced in v3.7
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over ~5h, then OOM. Changelog confirms 'added in-process request cache (Map-based, no eviction)'.
- medium: Increased request rate causing more cache entries to accumulate faster
Evidence: The cache is keyed on request ID, so higher traffic would fill it faster. The incident context includes no traffic data, so this is unconfirmed. The pre-deploy baseline was stable at 200Mi, so request rate alone cannot explain the leak; it only affects how quickly the cache grows.
- low: Memory limit too low for new cache overhead
Evidence: Memory limit is 512Mi, and pods crash at ~500Mi. Pre-deploy baseline was 200Mi. If the cache were bounded, 512Mi might be sufficient, but unbounded growth will eventually exceed any limit.
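The linear-growth evidence above can be turned into a rough time-to-OOM estimate. The sketch below uses only the figures from the memory trend (180Mi at 04:00 UTC, 500Mi at 09:00 UTC) and assumes the growth stays linear at the same rate; the helper name `hours_to_limit` is illustrative, not part of any tooling mentioned in the incident.

```shell
#!/bin/sh
# Figures taken from the incident's memory trend.
start_mi=180   # usage at 04:00 UTC, right after the v3.7 rollout
end_mi=500     # usage at 09:00 UTC, just before the first OOM kill
hours=5        # elapsed time between the two samples

# Average growth rate in Mi per hour (integer arithmetic).
rate=$(( (end_mi - start_mi) / hours ))

# hours_to_limit LIMIT_MI  (hypothetical helper)
# Whole hours for a pod starting at start_mi to reach LIMIT_MI,
# assuming the same linear growth continues.
hours_to_limit() {
  echo $(( ($1 - start_mi) / rate ))
}

echo "growth rate: ${rate} Mi/h"
echo "hours to 512Mi limit: $(hours_to_limit 512)"
echo "hours to 1Gi limit: $(hours_to_limit 1024)"
```

Under these assumptions the rate is about 64Mi per hour. A pod starting near 180Mi would reach the 512Mi limit in roughly 5 hours, which matches the 04:00 deploy and 09:12 first OOM kill, and a 1Gi limit would only postpone the kill to roughly 13 hours.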
Investigation checklist
- Check memory usage trend per pod to confirm linear growth pattern
kubectl top pod -n prod -l app=order-svc --sort-by=memory --no-headers | head -12
Expected: Memory usage near 500Mi for running pods, with recent restarts showing lower values
- Inspect heap profile of a running pod to confirm Map size
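kubectl exec -n prod deploy/order-svc -- sh -c 'curl -s "localhost:6060/debug/pprof/heap?debug=1" | grep -A5 "map\[" | head -20'
Expected: Large number of entries in a map keyed by request ID, no eviction logic
Note: this assumes the pod exposes a Go-style pprof endpoint on port 6060, which the incident context does not confirm. The changelog (fast-json-stringify, Map) suggests a Node.js service; if so, capture a heap snapshot the same way the existing heap profile was taken.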
- Check deploy rollout status and verify v3.7 is fully rolled out
kubectl rollout status deploy/order-svc -n prod --revision=$(kubectl rollout history deploy/order-svc -n prod | tail -2 | head -1 | awk '{print $1}')
Expected: deployment "order-svc" successfully rolled out
- Check if any recent config changes or feature flags enabled the cache
kubectl get configmap -n prod order-svc-config -o yaml | grep -i cache
Expected: No cache-related config or a flag that enables the cache
- Verify HPA metrics and current replica count
kubectl get hpa -n prod order-svc-hpa -o wide
Expected: Current replicas at max (12), target memory utilization > 80%
Mitigation plan
Roll back order-svc to v3.6 (previous stable version) to remove the memory leak
Risk: Rollback may cause brief downtime during pod replacement; ensure traffic drains properly. No data loss expected.
Rollback: If v3.6 causes issues, redeploy v3.6 with the same config first. Re-applying v3.7 is a last resort because it reintroduces the leak.
If rollback is not immediate, scale up HPA max replicas to 16 to distribute load and reduce per-pod memory pressure
Risk: Temporary increase in resource usage; may not prevent OOM if leak continues, but buys time.
Rollback: Revert HPA max replicas to 12 after rollback or fix.
Raise the memory limit from 512Mi to 1Gi temporarily to delay OOM kills while investigating (at the observed growth of about 64Mi per hour, roughly 8 extra hours per pod)
Risk: Higher memory usage on nodes; may cause node pressure if many pods run. Monitor node memory.
Rollback: Revert limit to 512Mi after fix is deployed.
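The three mitigation steps above could be run with commands along the following lines. This is a sketch, not a verified runbook: the namespace `prod`, the deployment `order-svc`, and the HPA name `order-svc-hpa` are taken from the investigation checklist, and it assumes the previous Deployment revision is v3.6. The `run` wrapper and the `APPLY` variable are hypothetical helpers that make the script print commands by default instead of executing them.

```shell
#!/bin/sh
# Dry-run by default: each command is printed; set APPLY=1 to execute it.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

# Step 1: roll back to the previous revision (assumed to be v3.6) and wait.
rollback() {
  run kubectl rollout undo deploy/order-svc -n prod
  run kubectl rollout status deploy/order-svc -n prod
}

# Step 2 (fallback): raise HPA max replicas from 12 to 16.
raise_hpa_max() {
  run kubectl patch hpa order-svc-hpa -n prod --type merge \
    -p '{"spec":{"maxReplicas":16}}'
}

# Step 3 (fallback): raise the memory limit to 1Gi; this triggers a rolling restart.
raise_memory_limit() {
  run kubectl set resources deploy/order-svc -n prod --limits=memory=1Gi
}

# With no argument, print the whole plan without executing anything.
case "${1:-plan}" in
  rollback) rollback ;;
  hpa)      raise_hpa_max ;;
  limit)    raise_memory_limit ;;
  plan)     APPLY=0; rollback; raise_hpa_max; raise_memory_limit ;;
esac
```

Saved as, say, `mitigate.sh`, running `sh mitigate.sh` prints the plan, and `APPLY=1 sh mitigate.sh rollback` would perform only the rollback. Keeping the fallbacks as separate subcommands avoids applying the HPA or limit change when the rollback alone resolves the incident.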
Customer impact
Approximately 3% of checkout requests are failing or timing out (success rate 96.8%, down from 99.7%). Users may see errors during order placement. Service is partially degraded: 3 of 12 pods (25%) are down, and the remaining pods are serving most traffic but with increased latency.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 04:00 — order-svc v3.7 deployed
- 09:12 — first OOMKilled pod
- 09:14 — pager triggered
- 09:16 — on-call acknowledges, checkout success rate drops to 96.8%
- [FILL IN] — rollback initiated
- [FILL IN] — service stable
Impact
- 3 of 12 pods in CrashLoopBackOff
- Checkout success rate: 99.7% → 96.8%
- p99 latency degraded
- Estimated 3% of requests timing out
Root Cause
Memory leak from in-process request cache (Map-based, no eviction) introduced in v3.7. Cache entries accumulate indefinitely, causing pod memory to grow linearly until OOM.
Detection
- Pod OOMKilled events triggered pager
- Memory trend showed linear growth from 180Mi to 500Mi over 5 hours
Response
- [FILL IN]
What Went Well
- [FILL IN]
What Went Poorly
- [FILL IN]
Action Items
- [FILL IN]
Follow-ups
- P1: Add memory leak detection test for in-process caches in CI/CD pipeline (owner: platform team)
- P1: Implement bounded cache with TTL or LRU eviction in order-svc (owner: service owner)
- P2: Set up memory usage alert with linear growth detection (e.g., PromQL predict_linear) (owner: on-call SRE)
- P2: Review deploy process to include canary or gradual rollout for memory-sensitive changes (owner: platform team)
- P2: Add heap profile snapshot to pod termination logs for future debugging (owner: service owner)
Similar past incidents
lexical match (pg_trgm)
- 90%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 68%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 32%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts