[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
service: order-svccreated: 6/2/2026, 11:37:01 PM
Raw incident context
Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason. Kubernetes: - Pods: 12 total, 3 currently OOMKilled, 9 running - Memory limit: 512Mi, requests: 256Mi - Recent restarts/hour: 18 (up from baseline of 0–1) - HPA scaling has hit max replicas (12) Memory trend (last 24h): - Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash - Pre-deploy baseline: stable at 200Mi indefinitely Deploy history: - 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08) - Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)" Logs: - No FATAL/ERROR pattern; pods exit silently - Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared On-call: - 09:12 — first OOMKilled pod - 09:14 — pager - 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%
Summary
order-svc pods are being OOMKilled in a restart loop due to a memory leak introduced in v3.7. The leak is caused by an in-process request cache (Map-based, no eviction) that grows unboundedly. 3 of 12 pods are down, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%.
Severity reasoning: User-facing outage: checkout success rate >1% (3.2% errors) for >5 min, revenue path broken (checkout failures), and error rate exceeds 1% threshold. Meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·10713ms·1756↑ / 1582↓ tok·$0.00221
Root cause hypotheses
- highIn-process request cache (Map-based, no eviction) introduced in v3.7 grows unboundedly, consuming all pod memory.
Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory trend shows linear growth from 180Mi to 500Mi over 5 hours, then OOM.
- mediumMemory limit (512Mi) is too low for the new cache, but cache is intended to be bounded.
Evidence: Pre-deploy baseline was stable at 200Mi. Cache was added without eviction policy, so it's not a limit issue but a leak.
- lowExternal dependency (e.g., database connection pool) is leaking connections, causing memory pressure.
Evidence: No connection pool errors in logs; heap profile shows Map as primary consumer, not connection objects.
Investigation checklist
- Check current pod status and memory usage across all replicas.
kubectl top pod -n prod -l app=order-svc --sort-by=memoryExpected: Memory usage near 500Mi for running pods; OOMKilled pods show 0 memory.
- Verify the heap profile from a captured pod to confirm Map memory dominance.
kubectl exec -n prod -l app=order-svc -- sh -c 'curl -s localhost:6060/debug/pprof/heap?debug=1 | head -100'Expected: Output shows 'Map' entries consuming ~340Mi, with no eviction logic.
- Check deploy changelog to confirm the cache addition and absence of eviction.
kubectl rollout history deployment/order-svc -n prod --revision=3 | grep -i cacheExpected: Line: 'added in-process request cache (Map-based, no eviction)'
- Monitor memory growth rate on a running pod to confirm linear trend.
kubectl exec -n prod -l app=order-svc -- sh -c 'while true; do cat /sys/fs/cgroup/memory/memory.usage_in_bytes; sleep 60; done'Expected: Memory increases by ~1-2Mi per minute, consistent with leak.
- Check if any recent traffic spike correlates with faster growth.
kubectl logs -n prod -l app=order-svc --since=1h | grep -c 'request'Expected: Request count per minute; compare with memory growth rate.
Mitigation plan
Rollback order-svc to previous version (v3.6) to remove the leaky cache.
Risk: Rollback will cause brief downtime during pod replacement; cache benefits lost. Safe operation.
Rollback: Re-deploy v3.7 if rollback causes issues; or re-apply with fixed cache.
If rollback not immediately possible, increase memory limit to 1Gi to buy time.
Risk: Higher memory usage may cause node pressure; temporary fix only.
Rollback: Revert memory limit to 512Mi after rollback or fix.
Scale up HPA max replicas to 16 to spread load and reduce per-pod memory pressure.
Risk: More pods may increase database connection count; monitor for connection pool exhaustion.
Rollback: Reduce max replicas back to 12.
Customer impact
Approximately 25% of checkout requests are failing or timing out. Users see 'Checkout failed' errors. Estimated 3% of all users affected. No data loss. ETA for full recovery: 30 minutes after rollback.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 04:00 - order-svc v3.7 deployed with in-process request cache (no eviction)
- 09:12 - First pod OOMKilled
- 09:14 - Pager triggered
- 09:16 - Incident declared; checkout success rate drops to 96.8%
- [FILL IN] - Rollback initiated
- [FILL IN] - All pods stable, success rate recovers
Impact
- 3 of 12 pods OOMKilled in restart loop
- p99 latency degraded
- Checkout success rate: 99.7% → 96.8%
- Duration: [FILL IN]
Root Cause
Memory leak in in-process request cache (Map-based, no eviction) introduced in v3.7. Cache entries never removed, causing unbounded memory growth until OOM.
Detection
Pager alert for OOMKilled pods; success rate monitoring showed dip.
Response
- Identified memory leak via heap profile
- Rolled back to v3.6
- Increased memory limit temporarily
What Went Well
- Quick detection via monitoring
- Heap profile available for diagnosis
What Went Poorly
- No eviction policy in cache design
- No memory limit testing before deploy
Action Items
- [FILL IN]
Follow-ups
- P0Add eviction policy (TTL or LRU) to in-process cache before re-enabling.— service owner
- P1Set up memory usage alert at 80% of limit to warn before OOM.— on-call SRE
- P1Add memory leak detection test in CI (e.g., heap profile diff).— platform team
- P2Review deploy process for performance regression testing.— service owner
Similar past incidents
lexical match (pg_trgm)
- 85%
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 63%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out
- 30%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 30%
[datadog] checkout p99 8s
checkout p99 8s
- 29%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts