[Eval][v2][en] Order service OOM crashloop following v3.7 deploy

service: order-svc
created: 6/2/2026, 11:52:55 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
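
The growth figures above imply a leak rate and a time to OOM that can be checked with simple arithmetic. The sketch below uses only the numbers reported in this incident context; treating growth as perfectly linear is an assumption, and the function name is illustrative.

```typescript
// Figures from the incident context; the linear model is an assumption.
const startMi = 180;     // memory at 04:00 UTC (deploy time)
const observedMi = 500;  // memory at 09:00 UTC
const elapsedHours = 5;  // 04:00 -> 09:00
const limitMi = 512;     // container memory limit

// Average growth rate in Mi per hour.
const growthMiPerHour = (observedMi - startMi) / elapsedHours;

// Hours for a pod starting at `fromMi` to reach `capMi` at a constant rate.
function hoursUntilLimit(fromMi: number, rateMiPerHour: number, capMi: number): number {
  return (capMi - fromMi) / rateMiPerHour;
}

const hoursToOom = hoursUntilLimit(startMi, growthMiPerHour, limitMi);
console.log(`growth: ${growthMiPerHour} Mi/h, limit reached after about ${hoursToOom.toFixed(2)} h`);
```

Under this model the rate is 64 Mi/h and the 512Mi limit is reached roughly 5.19 hours after deploy, which is close to 09:11 UTC and consistent with the first OOMKilled pod at 09:12.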

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV1

order-svc pods are being OOMKilled because of a memory leak introduced in v3.7; restarts have risen to 18 per hour from a baseline of 0–1. The leak comes from an unbounded in-process request cache (Map-based, no eviction) that grows with request volume: memory climbed linearly from 180Mi at deploy time to about 500Mi five hours later, against a 512Mi limit. 3 of 12 pods are in CrashLoopBackOff, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%.

Severity reasoning: User-facing degradation on the revenue path: the checkout error rate is 3.2% (success rate 96.8%), above the 1% threshold. Meets SEV1 criteria if the dip persists beyond 5 minutes; it was first recorded at 09:16, four minutes after the first OOMKilled pod.

deepseek-chat·prompt v2·output: en·10284ms·1756↑ / 1524↓ tok·$0.00215

Root cause hypotheses

  • high: Unbounded in-process request cache introduced in v3.7 causes memory leak

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over 5 hours, then OOM. Changelog confirms 'Map-based, no eviction'.

  • medium: Memory limit too low for normal traffic after v3.7 changes

    Evidence: Memory limit is 512Mi, but pre-deploy baseline was 200Mi. The switch to fast-json-stringify may change per-request memory, but the heap profile shows the cache dominates (~340Mi).

  • low: External dependency (e.g., database) causing request accumulation

    Evidence: Logs show no FATAL/ERROR pattern, and nothing in the incident context points to slow queries or connection pool exhaustion. The cache is in-process, not external.
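
To illustrate the leading hypothesis, here is a minimal sketch of how a Map keyed on request ID grows without bound. It is a hypothetical reconstruction, not the v3.7 code: `requestCache` and `handleRequest` are invented names, and it assumes every request carries a unique ID.

```typescript
// Hypothetical reconstruction of the leak pattern; not the actual v3.7 code.
const requestCache = new Map<string, string>();

function handleRequest(requestId: string, body: string): string {
  const cached = requestCache.get(requestId);
  if (cached !== undefined) return cached;

  const response = `processed:${body}`;
  requestCache.set(requestId, response); // never deleted, so the Map only grows
  return response;
}

// Simulate traffic in which every request has a unique ID (an assumption).
for (let i = 0; i < 10000; i++) {
  handleRequest(`req-${i}`, "payload");
}

console.log(requestCache.size); // 10000: one retained entry per request
```

If request IDs are unique, such a cache never produces a hit and retains one entry per request, which would match the linear growth seen in the memory trend.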

Investigation checklist

  1. Check memory usage trend per pod
    kubectl top pod -n prod -l app=order-svc --sort-by=memory | head -15

    Expected: Memory usage near 500Mi for running pods, confirming leak

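  2. Inspect a heap snapshot for cache size (the changelog's fast-json-stringify and Map point to a Node.js runtime, so JVM tools such as jcmd and jhat do not apply)
    kubectl exec -n prod <pod> -- kill -USR1 1 && kubectl port-forward -n prod <pod> 9229:9229

    Expected: Map entries keyed on request ID dominating retained heap, with no eviction logic. Assumes node runs as PID 1 in the container; SIGUSR1 starts the Node.js inspector, and the snapshot is then taken from a DevTools client attached to the forwarded port. Snapshotting a pod close to its memory limit can itself trigger an OOM kill, so prefer the heap profile already captured where it suffices.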

  3. Check deploy rollout status
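    kubectl rollout history deploy/order-svc -n prod

    Expected: Latest revision corresponds to v3.7 (rolled out 04:00 to 04:08 UTC). To inspect one revision, pass its number, e.g. --revision=<N>; the flag does not accept "current".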

  4. Check HPA metrics
    kubectl describe hpa order-svc -n prod

    Expected: Current replicas at max (12) with the ScalingLimited condition set; memory utilization appears only if the HPA targets a memory metric

  5. Check if cache is configurable via feature flag
    kubectl exec -n prod deploy/order-svc -- env | grep CACHE

    Expected: No env var for cache eviction or size limit

Mitigation plan

  • Roll back order-svc to v3.6 (pre-cache version) to stop the memory leak immediately. Command: kubectl rollout undo deploy/order-svc -n prod

    Risk: Brief traffic disruption during the rolling update; v3.6 also reverts the JSON parser change.

    Rollback: Re-deploy v3.7 if v3.6 causes issues (unlikely).

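  • If rollback is not possible, temporarily add replicas so each pod receives less traffic and fills its cache more slowly. The HPA is already at its maximum of 12 and would revert a manual kubectl scale, so raise maxReplicas instead. Command: kubectl patch hpa order-svc -n prod -p '{"spec":{"maxReplicas":20}}'

    Risk: Only delays OOM; does not fix leak. May hit cluster resource limits.

    Rollback: Restore the previous maximum after the fix. Command: kubectl patch hpa order-svc -n prod -p '{"spec":{"maxReplicas":12}}'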

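  • Add a memory metric to the existing HPA so scaling starts before pods reach the limit

    Risk: May cause rapid scaling; the leak grows regardless of replica count, so the fix is still needed.

    Rollback: Remove the memory metric from the HPA spec. Avoid kubectl delete hpa order-svc -n prod, which would remove the existing autoscaling as well.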

Customer impact

Approximately 3.2% of checkout requests are failing or timing out (success rate 96.8%, down from a 99.7% baseline). Users may see 'order failed' errors or slow page loads. 3 of 12 pods (25%) are OOMKilled, but the remaining pods are serving most traffic. No data loss expected.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 04:00: order-svc v3.7 deployed with unbounded cache
  • 09:12: First OOMKilled pod
  • 09:14: Pager triggered
  • 09:16: Incident declared (SEV1)
  • [FILL IN] Rollback initiated
  • [FILL IN] Service restored

Impact

  • Checkout success rate dropped from 99.7% to 96.8% (duration: [FILL IN])
  • 3 of 12 pods in CrashLoopBackOff
  • p99 latency degraded

Root Cause

Memory leak from unbounded in-process request cache (Map-based, no eviction) introduced in v3.7.

Detection

Pager triggered by OOMKilled pod alerts. Success rate drop detected via monitoring.

Response

  • Rollback to v3.6
  • [FILL IN] Additional steps

What Went Well

  • Quick detection via pod OOM alerts
  • Heap dump analysis identified root cause

What Went Poorly

  • No memory leak testing in staging
  • No cache eviction or size limit
  • HPA max replicas reached

Action Items

  • [FILL IN] Add cache eviction (TTL or LRU)
  • [FILL IN] Add memory limit alerts at 80%
  • [FILL IN] Add load test with sustained traffic
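
The cache eviction action item could take a form like the sketch below, which combines a TTL with an LRU size cap. It is a minimal illustration under stated assumptions, not the service's implementation: `BoundedCache`, its parameters, and the injected clock are all hypothetical names chosen for the example.

```typescript
type Entry<V> = { value: V; expiresAt: number };

// Hypothetical bounded cache: entries expire after ttlMs, and the least
// recently used entry is evicted once maxEntries is exceeded.
class BoundedCache<V> {
  private readonly entries = new Map<string, Entry<V>>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // drop expired entry on access
      return undefined;
    }
    // Re-insert so the key becomes most recently used
    // (a Map iterates in insertion order).
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage with a fake clock so the behavior is deterministic.
let clock = 0;
const cache = new BoundedCache<string>(2, 1000, () => clock);
cache.set("req-a", "A");
cache.set("req-b", "B");
cache.get("req-a");      // touch req-a, so req-b becomes least recently used
cache.set("req-c", "C"); // exceeds the cap and evicts req-b
console.log(cache.size); // 2
```

The size cap bounds memory even under a burst of unique keys, while the TTL clears entries that are never read again. Expired entries are removed only on access here; a periodic sweep would be a separate design choice.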

Follow-ups

  • P0: Add TTL-based eviction to in-process cache (owner: service owner)
  • P1: Set memory alert at 80% of the 512Mi limit (~410Mi) for order-svc (owner: on-call SRE)
  • P1: Add load test with sustained traffic to staging (owner: platform team)
  • P2: Review code review process for memory-sensitive changes (owner: service owner)
  • P2: Increase HPA max replicas to 20 (owner: platform team)
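
One way the sustained-traffic load test could flag a leak is to fit a line to memory samples taken during the run and fail when the slope exceeds a threshold. The sketch below is illustrative only: the sample series are synthetic (the leaking series interpolates the two reported endpoints, 180Mi and 500Mi over 5 hours), and the 5 Mi/h threshold and function names are assumptions.

```typescript
type Sample = { hour: number; mi: number };

// Least-squares slope of memory usage over time, in Mi per hour.
function slopeMiPerHour(samples: Sample[]): number {
  const n = samples.length;
  const meanX = samples.reduce((sum, p) => sum + p.hour, 0) / n;
  const meanY = samples.reduce((sum, p) => sum + p.mi, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.hour - meanX) * (p.mi - meanY);
    den += (p.hour - meanX) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Hypothetical pass/fail rule for a soak test.
function looksLikeLeak(samples: Sample[], thresholdMiPerHour: number): boolean {
  return slopeMiPerHour(samples) > thresholdMiPerHour;
}

// Synthetic series: linear interpolation between the reported endpoints.
const leaking: Sample[] = [0, 1, 2, 3, 4, 5].map((h) => ({ hour: h, mi: 180 + 64 * h }));
// Synthetic series: the stable pre-deploy baseline of 200Mi.
const stable: Sample[] = [0, 1, 2, 3, 4, 5].map((h) => ({ hour: h, mi: 200 }));

console.log(looksLikeLeak(leaking, 5), looksLikeLeak(stable, 5)); // true false
```

A slope check needs a run long enough for growth to stand out from normal heap fluctuation; a short staging test may not show it.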