[Eval][v1][en] Order service OOM crashloop following v3.7 deploy

service: order-svc · created: 6/2/2026, 11:51:31 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV2

order-svc pods are being OOMKilled because of a memory leak introduced in v3.7; restarts have risen to 18/hour (baseline 0–1). A new in-process request cache (Map-based, no eviction) holds ~340Mi of heap in the profiled pod, pushing pods to the 512Mi limit roughly five hours after the deploy. 3 of 12 pods are currently OOMKilled and in CrashLoopBackOff, the HPA is already at its 12-replica maximum, and checkout success rate has dropped from 99.7% to 96.8%.

Severity reasoning: User-facing degradation (checkout success rate down 2.9 percentage points) but not a full outage; revenue impact is likely but limited. Scope: the subset of users whose requests hit crashing pods. Reversibility: rolling back, or disabling the cache via a feature flag if one exists, stops the leak.

deepseek-chat·prompt v1·output: en·10499ms·1375↑ / 1454↓ tok·$0.00197

Root cause hypotheses

  • [high] In-process request cache (Map-based, no eviction) introduced in v3.7 grows unbounded, consuming all heap memory.

    Evidence: Heap profile from captured pod shows ~340Mi held by Map keyed on request ID, never cleared. Memory trend shows linear growth from 180Mi to 500Mi over 5 hours post-deploy.

  • [medium] JSON parser switch to fast-json-stringify causes memory fragmentation or retention of large strings.

    Evidence: Changelog mentions parser switch; however heap profile points to Map, not string buffers. Likely secondary factor.

  • [low] Node-level memory pressure or cgroup misconfiguration causing early OOM.

    Evidence: Memory limit is 512Mi, and pods reach 500Mi then crash, so kills happen at the configured limit rather than early. The pre-deploy baseline was stable at 200Mi under the same limit. Unlikely.
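
The memory trend behind the first hypothesis can be turned into a rough time-to-OOM estimate. The sketch below assumes growth stays linear at the rate observed between 04:00 and 09:00 UTC; the function and variable names are illustrative, not part of order-svc.

```typescript
// Linear time-to-limit estimate from two memory samples.
interface MemorySample {
  hoursSinceDeploy: number;
  usedMi: number;
}

// Growth rate in Mi per hour between two samples.
function growthRateMiPerHour(a: MemorySample, b: MemorySample): number {
  return (b.usedMi - a.usedMi) / (b.hoursSinceDeploy - a.hoursSinceDeploy);
}

// Hours until usage reaches limitMi, assuming a constant growth rate.
function hoursUntilLimit(usedMi: number, limitMi: number, rateMiPerHour: number): number {
  if (rateMiPerHour <= 0) return Infinity; // no growth, limit is never reached
  return Math.max(0, (limitMi - usedMi) / rateMiPerHour);
}

const atDeploy: MemorySample = { hoursSinceDeploy: 0, usedMi: 180 };    // 04:00 UTC
const beforeCrash: MemorySample = { hoursSinceDeploy: 5, usedMi: 500 }; // 09:00 UTC

const rateMiPerHour = growthRateMiPerHour(atDeploy, beforeCrash); // 64 Mi/hour
const hoursTo512Mi = hoursUntilLimit(180, 512, rateMiPerHour);    // 5.1875 hours
const hoursTo1Gi = hoursUntilLimit(180, 1024, rateMiPerHour);     // 13.1875 hours
```

At 64 Mi/hour a pod started at 04:00 UTC reaches 512Mi after about 5.2 hours, around 09:11 UTC, which is consistent with the first OOMKilled pod at 09:12. Under the same assumption, a 1Gi limit would delay the crash to roughly 13 hours after start rather than prevent it.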

Investigation checklist

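  1. Confirm the memory leak source by inspecting a heap snapshot from a running pod. The changelog (fast-json-stringify, Map) indicates a Node.js service, so the command assumes the Node.js process runs as PID 1 and was started with --heapsnapshot-signal=SIGUSR2. Writing a snapshot may need significant extra memory, so pick a pod well below the 512Mi limit.
    kubectl exec -n <namespace> <pod-name> -- kill -USR2 1 && kubectl cp <namespace>/<pod-name>:<path-to-snapshot>.heapsnapshot ./order-svc.heapsnapshot

    Expected: Heap snapshot shows Map entries keyed on request ID, with no eviction logic.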

  2. Check if the cache is configurable via feature flag or environment variable.
    kubectl get deployment order-svc -n <namespace> -o yaml | grep -i -E 'feature|flag|cache|evict'

    Expected: No feature flag found, or flag exists but is enabled.

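  3. Verify memory usage across all pods to confirm the leak is uniform. kubectl top reports a point-in-time value, so run it several times a few minutes apart and compare the readings.
    kubectl top pod -n <namespace> --sort-by=memory | grep order-svc

    Expected: Every pod's memory usage rises between samples, at a similar slope.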

  4. Check if rolling back to v3.6 resolves the issue in a canary pod (assumes a separate order-svc-canary deployment exists).
    kubectl set image deployment/order-svc-canary -n <namespace> order-svc=order-svc:v3.6

    Expected: Canary pod memory stabilizes at ~200Mi.
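
The "similar slope" check in step 3 can be made concrete by fitting a line to each pod's readings. This is a minimal sketch: the pod names, the readings, and the 25% tolerance are hypothetical values chosen for illustration, not data from the incident.

```typescript
// One memory reading for a pod: minutes since the first sample, and usage in Mi.
type Reading = { minute: number; usedMi: number };

// Ordinary least-squares slope (Mi per minute) over one pod's readings.
function slopeMiPerMinute(readings: Reading[]): number {
  const n = readings.length;
  if (n < 2) throw new Error("need at least two readings");
  const meanX = readings.reduce((sum, r) => sum + r.minute, 0) / n;
  const meanY = readings.reduce((sum, r) => sum + r.usedMi, 0) / n;
  let num = 0;
  let den = 0;
  for (const r of readings) {
    num += (r.minute - meanX) * (r.usedMi - meanY);
    den += (r.minute - meanX) * (r.minute - meanX);
  }
  if (den === 0) throw new Error("readings must span more than one timestamp");
  return num / den;
}

// True when every pod's slope is positive and within `tolerance` (relative)
// of the median slope across pods.
function leakLooksUniform(perPod: Map<string, Reading[]>, tolerance = 0.25): boolean {
  const slopes = Array.from(perPod.values()).map(slopeMiPerMinute).sort((a, b) => a - b);
  if (slopes.length === 0) return false;
  const median = slopes[Math.floor(slopes.length / 2)];
  return slopes.every((s) => s > 0 && Math.abs(s - median) <= tolerance * median);
}

// Hypothetical readings taken 10 minutes apart (illustrative values only).
const podReadings = new Map<string, Reading[]>([
  ["order-svc-a", [{ minute: 0, usedMi: 300 }, { minute: 10, usedMi: 311 }, { minute: 20, usedMi: 321 }]],
  ["order-svc-b", [{ minute: 0, usedMi: 250 }, { minute: 10, usedMi: 260 }, { minute: 20, usedMi: 271 }]],
]);
const uniform = leakLooksUniform(podReadings);
```

A pod whose slope is flat or far from the median would suggest the leak depends on traffic mix or pod age rather than being uniform across the deployment.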

Mitigation plan

  • Disable the in-process request cache via feature flag if available, or roll back to v3.6.

    Risk: Rollback may cause brief connection draining; feature flag disable may increase latency for uncached requests.

    Rollback: Re-enable cache flag or redeploy v3.7.

  • Increase memory limit to 1Gi temporarily to buy time for root cause fix.

    Risk: Higher memory usage may cause node pressure; ensure cluster capacity.

    Rollback: Revert limit to 512Mi after fix.

  • Add a TTL-based eviction to the cache and redeploy as v3.7.1.

    Risk: New code may introduce bugs; test in staging first.

    Rollback: Revert to v3.6.
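
The TTL-based eviction in the last mitigation could look roughly like the following. This is a sketch under assumptions, not the order-svc implementation: the class name, the single shared TTL, and the added maximum-entry cap are all hypothetical. The cap is included because a TTL alone still allows unbounded growth during a traffic burst.

```typescript
// Minimal TTL + max-size cache built on Map.
// Map preserves insertion order, so the first entry is always the oldest.
class BoundedTtlCache<K, V> {
  private readonly entries = new Map<K, { value: V; expiresAt: number }>();

  constructor(
    private readonly ttlMs: number,
    private readonly maxEntries: number,
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {}

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily drop an expired entry on read
      return undefined;
    }
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key); // re-insert so the key moves to the newest position
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    this.evict();
  }

  get size(): number {
    return this.entries.size;
  }

  // Remove entries from the oldest end while they are expired or the cache is over its cap.
  private evict(): void {
    const now = this.now();
    while (this.entries.size > 0) {
      const oldest = this.entries.entries().next();
      if (oldest.done) break;
      const [key, entry] = oldest.value;
      const expired = entry.expiresAt <= now;
      if (!expired && this.entries.size <= this.maxEntries) break;
      this.entries.delete(key);
    }
  }
}

// Example: keep at most 10,000 responses for 30 seconds each (illustrative numbers).
const requestCache = new BoundedTtlCache<string, string>(30_000, 10_000);
requestCache.set("req-123", "cached response");
```

Because every entry shares one TTL and `set` always re-inserts at the newest position, expired entries cluster at the front of the Map, which keeps eviction cheap on each write. Sizing the cap against the 512Mi limit would require measuring the average entry size first.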

Customer impact

Approximately 3% of checkout requests are failing (checkout success rate is 96.8%, down from 99.7%). Users may see intermittent errors during checkout. No data loss is indicated. ETA for full mitigation: 1 hour for rollback, 4 hours for patched release.

Postmortem draft

Postmortem: order-svc OOMKilled

Summary: order-svc v3.7 introduced a memory leak via an unbounded in-process request cache, causing pods to reach the 512Mi limit and be OOMKilled starting about five hours after the deploy.

Timeline:

  • 04:00 UTC: v3.7 deployed
  • 09:12: First OOMKilled pod
  • 09:14: Pager triggered
  • 09:16: Incident declared
  • [Mitigation time]: Rollback to v3.6 / cache disabled

Impact: checkout success rate 96.8% (down from 99.7%), 3 of 12 pods OOMKilled and in CrashLoopBackOff, 18 restarts/hour (baseline 0–1).

Root Cause: Map-based request cache without eviction grew unbounded, consuming heap until OOM.

What Went Well: Heap profile captured quickly; memory trend visible.

What Went Poorly: No confirmed feature flag for the cache; no memory alert fired before the first crash; code review did not catch the missing eviction.

Action Items:

  • [P0] Add TTL eviction to cache
  • [P1] Add memory usage alert at 80% of limit
  • [P1] Require feature flags for new in-memory caches
  • [P2] Review code for similar patterns
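
As a rough check on the 80% alert action item, the sketch below estimates how much warning such an alert would have given in this incident. It assumes the 64 Mi/hour linear growth implied by the memory trend (180Mi to 500Mi over 5 hours); the function names are illustrative.

```typescript
// Alert threshold in Mi for a given fraction of the memory limit.
function alertThresholdMi(limitMi: number, fraction: number): number {
  return limitMi * fraction;
}

// Minutes between the alert firing and the pod reaching its limit,
// assuming memory keeps growing at a constant rate.
function leadTimeMinutes(limitMi: number, fraction: number, growthMiPerHour: number): number {
  if (growthMiPerHour <= 0) return Infinity; // no growth, limit is never reached
  const headroomMi = limitMi - alertThresholdMi(limitMi, fraction);
  return (headroomMi / growthMiPerHour) * 60;
}

const thresholdMi = alertThresholdMi(512, 0.8);    // about 409.6 Mi
const warningMinutes = leadTimeMinutes(512, 0.8, 64); // about 96 minutes
```

Under these assumptions the alert would have fired at roughly 410Mi, about an hour and a half before the first OOMKilled pod. A faster leak shortens that window proportionally, so the threshold may need tuning per service.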

Follow-ups

  • [P0] Add TTL eviction to request cache and release v3.7.1 (owner: service owner)
  • [P1] Set up memory usage alert at 80% of limit for order-svc (owner: on-call SRE)
  • [P1] Require feature flags for any new in-memory cache in code review (owner: platform team)
  • [P2] Review all services for similar unbounded caches (owner: service owner)