[Eval][v3][en] Order service OOM crashloop following v3.7 deploy

service: order-svc
created: 6/2/2026, 11:54:34 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV1

order-svc pods are being OOMKilled due to a memory leak introduced in v3.7, which added an in-process request cache (Map-based, no eviction). Memory grew linearly from 180Mi to 500Mi over the 5 hours after the deploy and then hit the 512Mi limit; restarts have risen to 18 per hour from a baseline of 0–1. 3 of 12 pods are currently OOMKilled and in CrashLoopBackOff, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8% (roughly 3% of checkouts failing).

Severity reasoning: User-facing degradation on the revenue path: checkout success rate dropped from 99.7% to 96.8%, an error rate above 1% sustained for more than 5 minutes. This meets SEV1 criteria.

deepseek-chat·prompt v3·output: en·10920ms·2007↑ / 1707↓ tok·$0.00242

Root cause hypotheses

  • high: In-process request cache (Map-based, no eviction) added in v3.7 leaks memory by never releasing entries.

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over 5 hours post-deploy.

  • low: fast-json-stringify library has a memory leak in certain serialization patterns.

    Evidence: No direct evidence; changelog mentions switching parser. However, heap profile points to Map, not string buffers.

  • medium: Kubernetes memory limit (512Mi) is too low for the new cache behavior, although memory was stable at 200Mi before the deploy.

    Evidence: Memory limit unchanged; pre-deploy stable at 200Mi. The leak is the root cause, not the limit.
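
As a sanity check on the cache hypothesis, the growth figures from the memory trend can be extrapolated to the 512Mi limit. This is a minimal sketch that assumes growth stays linear and starts at the 04:00 UTC deploy; the function name `hoursUntilLimit` is illustrative and not part of any tooling.

```javascript
// Extrapolate linear memory growth to the container limit.
// Inputs are the figures from the incident context; linear growth is an assumption.
function hoursUntilLimit(startMi, endMi, elapsedHours, limitMi) {
  const ratePerHour = (endMi - startMi) / elapsedHours; // Mi per hour
  return { ratePerHour, hours: (limitMi - startMi) / ratePerHour };
}

const { ratePerHour, hours } = hoursUntilLimit(180, 500, 5, 512);

// Convert the offset from the 04:00 UTC deploy into a clock time.
const deployMinutes = 4 * 60;
const oomMinutes = deployMinutes + Math.round(hours * 60);
const hh = String(Math.floor(oomMinutes / 60)).padStart(2, "0");
const mm = String(oomMinutes % 60).padStart(2, "0");

console.log(`growth rate: ${ratePerHour} Mi/h`);         // growth rate: 64 Mi/h
console.log(`projected limit breach: ${hh}:${mm} UTC`);  // projected limit breach: 09:11 UTC
```

The projected breach at about 09:11 UTC lines up with the first OOMKilled pod at 09:12, which is consistent with a leak that started at the v3.7 deploy.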

Investigation checklist

  1. Confirm memory leak source by examining heap dump from a running pod.
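    POD=$(kubectl get pod -n prod -l app=order-svc -o jsonpath='{.items[0].metadata.name}'); kubectl exec -n prod "$POD" -- sh -c 'kill -USR2 1; sleep 10; ls -t *.heapsnapshot | head -1'
    Note: assumes order-svc is a Node.js process running as PID 1 and started with --heapsnapshot-signal=SIGUSR2, which writes snapshots to the working directory; copy the listed file out with kubectl cp.

    Expected: Heap snapshot shows a large Map keyed on request ID whose entries are never released.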

  2. Check memory usage trend across all pods to confirm linear growth.
    kubectl top pod -n prod -l app=order-svc --sort-by=memory | head -20

    Expected: Memory usage rises between successive runs, and pods near the 512Mi limit restart.

  3. Verify that the cache is the culprit by checking code for eviction logic.
    kubectl exec -n prod "$(kubectl get pod -n prod -l app=order-svc -o name | head -1)" -- grep -inE 'evict|delete|clear|ttl|max' /app/src/cache.js

    Expected: No eviction logic found; Map is unbounded.

  4. Check if fast-json-stringify has known memory issues.
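    kubectl exec -n prod "$(kubectl get pod -n prod -l app=order-svc -o name | head -1)" -- npm ls fast-json-stringify 2>/dev/null; curl -s 'https://registry.npmjs.org/fast-json-stringify' | jq -r '.versions | keys_unsorted | .[-5:][]'

    Expected: Prints the installed version and the last five versions listed in the registry; check the package changelog for reported memory issues.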

Mitigation plan

  • Roll back order-svc to v3.6 (the pre-deploy version) to remove the leaky cache.

    Risk: Rollback is safe; v3.6 was stable. No destructive ops. Brief traffic disruption during rollout.

    Rollback: Re-deploy v3.7 only if v3.6 itself causes issues (unlikely); doing so reintroduces the leak.

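  • If rollback is not immediate, increase memory limit to 1Gi to buy time. At the observed growth of about 64Mi/hour from 180Mi, a fresh pod would take roughly 13 hours to reach 1Gi.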

    Risk: Temporary; may mask leak and cause resource contention on node.

    Rollback: Revert limit to 512Mi after fix.

  • Add a TTL-based eviction to the cache or switch to an external cache (e.g., Redis).

    Risk: Requires code change and redeploy; not immediate.

    Rollback: Revert to v3.6 if new version has issues.
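
The eviction fix in the last mitigation item could look roughly like the sketch below. It is a hypothetical illustration, not the order-svc code: the class name `BoundedTtlCache` and the `maxEntries`, `ttlMs`, and `now` options are assumptions. It combines a TTL with a size cap that drops the oldest inserted entry (FIFO order, which is simpler than true LRU), so memory stays bounded even if expired entries are never read again.

```javascript
// Hypothetical bounded cache: entries expire after ttlMs, and the oldest
// inserted entry is dropped once maxEntries is reached. Not the order-svc code.
class BoundedTtlCache {
  constructor({ maxEntries, ttlMs, now = Date.now }) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for tests
    this.entries = new Map(); // Map iterates in insertion order
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // lazily drop expired entries on read
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    // Deleting first means a re-inserted key moves to the newest position.
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // The first key in iteration order is the oldest insertion.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get size() {
    return this.entries.size;
  }
}

// Usage with a fake clock so expiry can be shown without waiting.
let clock = 0;
const cache = new BoundedTtlCache({ maxEntries: 2, ttlMs: 1000, now: () => clock });
cache.set("req-1", "a");
cache.set("req-2", "b");
cache.set("req-3", "c"); // evicts req-1, the oldest entry
console.log(cache.get("req-1")); // undefined
console.log(cache.size); // 2
clock = 1000;
console.log(cache.get("req-2")); // undefined (expired)
```

Suitable values for the cap and TTL depend on entry size and request rate, neither of which is given in the incident context.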

Customer impact

Approximately 3% of checkout requests are timing out or failing, so users may see errors when placing orders. The first pod was OOMKilled at 09:12 UTC and the issue is ongoing. The number of affected users is not yet known; checkout success rate dropped from 99.7% to 96.8%.

Postmortem draft

Summary

order-svc v3.7 introduced a memory leak via an unbounded in-process request cache, causing OOMKilled pods and degraded checkout success rate.

Timeline (UTC)

  • 04:00 - Deploy v3.7 (rolling, complete 04:08)
  • 04:00-09:00 - Memory grows linearly from 180Mi to 500Mi
  • 09:12 - First OOMKilled pod
  • 09:14 - Pager triggered
  • 09:16 - Service still serving most traffic via remaining pods; checkout success rate at 96.8%
  • [FILL IN] - Rollback initiated
  • [FILL IN] - Service stable

Impact

  • 3 of 12 pods in CrashLoopBackOff
  • Checkout success rate: 99.7% → 96.8%
  • p99 latency degraded

Root Cause

Unbounded Map-based request cache added in v3.7 with no eviction policy. Each request adds an entry, never removed, causing memory exhaustion.

Detection

Detected via OOMKilled pod events, which triggered the pager at 09:14. No alert existed for gradual memory growth.

Response

  • Roll back to v3.6 (planned; time [FILL IN])
  • Increase memory limit temporarily if rollback is delayed

What Went Well

  • Quick detection via pager
  • Heap dump analysis identified leak

What Went Poorly

  • No memory leak detection in staging
  • No eviction policy in cache design
  • No gradual memory usage alert

Action Items

  • Add eviction to cache (TTL or LRU)
  • Add memory usage alert at 80% of limit
  • Add load test with sustained traffic
  • Code review for unbounded data structures
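
To gauge how much warning the proposed 80% alert would have given in this incident, the arithmetic can be sketched as follows. It assumes the linear growth from the incident context (180Mi to 500Mi over 5 hours) continues unchanged; the function name `alertLeadTime` and its parameters are illustrative.

```javascript
// How much warning would an alert at a fraction of the memory limit have given?
// Assumes the linear growth reported in the incident context continues unchanged.
function alertLeadTime({ startMi, ratePerHour, limitMi, alertFraction }) {
  const alertAtMi = limitMi * alertFraction;
  const hoursToAlert = (alertAtMi - startMi) / ratePerHour;
  const hoursToLimit = (limitMi - startMi) / ratePerHour;
  return {
    alertAtMi,
    leadMinutes: Math.round((hoursToLimit - hoursToAlert) * 60),
  };
}

const result = alertLeadTime({
  startMi: 180,
  ratePerHour: (500 - 180) / 5, // 64 Mi/h from the memory trend
  limitMi: 512,
  alertFraction: 0.8,
});
console.log(result); // { alertAtMi: 409.6, leadMinutes: 96 }
```

Under these assumptions the alert would have fired at about 410Mi, around 07:35 UTC, roughly 96 minutes before the limit was reached.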

Follow-ups

  • P0: Add eviction policy to in-process cache (TTL or LRU) and redeploy. (order-svc team)
  • P1: Add alert on pod memory usage >80% of limit. (platform team)
  • P1: Add load test with sustained traffic to detect memory leaks pre-deploy. (QA team)
  • P2: Review changelog for similar unbounded data structures in other services. (SRE team)
  • P2: Consider external cache (Redis) for request caching to avoid in-process memory issues. (order-svc team)
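
One way the sustained-traffic load test could flag a leak is to fit a line through periodic memory samples and fail the run if the slope exceeds a threshold. The sketch below is an assumption about how such a check might look, not an existing test: the function names, the 5Mi/hour threshold, and the sample data are all hypothetical (the leaking series is synthetic, generated from the 180Mi start and 64Mi/hour growth seen in this incident).

```javascript
// Least-squares slope of memory samples, in Mi per hour.
function memorySlope(samples) {
  const n = samples.length;
  const meanT = samples.reduce((sum, p) => sum + p.hours, 0) / n;
  const meanM = samples.reduce((sum, p) => sum + p.memMi, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.hours - meanT) * (p.memMi - meanM);
    den += (p.hours - meanT) ** 2;
  }
  return num / den;
}

// Flag a run whose memory grows faster than the allowed slope.
function looksLikeLeak(samples, maxSlopeMiPerHour) {
  return memorySlope(samples) > maxSlopeMiPerHour;
}

// Synthetic hourly samples: one leaking series, one flat at the 200Mi baseline.
const hoursSampled = [0, 1, 2, 3, 4, 5];
const leaking = hoursSampled.map((h) => ({ hours: h, memMi: 180 + 64 * h }));
const stable = hoursSampled.map((h) => ({ hours: h, memMi: 200 }));

console.log(memorySlope(leaking));      // 64
console.log(looksLikeLeak(leaking, 5)); // true
console.log(looksLikeLeak(stable, 5));  // false
```

A slope check needs a test long enough for growth to stand out from warm-up and garbage-collection noise, so the duration and threshold would have to be tuned against real runs.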