← All incidents

[Eval][v2][en] Order service OOM crashloop following v3.7 deploy

service: order-svccreated: 5/25/2026, 10:21:02 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV1

order-svc pods are being OOMKilled every ~20 minutes due to a memory leak introduced in v3.7. The leak is caused by an in-process request cache (Map-based, no eviction) that grows unboundedly. 3 of 12 pods are in CrashLoopBackOff, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%. Immediate mitigation is to roll back the deployment or reduce memory pressure.

Severity reasoning: User-facing outage: error rate > 1% (3% requests timing out, checkout success rate 96.8% < 99%) for >5 min, revenue path broken (checkout failures), and risk of full outage if remaining pods also OOM.

deepseek-chat·prompt v2·output: en·11936ms

Root cause hypotheses

  • highIn-process request cache (Map-based, no eviction) introduced in v3.7 causes unbounded memory growth.

    Evidence: Heap profile from captured pod shows ~340Mi held by Map keyed on request ID, never cleared. Memory trend shows linear growth from 180Mi to 500Mi over 5 hours, then crash.

  • mediumMemory limit (512Mi) is too low for the new JSON parser (fast-json-stringify) which may allocate more memory per request.

    Evidence: Memory baseline was stable at 200Mi before v3.7. After deploy, memory grows to 500Mi before OOM. However, heap profile points to Map, not parser.

  • lowA sudden traffic spike increased concurrent requests, causing the cache to grow faster.

    Evidence: No traffic spike reported; HPA at max replicas suggests sustained load but not a spike. Memory growth is linear over hours, not stepwise.

Investigation checklist

  1. Check memory usage of running pods to confirm leak pattern.
    kubectl top pod -n prod -l app=order-svc --sort-by=memory --no-headers | head -10

    Expected: Memory usage near 500Mi for pods running longest, lower for recently restarted.

  2. Inspect heap dump from a captured pod to identify largest allocations.
    kubectl exec -n prod -c order-svc <pod-name> -- jcmd 1 GC.heap_dump /tmp/heap.hprof && kubectl cp -n prod <pod-name>:/tmp/heap.hprof ./heap.hprof && jhat heap.hprof

    Expected: Map entries (request ID keys) consuming ~340Mi.

  3. Check if the cache has any eviction or TTL logic in code.
    grep -r 'evict\|TTL\|expire\|clear' /deploy/order-svc-v3.7/src/cache.js

    Expected: No eviction logic found.

  4. Verify that the JSON parser change is not the primary cause by checking memory allocation per request.
    kubectl exec -n prod -c order-svc <pod-name> -- curl -s localhost:8080/debug/vars | jq '.memstats.HeapAlloc'

    Expected: HeapAlloc grows over time, not per-request spike.

  5. Check if there is a memory leak in the JSON parser by comparing before/after deploy.
    kubectl logs -n prod -l app=order-svc --since=6h | grep -i 'json\|parse' | tail -20

    Expected: No error logs; parser is not the source.

Mitigation plan

  • Roll back order-svc to v3.6 (previous stable version) to remove the memory leak.

    Risk: Rollback may cause brief downtime during pod restart. Ensure traffic drains properly.

    Rollback: Re-deploy v3.7 by running: kubectl rollout undo deployment/order-svc -n prod

  • If rollback is not immediately possible, increase memory limit to 1Gi to buy time.

    Risk: Higher limit may mask the leak and cause node memory pressure. Monitor node memory.

    Rollback: Revert memory limit to 512Mi: kubectl set resources deployment/order-svc -n prod --limits=memory=512Mi

  • Scale down HPA max replicas to 8 to reduce total memory consumption.

    Risk: May increase latency under load; could cause overload on remaining pods.

    Rollback: Set HPA max back to 12: kubectl autoscale deployment order-svc -n prod --max=12

Customer impact

Approximately 25% of users may experience timeouts or errors during checkout. Checkout success rate dropped from 99.7% to 96.8%. Affected users see 'Something went wrong' or slow page loads. No data loss expected.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 04:00 - order-svc v3.7 deployed
  • 09:12 - First OOMKilled pod
  • 09:14 - Pager triggered
  • 09:16 - Incident declared, checkout success rate at 96.8%
  • [FILL IN] - Rollback initiated
  • [FILL IN] - All pods stable

Impact

  • 3 of 12 pods in CrashLoopBackOff
  • p99 latency degraded
  • Checkout success rate dropped from 99.7% to 96.8%
  • ~3% requests timing out

Root Cause

Memory leak in in-process request cache (Map-based, no eviction) introduced in v3.7. Cache grows unboundedly until OOM.

Detection

Pager alert for OOMKilled pods. No prior warning because memory growth was gradual.

Response

  • Rolled back to v3.6
  • Increased memory limit temporarily
  • Monitored memory usage

What Went Well

  • Quick rollback decision
  • Heap dump analysis identified root cause

What Went Poorly

  • No memory limit monitoring before deploy
  • Cache was not reviewed for eviction
  • No canary deployment

Action Items

  • [FILL IN]

Follow-ups

  • P1Add memory usage alert for order-svc pods at 80% of limit.on-call SRE
  • P1Implement eviction policy (TTL or LRU) for in-process cache.service owner
  • P2Add canary deployment for future releases.platform team
  • P2Review all in-process caches across services for unbounded growth.service owner
  • P2Add heap dump collection on OOM to automate root cause analysis.platform team