
[Eval][v3][en] Order service OOM crashloop following v3.7 deploy

service: order-svc
created: 6/2/2026, 11:37:29 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV1

order-svc pods are being OOMKilled due to a memory leak introduced in v3.7, at roughly 18 restarts per hour across the deployment (baseline 0–1). A new in-process request cache (Map-based, no eviction) is holding ~340Mi of heap, pushing pods past the 512Mi memory limit. 3 of 12 pods are OOMKilled and crash-looping, the HPA is already at its maximum of 12 replicas, and checkout success rate dropped from 99.7% to 96.8%.

Severity reasoning: User-facing degradation on the revenue path: checkout success rate dropped from 99.7% to 96.8% (error rate >1% for >5 min), and the HPA is already at max replicas, so no spare capacity remains if more pods crash. This meets SEV1 criteria.

deepseek-chat·prompt v3·output: en·10290ms·2007↑ / 1580↓ tok·$0.00228

Root cause hypotheses

  • high: Memory leak from in-process request cache introduced in v3.7 (Map-based, no eviction).

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crashes. Changelog confirms cache added at 04:00 UTC.

  • medium: Increased traffic causing more cache entries, exacerbating leak.

    Evidence: Weak. Memory growth tracks time since deploy rather than traffic, and no traffic surge was noted. At most a secondary factor.

  • low: Memory limit too low for new cache overhead.

    Evidence: Pre-deploy baseline was stable at 200Mi. The new cache adds ~340Mi on top of that (~540Mi total), which exceeds the 512Mi limit. However, an unbounded cache would eventually outgrow any limit; the missing eviction is the primary issue.
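
The growth figures in the evidence above can be turned into a rough time-to-OOM estimate. The following is a minimal sketch that assumes growth stays linear at the rate implied by the two reported data points (180Mi at 04:00 UTC, 500Mi at 09:00 UTC); the helper name hoursToLimit is illustrative.

```javascript
// Figures come from the memory trend in the raw incident context.
// The strictly linear growth model is an assumption.
const startMi = 180;     // usage at 04:00 UTC, right after the deploy
const observedMi = 500;  // usage at 09:00 UTC
const elapsedHours = 5;  // 04:00 -> 09:00

// Average growth rate in Mi per hour.
const growthMiPerHour = (observedMi - startMi) / elapsedHours;

// Hours for a freshly started pod to climb from startMi to a given limit.
function hoursToLimit(limitMi) {
  return (limitMi - startMi) / growthMiPerHour;
}

console.log(growthMiPerHour);                // 64
console.log(hoursToLimit(512).toFixed(1));   // 5.2
console.log(hoursToLimit(1024).toFixed(1));  // 13.2
```

Under this model a pod started at 04:00 reaches 512Mi after about 5.2 hours, which is consistent with the first OOMKilled pod at 09:12. It also suggests that a larger limit only postpones the crash.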

Investigation checklist

  1. Confirm memory leak by capturing a heap snapshot from a running pod. This assumes the Node.js process runs as PID 1 and was started with --heapsnapshot-signal=SIGUSR2; the snapshot is written to the process working directory and can then be copied out with kubectl cp.
    POD=$(kubectl get pod -n prod -l app=order-svc -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -n prod "$POD" -- kill -USR2 1

    Expected: Heap snapshot shows a Map retaining a large number of request-ID keys that are never removed.

  2. Check memory usage trend for all pods.
    kubectl top pod -n prod -l app=order-svc --sort-by=memory

    Expected: Running pods show memory usage near 500Mi, increasing over time.

  3. Verify cache implementation in v3.7 source code.
    kubectl exec -n prod deploy/order-svc -- grep -n -A 20 'cache' /app/app.js

    Expected: Map-based cache without eviction or size limit.

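  4. Check if rolling back to v3.6 resolves the issue. Confirm in the rollout history that revision 2 is v3.6 before running the undo.
    kubectl rollout history deployment/order-svc -n prod
    kubectl rollout undo deployment/order-svc -n prod --to-revision=2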

    Expected: Pods stop OOMKilling, memory stabilizes at ~200Mi.

Mitigation plan

  • Rollback order-svc to v3.6 (previous stable version).

    Risk: Brief period of mixed versions during rollout; potential cache invalidation issues. Low risk.

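    Rollback: Re-apply v3.7 deployment, bearing in mind that this reintroduces the leak: kubectl rollout undo deployment/order-svc -n prod --to-revision=3 (confirm the revision number with kubectl rollout history first).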

  • If rollback not possible, increase memory limit to 1Gi temporarily.

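    Risk: Only delays the OOM because the cache is unbounded. At the observed growth of roughly 64Mi/hour (180Mi to 500Mi in 5 hours), the extra 512Mi buys about 8 more hours per pod. Could also cause node memory pressure. Higher blast radius.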

    Rollback: Revert limit to 512Mi: kubectl set resources deployment/order-svc -n prod --limits=memory=512Mi

  • Add eviction policy to cache (TTL or LRU) and redeploy as v3.8.

    Risk: Requires code change and deployment; time to implement.

    Rollback: Rollback to v3.6 if v3.8 fails.
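
The eviction fix in the last mitigation item could look roughly like the following. This is a minimal sketch, not the order-svc implementation: the class name BoundedCache, the default limits, and the injectable clock are all hypothetical. It combines an LRU size cap with a TTL, relying on the fact that a JavaScript Map iterates in insertion order.

```javascript
// Hypothetical replacement for the unbounded Map; names and limits are illustrative.
class BoundedCache {
  constructor({ maxEntries = 10000, ttlMs = 60000, now = Date.now } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;            // injectable clock, useful in tests
    this.entries = new Map();  // Map iterates in insertion order, oldest first
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() - entry.storedAt >= this.ttlMs) {
      this.entries.delete(key);  // expired: drop it rather than serve stale data
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);  // refresh position if the key already exists
    this.entries.set(key, { value, storedAt: this.now() });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry (first key in iteration order).
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  get size() {
    return this.entries.size;
  }
}

const cache = new BoundedCache({ maxEntries: 2, ttlMs: 1000 });
cache.set('req-1', 'a');
cache.set('req-2', 'b');
cache.get('req-1');       // touch req-1 so req-2 becomes the eviction candidate
cache.set('req-3', 'c');  // exceeds maxEntries, evicts req-2
console.log(cache.size, cache.get('req-2'));  // 2 undefined
```

With a fixed maxEntries, cache memory is bounded by the entry count times the average entry size, so the limit can be chosen to fit within the pod's 512Mi budget.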

Customer impact

Approximately 3% of checkout requests are failing (success rate 96.8% vs. 99.7% baseline). Users may see errors during checkout. The issue started at 09:12 UTC with the first OOMKilled pod and is ongoing.

Postmortem draft

Summary

order-svc experienced OOMKilled crashes due to a memory leak from an unbounded in-process request cache introduced in v3.7. Checkout success rate dropped from 99.7% to 96.8%.

Timeline (UTC)

  • 04:00 - v3.7 deployed with new cache (rolling, 100% complete by 04:08)
  • 09:12 - First OOMKilled pod
  • 09:14 - Pager triggered
  • 09:16 - Checkout success rate at 96.8%; remaining pods still serving most traffic
  • [FILL IN] - Incident declared
  • [FILL IN] - Rollback initiated
  • [FILL IN] - Service stabilized

Impact

  • 3 of 12 pods in CrashLoopBackOff
  • Checkout success rate: 96.8% (vs 99.7% baseline)
  • [FILL IN] p99 latency impact

Root Cause

Unbounded Map-based request cache without eviction, causing memory to grow linearly until OOM.

Detection

Alerted by OOMKilled pod events, with the pager firing 2 minutes after the first crash. No alert fired during the preceding 5 hours of linear memory growth.
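
One way the linear growth might have been caught earlier is by extrapolating the trend instead of waiting for a crash. The sketch below assumes periodic per-pod memory samples are available. The function name hoursUntilLimit is hypothetical, and the sample values are synthetic points on the 64Mi/hour line implied by the two reported readings, not real measurements.

```javascript
// Least-squares slope over memory samples, then projected hours until the limit.
// samples: [{ hour: number, mi: number }, ...], ordered by time.
function hoursUntilLimit(samples, limitMi) {
  const n = samples.length;
  const meanX = samples.reduce((sum, p) => sum + p.hour, 0) / n;
  const meanY = samples.reduce((sum, p) => sum + p.mi, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.hour - meanX) * (p.mi - meanY);
    den += (p.hour - meanX) ** 2;
  }
  const slope = num / den;            // Mi per hour
  if (!(slope > 0)) return Infinity;  // flat, shrinking, or not enough data
  const latest = samples[n - 1];
  return Math.max(0, (limitMi - latest.mi) / slope);
}

// Synthetic samples for 04:00, 05:00 and 06:00 UTC (illustrative only).
const samples = [
  { hour: 0, mi: 180 },
  { hour: 1, mi: 244 },
  { hour: 2, mi: 308 },
];
console.log(hoursUntilLimit(samples, 512));  // 3.1875
```

Under these assumptions, an alert evaluated at 06:00 would have projected the limit being hit in about three hours, well before the first OOMKilled pod.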

Response

  • [FILL IN] Rollback to v3.6 (planned primary mitigation)
  • [FILL IN] Temporary memory limit increase to 1Gi (fallback only if rollback is not possible)

What Went Well

  • Quick detection via pod crash alerts
  • [FILL IN] Rollback experience, once the rollback has been executed

What Went Poorly

  • No memory leak detection in staging
  • Cache added without review of memory constraints

Action Items

  • [ ] Add eviction to cache (P0)
  • [ ] Add memory leak detection tests (P1)
  • [ ] Set up memory usage alerts at 80% of limit (P1)

Follow-ups

  • P0: Add eviction policy (TTL/LRU) to in-process cache in order-svc. Owner: service owner
  • P1: Add memory leak detection test to CI pipeline (heap dump comparison). Owner: platform team
  • P1: Set up alert for memory usage >80% of limit on order-svc pods. Owner: on-call SRE
  • P2: Review all in-process caches across services for unbounded growth. Owner: platform team
  • P2: Update deployment checklist to include memory profiling for new features. Owner: service owner
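
The CI leak-detection follow-up could start from something as simple as the sketch below, which compares heap usage before and after driving a handler many times. It is a rough stand-in for a full heap dump comparison. The helper measureHeapGrowth and the leaky handler are hypothetical, and the numbers are noisy unless Node.js is run with --expose-gc.

```javascript
// Hypothetical CI helper: run a handler many times and report heap growth in bytes.
function measureHeapGrowth(handler, iterations) {
  if (global.gc) global.gc();  // only defined when Node.js runs with --expose-gc
  const before = process.memoryUsage().heapUsed;
  for (let i = 0; i < iterations; i++) handler(i);
  if (global.gc) global.gc();
  return process.memoryUsage().heapUsed - before;
}

// Stand-in for the v3.7 behaviour: every request ID is cached and never removed.
const leakyCache = new Map();
const leakyHandler = (i) => leakyCache.set(`req-${i}`, { payload: 'x'.repeat(64) });

const growthBytes = measureHeapGrowth(leakyHandler, 200000);
// With 200,000 retained entries the growth should be well above 1 MiB.
console.log(growthBytes > 1024 * 1024);
```

A CI check might run the real request handler the same way and fail the build when the growth exceeds a budget chosen by the service owner. That budget is a tuning decision and is not specified in this incident.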