
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy

service: order-svc
created: 6/2/2026, 11:36:32 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV2

order-svc pods are being OOM-killed because of a memory leak introduced in v3.7. A new in-process request cache (Map-based, no eviction) holds ~340Mi of heap, pushing pods to the 512Mi limit roughly 5 hours after the deploy. 3 of 12 pods are currently OOMKilled and entering CrashLoopBackOff, restarts have risen to 18/hour (baseline 0–1), and checkout success rate has dipped from 99.7% to 96.8%.

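Severity reasoning: Significant degradation for a subset of users (checkout success rate down about 2.9 points, from 99.7% to 96.8%) but not a full outage, since 9 of 12 pods are still serving traffic. Revenue impact is possible but not confirmed. The raw context reports impact only for order-svc; other services have not been checked.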

deepseek-chat·prompt v1·output: en·9938ms·1375↑ / 1474↓ tok·$0.00199

Root cause hypotheses

  • high: Memory leak from unbounded in-process request cache introduced in v3.7

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Changelog confirms 'added in-process request cache (Map-based, no eviction)'.

  • medium: Increased request rate causing cache to grow faster than expected

    Evidence: Memory growth linear from 180Mi to 500Mi over 5 hours; pre-deploy baseline stable at 200Mi. Request rate may have increased, but cache has no eviction so it grows unbounded regardless.

  • low: Memory fragmentation due to fast-json-stringify library

    Evidence: Changelog mentions switching JSON parser to fast-json-stringify. However, heap profile points to Map, not JSON buffers.
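
The linear-growth evidence above can be checked with a little arithmetic. The sketch below assumes growth stays linear between the two data points in the memory trend (180Mi at 04:00 UTC, 500Mi at 09:00 UTC); the function names are illustrative and not part of order-svc. Under that assumption a pod reaches the 512Mi limit at about 09:11 UTC, which is consistent with the first OOMKilled pod at 09:12, and a 1Gi limit would only move the crash to roughly 13 hours after pod start.

```javascript
// Linear projection of pod memory from the two points in the memory trend.
// Assumption: growth stays linear; real growth depends on request rate.
const MI_START = 180;      // Mi at 04:00 UTC (deploy time)
const MI_END = 500;        // Mi at 09:00 UTC
const HOURS_OBSERVED = 5;  // 04:00 -> 09:00

const growthPerHour = (MI_END - MI_START) / HOURS_OBSERVED; // 64 Mi/h

// Hours after deploy until memory reaches limitMi.
function hoursToLimit(limitMi) {
  return (limitMi - MI_START) / growthPerHour;
}

// Format an offset from 04:00 UTC as HH:MM (wraps past midnight).
function clockAfterDeploy(hours) {
  const totalMinutes = Math.round((4 + hours) * 60) % (24 * 60);
  const hh = String(Math.floor(totalMinutes / 60)).padStart(2, "0");
  const mm = String(totalMinutes % 60).padStart(2, "0");
  return `${hh}:${mm}`;
}

console.log(growthPerHour);                        // 64
console.log(clockAfterDeploy(hoursToLimit(512)));  // 09:11
console.log(hoursToLimit(1024).toFixed(1));        // 13.2
```

Container memory includes more than the JavaScript heap, so the projection is a rough consistency check rather than a prediction.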

Investigation checklist

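  1. Confirm the cache is the primary memory consumer by examining a heap snapshot from a running pod
    kubectl exec -n <namespace> <pod-name> -- kill -USR2 1

    Assumes the service is Node.js (as fast-json-stringify and the *.js sources suggest), runs as PID 1, and was started with --heapsnapshot-signal=SIGUSR2. The .heapsnapshot file is written to the process working directory and can be copied out with kubectl cp.

    Expected: Heap snapshot shows Map with ~340Mi retained; no other large objects.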

  2. Check if the cache has any eviction or size limit in code
    grep -r 'Map' --include='*.js' ./order-svc/ | grep -i 'cache'

    Expected: No eviction logic found; Map is used without size limit or TTL.

  3. Verify memory limit and request settings for pods
    kubectl describe pod -n <namespace> -l app=order-svc | grep -A5 'Limits'

    Expected: Memory limit: 512Mi, request: 256Mi.

  4. Check if HPA is at max replicas and if scaling would help
    kubectl get hpa -n <namespace> order-svc

    Expected: Current replicas: 12/12 (max). Scaling won't help because each pod leaks memory.

Mitigation plan

  • Rollback order-svc to v3.6 (pre-cache version) to stop memory leak immediately

    Risk: No request cache until a fixed version ships; latency may increase slightly, but the OOM kills stop. v3.6 was stable at ~200Mi before the deploy, so the rollback itself is low risk.

    Rollback: Re-deploy v3.7 only if v3.6 causes a worse problem, which is unlikely given its prior stability.

  • If rollback not possible, increase memory limit to 1Gi as temporary workaround

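    Risk: Higher resource usage; may cause node pressure. This only buys time until a fix is deployed: at the observed ~64Mi/hour growth, a 1Gi limit delays the OOM to roughly 13 hours after pod start rather than preventing it.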

    Rollback: Revert limit to 512Mi after cache fix is deployed.

  • Add eviction policy to cache (e.g., TTL or LRU) and deploy v3.8

    Risk: Cache miss rate may increase; but prevents unbounded growth.

    Rollback: Revert to v3.7 if cache eviction causes performance regression.
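
As an illustration of the eviction policy proposed above, here is a minimal sketch of a cache that combines a TTL with an LRU size cap. It is not the order-svc implementation: the class name `BoundedTtlCache`, its options, and the example values (1000 entries, 60 seconds) are hypothetical and would need tuning against real traffic. It relies on the fact that a JavaScript `Map` iterates keys in insertion order, so the first key is the least recently used one.

```javascript
// Hypothetical bounded replacement for an unbounded Map-based cache.
// Entries expire after ttlMs, and the least recently used entry is
// evicted once maxEntries is reached.
class BoundedTtlCache {
  constructor({ maxEntries, ttlMs, now = Date.now }) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for tests
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key); // refresh position if the key already exists
    if (this.entries.size >= this.maxEntries) {
      // Map keeps insertion order, so the first key is the oldest.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get size() {
    return this.entries.size;
  }
}

// Usage: size never exceeds maxEntries, however many request IDs arrive.
const cache = new BoundedTtlCache({ maxEntries: 1000, ttlMs: 60_000 });
for (let i = 0; i < 5000; i++) cache.set(`req-${i}`, { status: "ok" });
console.log(cache.size); // 1000
```

Expired entries that are never read again stay in the map until the size cap pushes them out, so worst-case memory is bounded by `maxEntries` rather than by the TTL alone.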

Customer impact

Checkout success rate has dipped from 99.7% to 96.8%, so roughly 3 in 100 checkout attempts are currently failing. Users may experience slow or failed checkout. No data loss. ETA for full recovery: 30 minutes after rollback.

Postmortem draft

Postmortem: order-svc OOMKilled due to memory leak

Summary

  • Date: YYYY-MM-DD
  • Duration: 09:12 UTC – 09:42 UTC (30 min)
  • Severity: SEV2
  • Impact: Checkout success rate dipped from 99.7% to 96.8%

Timeline

  • 04:00 UTC: Deploy v3.7 (cache added)
  • 09:12 UTC: First OOMKilled pod
  • 09:14 UTC: Pager triggered
  • 09:16 UTC: Checkout success rate at 96.8% (down from 99.7%); remaining pods still serving most traffic
  • 09:20 UTC: Rollback initiated
  • 09:42 UTC: All pods stable, metrics normal

Root Cause

Unbounded in-process request cache (Map) introduced in v3.7 without eviction. Cache grew until memory limit exceeded, causing OOM.

What Went Well

  • Quick detection via pager
  • Heap dump available
  • Rollback was fast

What Went Poorly

  • Code review did not catch the missing eviction
  • No memory limit testing before deploy
  • HPA max replicas masked issue

Action Items

  • [ ] Add TTL eviction to cache (P0)
  • [ ] Add memory usage alert for order-svc (P1)
  • [ ] Add load test with memory profiling to CI (P2)

Follow-ups

  • P0: Add TTL eviction to in-process cache and deploy fix (owner: service owner)
  • P1: Add memory usage alert (e.g., >80% limit) for order-svc (owner: on-call SRE)
  • P2: Add load test with memory profiling to CI pipeline (owner: platform team)
  • P2: Review code review process for missing eviction logic (owner: service owner)
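
To gauge how much warning the proposed ">80% of limit" alert would have given, the sketch below applies that threshold to the memory trend from this incident (180Mi at 04:00 UTC, 500Mi at 09:00 UTC, 512Mi limit). It assumes linear growth, and the function name `exceedsAlertThreshold` is hypothetical; a real alert would live in the monitoring system rather than in application code.

```javascript
// Sketch of the proposed ">80% of limit" check, using figures from this incident.
const LIMIT_MI = 512;
const ALERT_FRACTION = 0.8;

// True when usage is above the alert fraction of the memory limit.
function exceedsAlertThreshold(usageMi, limitMi = LIMIT_MI, fraction = ALERT_FRACTION) {
  return usageMi > limitMi * fraction;
}

// Memory trend from the incident, assumed linear.
const ratePerHour = (500 - 180) / 5;                     // 64 Mi/h
const thresholdMi = LIMIT_MI * ALERT_FRACTION;           // 409.6 Mi
const hoursToAlert = (thresholdMi - 180) / ratePerHour;  // hours after deploy
const hoursToOom = (LIMIT_MI - 180) / ratePerHour;       // hours after deploy
const leadMinutes = Math.round((hoursToOom - hoursToAlert) * 60);

console.log(exceedsAlertThreshold(200)); // false (pre-deploy baseline)
console.log(exceedsAlertThreshold(450)); // true
console.log(leadMinutes);                // 96
```

Under these assumptions the alert would have fired about an hour and a half before the first OOM kill, instead of the two minutes between the first OOMKilled pod and the pager.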