[Eval][v1][en] Order service OOM crashloop following v3.7 deploy

service: order-svc
created: 6/2/2026, 11:51:16 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
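
The two readings above imply a growth rate that can be checked against the observed crash time. A minimal sketch of that arithmetic, assuming growth stays linear between and beyond the two readings (`projectLimitHit` is an illustrative helper, not part of order-svc):

```javascript
// Linear projection from the two readings in the memory trend.
// Assumes growth stays linear; projectLimitHit is a hypothetical helper.
function projectLimitHit({ startMi, endMi, hours, limitMi }) {
  const ratePerHour = (endMi - startMi) / hours;          // Mi per hour
  const hoursToLimit = (limitMi - startMi) / ratePerHour; // hours after the first reading
  return { ratePerHour, hoursToLimit };
}

const { ratePerHour, hoursToLimit } = projectLimitHit({
  startMi: 180, // reading at 04:00 UTC
  endMi: 500,   // reading at 09:00 UTC
  hours: 5,
  limitMi: 512, // container memory limit
});

console.log(ratePerHour);             // 64
console.log(hoursToLimit.toFixed(2)); // 5.19
```

Under this assumption, 5.19 hours after 04:00 UTC is about 09:11 UTC, within a minute of the first OOMKilled pod at 09:12.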

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV2

order-svc pods are being OOMKilled because of a memory leak introduced in v3.7, with 18 restarts/hour across the fleet against a baseline of 0 to 1. The most likely cause is the new in-process request cache (Map-based, no eviction), which grows without bound; a heap profile from one captured pod shows ~340Mi held by that Map. 3 of 12 pods are currently OOMKilled and in CrashLoopBackOff, the HPA is already at its maximum of 12 replicas, and checkout success rate has dropped from 99.7% to 96.8%.

Severity reasoning: User-facing degradation (checkout success rate down about 2.9 percentage points) but not a full outage; revenue impact is likely but limited. Scope: the subset of requests routed to affected pods. Reversibility: a rollback to v3.6 or a cache fix should restore service.

deepseek-chat·prompt v1·output: en·10984ms·1375↑ / 1550↓ tok·$0.00208

Root cause hypotheses

  • high: In-process request cache (Map) introduced in v3.7 has no eviction policy, causing unbounded memory growth.

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID; memory grows linearly from 180Mi to 500Mi over 5 hours post-deploy.

  • medium: fast-json-stringify leaks memory for certain payload patterns.

    Evidence: The changelog mentions the switch to fast-json-stringify, and memory growth begins at deploy time, but there is no direct evidence of a leak in the library. The heap profile attributes the bulk of retained memory to the cache Map instead.

  • low: Node-level memory pressure or a kernel bug is causing OOM kills at lower thresholds.

    Evidence: Pods are killed at ~500Mi, close to their 512Mi limit, which points to the container limit rather than an early node-level kill. No other workloads on the node show similar behavior.
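
To make the high-confidence hypothesis concrete, the sketch below reproduces the pattern described in the changelog: a `Map` that gains one entry per request and never drops any. The handler, payload shape, and names are hypothetical; the actual v3.7 code is not shown in this report. It also assumes request IDs are unique per request, in which case the cache can never produce a hit.

```javascript
// Hypothetical reconstruction of the leak pattern, not the actual v3.7 code.
const requestCache = new Map(); // no size cap, no TTL, never cleared

function handleRequest(requestId, payload) {
  if (requestCache.has(requestId)) {
    return requestCache.get(requestId); // never taken if every request ID is unique
  }
  const response = { requestId, echoed: payload };
  requestCache.set(requestId, response); // retained for the life of the process
  return response;
}

// Simulate traffic: the entry count tracks the total number of requests served.
for (let i = 0; i < 10000; i++) {
  handleRequest(`req-${i}`, { item: i });
}
console.log(requestCache.size); // 10000
```

Because nothing is ever removed, retained memory scales with request count, which would match the linear growth in the memory trend if traffic is roughly constant.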

Investigation checklist

  1. Check memory usage per pod and identify pods with highest memory.
    kubectl top pod -n <namespace> --selector=app=order-svc --sort-by=memory

    Expected: Pods near 500Mi memory usage; running pods show >400Mi.

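  2. Verify the in-process cache is the source of the memory leak by capturing a heap snapshot from a running pod. The command below assumes the Node.js process runs as PID 1 and was started with --heapsnapshot-signal=SIGUSR2.
    kubectl exec <pod> -- kill -USR2 1

    Expected: A .heapsnapshot file is written to the process working directory; copy it out with kubectl cp and open it in Chrome DevTools. It should show a Map holding entries keyed on request ID, consistent with the earlier heap profile. Writing a snapshot needs roughly twice the heap size in memory, so choose a pod well below the 512Mi limit.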

  3. Check if the cache has a size limit or TTL in code.
    grep -r 'Map' /path/to/order-svc/src/ --include='*.js' | grep -i 'cache'

    Expected: No eviction or size limit found; Map is used without cleanup.

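  4. Confirm that the deploy time correlates with the start of memory growth.
    kubectl rollout history deployment/order-svc

    Expected: The latest revision corresponds to the v3.7 rollout. Revision numbers are assigned by Kubernetes and the history carries no timestamps, so match the 04:00 UTC deploy time against the deploy log or the creation time of the newest ReplicaSet.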

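  5. Check whether the installed fast-json-stringify version has known memory issues.
    npm ls fast-json-stringify

    Expected: Shows the version installed in order-svc. Search the library's GitHub issues for memory leak reports against that version; npm audit lists any published advisories.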
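
As a complement to the `kubectl` checks above, the process can report its own memory from inside the pod. The sketch below uses Node's built-in `process.memoryUsage()`; the `memoryReport` helper and the idea of logging it next to the cache size are assumptions, not existing order-svc instrumentation.

```javascript
// Hypothetical helper: summarise this process's memory against a container limit.
const MI = 1024 * 1024;

function memoryReport(limitMi, cacheSize) {
  const { rss, heapUsed } = process.memoryUsage(); // both values are in bytes
  return {
    rssMi: Math.round(rss / MI),
    heapUsedMi: Math.round(heapUsed / MI),
    fractionOfLimit: rss / (limitMi * MI), // cgroup accounting can differ slightly from RSS
    cacheSize, // e.g. the size of the request cache Map, if it is reachable
  };
}

const report = memoryReport(512, 0);
console.log(report);
```

Logging this periodically would show whether `heapUsedMi` and `cacheSize` rise together, which is what the cache hypothesis predicts.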

Mitigation plan

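  • Roll back order-svc to the previous version, v3.6, using 'kubectl rollout undo deployment/order-svc' to stop the memory leak immediately.

    Risk: The rollback may expose incompatibilities with downstream services; performed as a rolling update, it should not cause downtime.

    Rollback: Re-deploy v3.7 (running 'kubectl rollout undo deployment/order-svc' a second time returns to it). This reintroduces the leak until the cache is fixed.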

  • If rollback is not possible, increase the memory limit to 1Gi as a temporary workaround.

    Risk: Higher memory usage may cause node pressure. This delays OOM kills but does not fix the leak.

    Rollback: Revert memory limit to 512Mi after fix.

  • Scale up HPA max replicas to 15 to distribute load while pods restart.

    Risk: More pods may increase database connection pool pressure.

    Rollback: Revert HPA max replicas to 12 after stabilization.
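
How much time the 1Gi workaround buys can be estimated from the memory trend. The sketch assumes a restarted pod starts near 180Mi and keeps growing at the observed rate, which depends on traffic staying similar; `hoursUntilOom` is an illustrative name.

```javascript
// Hypothetical estimate of the time between restarts under a given memory limit.
// Assumes linear growth at the observed rate: (500Mi - 180Mi) / 5h = 64Mi/h.
const START_MI = 180;
const RATE_MI_PER_HOUR = (500 - 180) / 5;

function hoursUntilOom(limitMi) {
  return (limitMi - START_MI) / RATE_MI_PER_HOUR;
}

console.log(hoursUntilOom(512).toFixed(1));  // 5.2
console.log(hoursUntilOom(1024).toFixed(1)); // 13.2
```

On these assumptions, raising the limit to 1Gi stretches the interval between OOM kills from about 5 hours to about 13 hours per pod. It does not remove them.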

Customer impact

Approximately 3% of checkout requests are timing out or failing. Users may see errors during payment processing. Service is partially degraded; no data loss expected.

Postmortem draft

Postmortem: order-svc OOMKilled due to memory leak

Summary

  • Date: YYYY-MM-DD
  • Duration: 09:12 UTC - resolved at XX:XX UTC
  • Severity: SEV2

Timeline

  • 04:00 UTC: Deploy v3.7 (in-process cache, JSON parser change)
  • 09:12 UTC: First OOMKilled pod
  • 09:14 UTC: Pager triggered
  • 09:16 UTC: On-call confirms the service is still serving most traffic; checkout success rate at 96.8%
  • XX:XX UTC: Rollback initiated
  • XX:XX UTC: Service stable

Impact

  • About 3% of checkout requests failing or timing out
  • Checkout success rate: 99.7% → 96.8%
  • 3 of 12 pods in CrashLoopBackOff

Root Cause

In-process request cache (Map) with no eviction policy introduced in v3.7 caused unbounded memory growth.

What Went Well

  • Pager fired within 2 minutes of the first OOMKilled pod
  • Heap profile analysis identified the likely root cause

What Went Poorly

  • Code review did not catch the missing eviction logic
  • No memory leak tests in CI

Action Items

  • [ ] Add eviction policy to cache (TTL or LRU)
  • [ ] Add memory leak detection tests
  • [ ] Review code for similar patterns
  • [ ] Add memory usage alerting
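
The first action item could take roughly the following shape. This is a minimal sketch of a cache with both an LRU size cap and a TTL, not the fix the team has chosen; the class name, option names, and the limits used are all hypothetical.

```javascript
// Hypothetical bounded replacement for the request cache: LRU eviction plus TTL.
class BoundedCache {
  constructor({ maxEntries, ttlMs, now = Date.now }) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;       // injectable clock, useful in tests
    this.map = new Map(); // a Map iterates in insertion order, oldest first
  }

  get(key) {
    const entry = this.map.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.map.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.map.delete(key); // an updated key moves to the newest position
    this.map.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.map.size > this.maxEntries) {
      // Evict the least recently used entry (first key in iteration order).
      const oldestKey = this.map.keys().next().value;
      this.map.delete(oldestKey);
    }
  }

  get size() {
    return this.map.size;
  }
}

const cache = new BoundedCache({ maxEntries: 1000, ttlMs: 60_000 });
for (let i = 0; i < 10000; i++) {
  cache.set(`req-${i}`, { item: i });
}
console.log(cache.size); // 1000
```

Expired entries are only removed when they are read, but the size cap bounds memory regardless. The cap and TTL here are placeholders and would need to be sized from the per-entry footprint and the memory budget.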

Blameless Conclusion

This was a process gap: no memory leak testing and insufficient code review for caching logic.

Follow-ups

  • P0: Add eviction policy (TTL or LRU) to the in-process cache in order-svc. (owner: service owner)
  • P1: Add memory leak detection tests (e.g., heap snapshot comparison) to the CI pipeline. (owner: platform team)
  • P1: Review all services for similar unbounded caches. (owner: service owner)
  • P2: Set up memory usage alerting at 80% of limit for all services. (owner: on-call SRE)
  • P2: Update the deployment checklist to include memory profiling before rollout. (owner: platform team)
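
For the 80% alerting follow-up, the threshold can be derived from each service's configured limit. The sketch below is one way to compute it; the helper names are hypothetical, and only the binary suffixes Ki, Mi, and Gi are handled.

```javascript
// Hypothetical helper: compute an alert threshold from a Kubernetes memory quantity.
// Only the binary suffixes Ki, Mi, and Gi are handled in this sketch.
const UNITS = { Ki: 1024, Mi: 1024 ** 2, Gi: 1024 ** 3 };

function parseMemoryQuantity(quantity) {
  const match = /^(\d+)(Ki|Mi|Gi)$/.exec(quantity);
  if (!match) throw new Error(`unsupported quantity: ${quantity}`);
  return Number(match[1]) * UNITS[match[2]];
}

function alertThresholdBytes(limit, fraction = 0.8) {
  return Math.floor(parseMemoryQuantity(limit) * fraction);
}

const thresholdBytes = alertThresholdBytes("512Mi");
console.log(thresholdBytes);                         // 429496729
console.log((thresholdBytes / UNITS.Mi).toFixed(1)); // 409.6
```

With the growth seen in this incident (64Mi/hour from 180Mi), a 409.6Mi threshold would have been crossed at roughly 07:35 UTC, about an hour and a half before the first OOMKilled pod.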