
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy

service: order-svc · created: 5/25/2026, 10:20:36 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV2

order-svc pods are being OOM-killed because of a memory leak introduced in v3.7. A new in-process request cache (Map-based, no eviction) holds ~340Mi of heap, pushing pods to the 512Mi limit about 5 hours after the deploy. 3 of 12 pods are currently OOMKilled and restarts have risen to 18/hour (baseline 0–1); checkout success rate has dropped from 99.7% to 96.8%. The incident is ongoing.

Severity reasoning: User-facing degradation (checkout success rate down from 99.7% to 96.8%) but not a full outage; revenue impact possible but limited. Scope: subset of users hitting degraded pods. Reversibility: rollback or fix can restore service.

deepseek-chat·prompt v1·output: en·10555ms

Root cause hypotheses

  • high: In-process request cache introduced in v3.7 has no eviction policy, causing unbounded memory growth.

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID; code change added Map-based cache without eviction.

  • medium: Memory limit (512Mi) is too low for the new cache, but the cache is the primary driver.

    Evidence: Pre-deploy baseline stable at 200Mi; after deploy memory grows linearly to 500Mi then OOM.

  • low: JSON parser switch (fast-json-stringify) causes memory fragmentation or leaks.

    Evidence: No direct evidence; heap profile points to Map, not JSON buffers.

Investigation checklist

  1. Confirm the cache is the source of memory growth by inspecting heap dump from a running pod.
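    kubectl exec -n <namespace> <pod-name> -- kill -USR2 1 && kubectl cp <namespace>/<pod-name>:<snapshot-path> ./order-svc.heapsnapshot
    Note: order-svc appears to be a Node.js service (index.js, fast-json-stringify), so JVM tools such as jcmd and jhat do not apply. This command assumes Node runs as PID 1 and was started with --heapsnapshot-signal=SIGUSR2; the snapshot is written to the process working directory and can be opened in the Chrome DevTools Memory tab. Writing a snapshot needs extra memory, so prefer a pod that is not already close to its limit.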

    Expected: Large Map object (keyed by request ID) consuming >300Mi.

  2. Check if the cache has any eviction or size limit in code.
    kubectl exec -n <namespace> <pod-name> -- grep -nE 'Map|evict|max|limit' /app/index.js

    Expected: No eviction logic found; Map is unbounded.

  3. Verify memory limit and request settings for the deployment.
    kubectl get deployment order-svc -n <namespace> -o json | jq '.spec.template.spec.containers[0].resources'

    Expected: limits.memory: 512Mi, requests.memory: 256Mi

  4. Check if the cache is shared across requests (global variable) or per-request.
    kubectl exec -n <namespace> <pod-name> -- grep -nE 'const cache|let cache|var cache' /app/index.js

    Expected: Global Map variable, not scoped per request.
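
A heap snapshot is a one-off view. To watch the growth over time, the process can also log its own memory next to the cache entry count. The sketch below is a minimal illustration, not code from order-svc: `requestCache`, `sampleMemory`, and `startMemoryLogging` are hypothetical names, and `requestCache` stands in for the Map added in v3.7.

```javascript
// Hypothetical stand-in for the Map-based request cache added in v3.7.
const requestCache = new Map();

const MI = 1024 * 1024;

// Take one sample of process memory alongside the cache entry count.
function sampleMemory(cache) {
  const { rss, heapUsed } = process.memoryUsage();
  return {
    at: new Date().toISOString(),
    rssMi: Math.round(rss / MI),
    heapUsedMi: Math.round(heapUsed / MI),
    cacheEntries: cache.size,
  };
}

// Log one sample per interval. unref() stops the timer from keeping
// the process alive on its own.
function startMemoryLogging(cache, intervalMs = 60_000) {
  const timer = setInterval(() => {
    console.log(JSON.stringify(sampleMemory(cache)));
  }, intervalMs);
  timer.unref();
  return timer;
}
```

If `cacheEntries` and `heapUsedMi` climb together and never drop, that supports the high-confidence hypothesis without needing another heap snapshot.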

Mitigation plan

  • Roll back order-svc to v3.6 (previous stable version) to stop the memory leak immediately.

    Risk: Brief traffic disruption during rollback; no data loss as cache is in-memory only.

    Rollback: If v3.6 causes issues, redeploy v3.7 only as a stopgap, since it reintroduces the leak; otherwise roll forward to v3.7 once the cache fix has landed.

  • If rollback is not possible, temporarily increase the memory limit to 1Gi to buy time.

    Risk: Higher resource usage; may cause node pressure. Does not fix leak.

    Rollback: Revert memory limit to 512Mi after fix is deployed.

  • Disable the cache via feature flag or environment variable if available.

    Risk: May increase latency if cache was reducing load; no data loss.

    Rollback: Re-enable cache after fix.
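
To judge how much time the 1Gi option buys, the memory trend from the raw context can be extrapolated. This is a rough sketch under two assumptions that the incident data does not confirm: growth stays linear at the observed rate, and a fresh pod starts near the 180Mi seen at 04:00 UTC. The actual rate will track request volume.

```javascript
// Figures from the memory trend in the raw incident context.
const startMi = 180;    // at 04:00 UTC (v3.7 deploy)
const endMi = 500;      // at 09:00 UTC
const elapsedHours = 5;

// Observed growth rate: (500 - 180) / 5 = 64 Mi per hour.
const growthMiPerHour = (endMi - startMi) / elapsedHours;

// Hours for a fresh pod to grow from startMi to the given limit,
// assuming the same linear rate continues.
function hoursToLimit(limitMi) {
  return (limitMi - startMi) / growthMiPerHour;
}

console.log(growthMiPerHour);               // 64
console.log(hoursToLimit(512).toFixed(1));  // 5.2
console.log(hoursToLimit(1024).toFixed(1)); // 13.2
```

Under these assumptions, raising the limit to 1Gi extends the time to OOM from about 5 hours to about 13 hours per pod. It delays the crash but does not remove it.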

Customer impact

Approximately 3% of checkout requests are timing out or failing. Users may see errors during checkout. Service is partially degraded; remaining pods handle most traffic.

Postmortem draft

Postmortem: order-svc OOMKilled due to unbounded cache

Summary

order-svc v3.7 introduced an in-process request cache without eviction, causing memory growth and OOM kills.

Timeline

  • 04:00 UTC: v3.7 deployed (rolling; 100% complete by 04:08)
  • 09:12 UTC: first OOMKilled pod
  • 09:14 UTC: pager triggered
  • 09:16 UTC: checkout success rate at 96.8% (down from 99.7%); remaining pods still serving most traffic
  • [Mitigation time]: rollback to v3.6

Impact

  • 3 of 12 pods in CrashLoopBackOff
  • Checkout success rate: 99.7% → 96.8%
  • Restarts: 18/hour (baseline 0–1); HPA at max replicas (12)

Root Cause

Unbounded Map cache in order-svc v3.7 (commit [hash]). No eviction or size limit.

What Went Well

  • Pager fired 2 minutes after the first OOMKilled pod (09:12 → 09:14)
  • Heap dump captured before crash

What Went Poorly

  • Code review did not catch the missing eviction policy
  • No memory regression test before deploy

Action Items

  • [ ] Add eviction policy (TTL or LRU) to cache
  • [ ] Add memory usage alerts at 80% of the container limit
  • [ ] Add load test with sustained traffic
  • [ ] Enforce code review checklist for resource management
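
As an illustration of the first action item, here is a minimal sketch of a Map-based cache with both an entry cap (LRU) and a TTL. It is not the order-svc implementation: `BoundedCache`, `maxEntries`, `ttlMs`, and the default values are hypothetical and would need tuning against real traffic. It relies on the fact that a JavaScript Map iterates in insertion order, so the first key is the least recently used one.

```javascript
// Minimal LRU + TTL cache on top of Map (hypothetical sketch).
class BoundedCache {
  constructor({ maxEntries = 10_000, ttlMs = 60_000, now = Date.now } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for tests
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      // Expired: drop it and report a miss.
      this.entries.delete(key);
      return undefined;
    }
    // Re-insert so the key moves to the most-recently-used position.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // Evict from the front (least recently used) until within the cap.
    while (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  get size() {
    return this.entries.size;
  }
}
```

Expired entries that are never read again stay in the Map until LRU eviction removes them, so `maxEntries` is what bounds memory; the TTL only bounds staleness.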

Follow-ups

  • P0: Add eviction policy (TTL or LRU) to the request cache in order-svc. Owner: service owner
  • P1: Add memory usage alert at 80% of container limit. Owner: on-call SRE
  • P1: Add load test with sustained traffic to detect memory leaks before deploy. Owner: platform team
  • P2: Update code review checklist to include resource management (caches, connections). Owner: service owner
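
For the 80% alert, the threshold on a 512Mi limit is 409.6Mi. Extrapolating the observed trend linearly (180Mi at 04:00 UTC, 500Mi at 09:00 UTC), usage would have crossed that line around 07:35 UTC, roughly an hour and a half before the first OOMKilled pod at 09:12. The sketch below only illustrates the threshold check; `parseMemoryQuantity` and `exceedsThreshold` are hypothetical helpers, and a real alert would normally be defined in the monitoring system rather than in application code.

```javascript
// Parse a Kubernetes binary memory quantity such as "512Mi" or "1Gi"
// into bytes. Only the Ki/Mi/Gi suffixes are handled in this sketch.
function parseMemoryQuantity(quantity) {
  const match = /^(\d+(?:\.\d+)?)(Ki|Mi|Gi)$/.exec(quantity);
  if (!match) throw new Error(`unsupported quantity: ${quantity}`);
  const factor = { Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30 }[match[2]];
  return Number(match[1]) * factor;
}

// True when usage is at or above the given fraction of the limit.
function exceedsThreshold(usage, limit, threshold = 0.8) {
  return parseMemoryQuantity(usage) >= threshold * parseMemoryQuantity(limit);
}

console.log(exceedsThreshold("200Mi", "512Mi")); // false (pre-deploy baseline)
console.log(exceedsThreshold("500Mi", "512Mi")); // true (09:00 UTC reading)
```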