[Eval][v3][en] Order service OOM crashloop following v3.7 deploy

service: order-svc
created: 6/2/2026, 11:54:19 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV1

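order-svc pods are being OOMKilled (about 18 restarts per hour across the fleet, up from a baseline of 0–1) due to a memory leak introduced in v3.7, which added an in-process request cache (Map-based, no eviction) that grows without bound. Memory climbs from about 180Mi after the deploy to the 512Mi limit in roughly five hours, after which pods are killed and enter CrashLoopBackOff. Checkout success rate has dipped from 99.7% to 96.8%. The incident is ongoing with 3 of 12 pods currently down and the HPA already at its maximum of 12 replicas.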

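Severity reasoning: User-facing outage criterion: error rate > 1% for > 5 min. Checkout success rate dropped from 99.7% to 96.8% (about 3.2% of checkouts failing), which exceeds the 1% threshold, and the leak will keep killing pods until it is mitigated. The revenue path (checkout) is degraded, although most traffic is still served by the remaining pods. SEV1 criteria met.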

deepseek-chat·prompt v3·output: en·10271ms·2007↑ / 1506↓ tok·$0.00220

Root cause hypotheses

  • high: In-process request cache (Map-based, no eviction) added in v3.7 leaks memory by never releasing entries.

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Changelog confirms 'added in-process request cache (Map-based, no eviction)'. Memory grows linearly from 180Mi to 500Mi over 5 hours.

  • low: JSON parser switch to fast-json-stringify causes memory fragmentation or retention.

    Evidence: Changelog mentions parser change. However, heap profile points to Map as primary consumer. No direct evidence of parser leak.

  • low: External dependency (e.g., database connection pool) not releasing resources, causing memory growth.

    Evidence: No evidence of connection pool issues in logs. Memory growth correlates with deployment time, not external factors.
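
As a rough cross-check of the leading hypothesis, the memory trend can be extrapolated to the pod limit. The sketch below assumes growth stays linear and uses only the figures from the raw incident context; the helper name toClock is hypothetical.

```typescript
// Figures from the incident context; the linear growth model is an assumption.
const startMi = 180;     // memory at 04:00 UTC, right after the deploy
const observedMi = 500;  // memory at 09:00 UTC
const elapsedHours = 5;  // 04:00 -> 09:00
const limitMi = 512;     // pod memory limit

// Average growth rate in Mi per hour.
const rateMiPerHour = (observedMi - startMi) / elapsedHours;

// Hours after the deploy at which a pod started at 04:00 reaches the limit.
const hoursToLimit = (limitMi - startMi) / rateMiPerHour;

// Hypothetical helper: convert hours-after-deploy to an HH:MM UTC string.
function toClock(hoursAfterDeploy: number, deployHourUtc: number = 4): string {
  const totalMinutes = Math.round((deployHourUtc + hoursAfterDeploy) * 60);
  const hh = String(Math.floor(totalMinutes / 60) % 24).padStart(2, "0");
  const mm = String(totalMinutes % 60).padStart(2, "0");
  return `${hh}:${mm}`;
}

console.log(rateMiPerHour);         // 64
console.log(hoursToLimit);          // 5.1875
console.log(toClock(hoursToLimit)); // 09:11
```

Under these assumptions the first pods would reach 512Mi at about 09:11 UTC, which lines up with the first OOMKilled pod at 09:12. Pods replaced later in the 04:00–04:08 rollout would be expected to follow a few minutes behind.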

Investigation checklist

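  1. Confirm memory leak source by examining a heap snapshot from a running pod. fast-json-stringify is a Node.js library and the leak is in a Map, which suggests a Node.js service, so the JVM tool jcmd would not apply. The command below assumes the Node.js process is PID 1 and was started with --heapsnapshot-signal=SIGUSR2; writing a snapshot needs extra memory and may itself push a nearly full pod over its limit.
    kubectl exec -n prod order-svc-pod-name -- kill -USR2 1

    Expected: A .heapsnapshot file is written to the process working directory. After copying it out with kubectl cp and opening it, a Map keyed on request ID retains most of the heap (~340Mi in the profile already captured).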

  2. Check memory usage trend across all pods to confirm leak is widespread.
    kubectl top pod -n prod -l app=order-svc --sort-by=memory | head -20

    Expected: Pods that have not restarted since the deploy sit close to the 512Mi limit; recently restarted pods show lower usage that climbs again on repeated checks.

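  3. Verify that the cache is the only significant memory consumer by analyzing the heap capture from step 1.
    Open the heap capture in a heap analysis tool (for a Node.js .heapsnapshot, the Chrome DevTools Memory tab) and sort objects by retained size.

    Expected: The request cache Map and its entries dominate retained memory; no other object group shows comparable growth.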

  4. Check if there is a feature flag to disable the cache.
    kubectl get configmap order-svc-config -n prod -o yaml | grep -i cache

    Expected: Config may have a flag like 'cache.enabled: true'.

Mitigation plan

  • Immediately disable the in-process cache via feature flag or environment variable, then restart pods.

    Risk: Latency may increase because requests no longer hit the cache, but memory growth stops. No data loss expected, since the cache is in-process only.

    Rollback: Re-enable cache by reverting config change and restarting pods.

  • If no feature flag, roll back order-svc to v3.6 (previous version) using kubectl rollout undo.

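    Risk: A rolling rollback briefly reduces capacity while pods are replaced, at a time when 3 of 12 pods are already down. It also reverts the fast-json-stringify parser change shipped in v3.7.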

    Rollback: Re-apply v3.7 deployment if needed.

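  • Increase memory limit to 1Gi as a temporary buffer to reduce crash frequency. At the observed growth rate of about 64Mi/hour, this extends a pod's time to OOM from roughly 5 hours to roughly 13 hours; it does not stop the leak.

    Risk: May mask the leak and increase resource usage. Could cause node memory pressure.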

    Rollback: Revert memory limit to 512Mi.

Customer impact

Checkout success rate has dropped from 99.7% to 96.8%, so roughly 3% of checkout attempts are failing for users trying to place orders. The service is partially degraded, with most traffic still served by the remaining pods. No data loss is indicated by the incident data.

Postmortem draft

Postmortem: order-svc OOMKilled due to memory leak

Summary

On [DATE], order-svc v3.7 was deployed at 04:00 UTC, introducing an in-process request cache without eviction. This caused a memory leak leading to OOMKilled pods starting at 09:12 UTC. The incident was mitigated by [MITIGATION].

Timeline (UTC)

  • 04:00 - v3.7 deployed
  • 09:12 - First OOMKilled pod
  • 09:14 - Pager triggered
  • 09:16 - Checkout success rate observed at 96.8% (down from 99.7%)
  • [FILL IN] - Mitigation applied
  • [FILL IN] - All pods stable

Impact

  • Checkout success rate fell from 99.7% to 96.8%
  • 3 of 12 pods were OOMKilled at peak
  • [FILL IN] latency impact (no latency data in the incident context)

Root Cause

A Map-based request cache added in v3.7 had no eviction policy, causing unbounded memory growth.

Detection

The pager fired at 09:14, two minutes after the first OOMKilled pod at 09:12. Memory had been growing linearly for about five hours before that, but nothing alerted on the trend.
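
To illustrate how much warning a memory alert could have given, here is a rough sketch. It assumes linear growth at the rate seen in the memory trend and an alert threshold at 80% of the pod limit, as proposed in the action items; the helper name hoursUntil is hypothetical.

```typescript
// Hypothetical helper: hours until a linearly growing value crosses a threshold.
function hoursUntil(thresholdMi: number, currentMi: number, ratePerHour: number): number {
  if (ratePerHour <= 0) return Infinity; // not growing: threshold is never reached
  return Math.max(0, (thresholdMi - currentMi) / ratePerHour);
}

const podLimitMi = 512;                    // pod memory limit
const alertThresholdMi = 0.8 * podLimitMi; // 409.6Mi, the proposed alert level
const leakRateMiPerHour = (500 - 180) / 5; // 64Mi/hour from the memory trend
const postDeployMi = 180;                  // memory at 04:00 UTC

const alertAtHours = hoursUntil(alertThresholdMi, postDeployMi, leakRateMiPerHour);
const oomAtHours = hoursUntil(podLimitMi, postDeployMi, leakRateMiPerHour);
const leadMinutes = (oomAtHours - alertAtHours) * 60;

console.log(alertAtHours.toFixed(2)); // 3.59 hours after deploy, about 07:35 UTC
console.log(Math.round(leadMinutes)); // 96
```

Under these assumptions an 80% alert would have fired around 07:35 UTC, roughly an hour and a half before the first OOMKilled pod.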

Response

  • [FILL IN] steps taken

What Went Well

  • Pager fired within 2 minutes of the first OOMKilled pod
  • Heap profile from a captured pod identified the root cause

What Went Poorly

  • No alerting on memory usage approaching the pod limit before the crash
  • Cache added without review of memory implications

Action Items

  • Add eviction policy to cache (LRU or TTL)
  • Set up memory usage alerts at 80% of limit
  • Add load test with sustained traffic before deploying caching changes
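
The first action item could look roughly like the sketch below: a bounded cache that evicts by both age (TTL) and size (least recently used first). This is a minimal illustration, not the actual order-svc code; the class name BoundedCache, its parameters, and the limits used in the example are all hypothetical.

```typescript
// Hypothetical bounded replacement for an unbounded Map-based cache.
class BoundedCache<K, V> {
  private entries = new Map<K, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  // `now` is an injectable clock so expiry can be tested deterministically.
  constructor(maxEntries: number, ttlMs: number, now: () => number = Date.now) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    // Re-insert so the key moves to the most-recently-used position.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key); // refresh position if the key already exists
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // Map iterates in insertion order, so the first key is the least recently used.
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Example: capacity of 2 entries, 60-second TTL.
const cache = new BoundedCache<string, string>(2, 60_000);
cache.set("req-1", "a");
cache.set("req-2", "b");
cache.get("req-1");      // touch req-1 so req-2 becomes the oldest
cache.set("req-3", "c"); // over capacity: evicts req-2
console.log(cache.size, cache.get("req-2")); // 2 undefined
```

With a size cap, worst-case cache memory is bounded by maxEntries times the largest entry size, so the limit can be chosen to fit well inside the 512Mi pod limit.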

Follow-ups

  • P0: Add eviction policy to in-process cache (LRU or TTL-based). Owner: order-svc team
  • P1: Set up memory usage alert at 80% of pod memory limit. Owner: platform team
  • P1: Add load test with sustained traffic to detect memory leaks before deployment. Owner: QA team
  • P2: Review deployment process to include memory profiling for changes that add caching. Owner: platform team