
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy

service: order-svc · created: 6/2/2026, 11:52:28 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV1

order-svc pods are being OOMKilled because of a memory leak introduced in v3.7: an in-process request cache (Map-based, no eviction) grows without bound, taking a pod from about 180Mi to the 512Mi limit in roughly 5 hours. 3 of 12 pods are currently OOMKilled and in CrashLoopBackOff, restarts are at 18/hour (baseline 0–1), and checkout success rate dropped from 99.7% to 96.8%.

Severity reasoning: User-facing degradation on the revenue path: the checkout error rate is 3.2% (success rate 96.8%), above the 1% threshold. This meets SEV1 criteria provided the elevated error rate persists for more than 5 minutes; the raw context covers only 09:12 to 09:16 UTC.

deepseek-chat·prompt v2·output: en·10558ms·1756↑ / 1634↓ tok·$0.00227

Root cause hypotheses

  • high: Memory leak from in-process request cache (Map-based, no eviction) introduced in v3.7

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over ~5h, then OOM. Changelog confirms 'added in-process request cache (Map-based, no eviction)'.

  • medium: Increased request rate causing cache entries to accumulate faster

    Evidence: Memory growth starts at 04:00 UTC, which coincides with the v3.7 deploy; the raw context contains no request-rate data, so a traffic correlation is unverified. The baseline was stable at 200Mi before the deploy, so request rate alone cannot explain the leak.

  • low: Memory limit too low for new cache overhead

    Evidence: Memory limit is 512Mi, and pods crash at ~500Mi. Pre-deploy baseline was 200Mi. If the cache were bounded, 512Mi might be sufficient, but unbounded growth will eventually exceed any fixed limit.
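
As a rough cross-check of the leading hypothesis, the figures in the raw context can be extrapolated to see whether steady linear growth would reach the 512Mi limit around the time of the first OOM kill. This is a minimal sketch that assumes growth stays linear between the two reported readings (180Mi at 04:00 UTC and 500Mi at 09:00 UTC); the variable names are illustrative only.

```python
# Figures taken from the incident's memory trend and pod spec.
START_MIB = 180.0     # reading at 04:00 UTC, when v3.7 was deployed
END_MIB = 500.0       # reading at 09:00 UTC
WINDOW_HOURS = 5.0    # 04:00 -> 09:00 UTC
LIMIT_MIB = 512.0     # container memory limit
DEPLOY_HOUR = 4       # 04:00 UTC

# Assumption: growth is linear over the whole window.
growth_mib_per_hour = (END_MIB - START_MIB) / WINDOW_HOURS
hours_to_limit = (LIMIT_MIB - START_MIB) / growth_mib_per_hour

# Convert the offset from the deploy time into an HH:MM clock time (UTC).
minutes = round(hours_to_limit * 60)
projected_oom = f"{DEPLOY_HOUR + minutes // 60:02d}:{minutes % 60:02d}"

print(f"growth rate: {growth_mib_per_hour:.0f} Mi/h")
print(f"projected time at limit: {projected_oom} UTC")
```

Under that assumption the growth rate is 64 Mi/h and the limit is reached at about 09:11 UTC, which lines up with the first OOMKilled pod at 09:12.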

Investigation checklist

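  1. Check current per-pod memory usage to confirm pods are clustered near the limit (kubectl top is a point-in-time snapshot; use the metrics dashboard for the trend itself)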
    kubectl top pod -n prod -l app=order-svc --sort-by=memory --no-headers | head -12

    Expected: Memory usage near 500Mi for running pods, with recent restarts showing lower values

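  2. Capture a heap snapshot from a running pod to confirm the Map size. The changelog (fast-json-stringify, Map) points to a Node.js service, so Go's /debug/pprof endpoint would not apply. The command below assumes node runs as PID 1 and was started with --heapsnapshot-signal=SIGUSR2; pick a pod with memory headroom, since writing a snapshot temporarily increases memory use.
    kubectl exec -n prod deploy/order-svc -- sh -c 'kill -USR2 1 && sleep 10 && ls -lh *.heapsnapshot'

    Expected: A new .heapsnapshot file in the working directory; when loaded in a heap inspector it shows a Map keyed by request ID retaining most of the heap, with no eviction logic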

  3. Check deploy rollout status and verify v3.7 is fully rolled out
    kubectl rollout status deploy/order-svc -n prod

    Expected: deployment "order-svc" successfully rolled out

  4. Check if any recent config changes or feature flags enabled the cache
    kubectl get configmap -n prod order-svc-config -o yaml | grep -i cache

    Expected: No cache-related config or a flag that enables the cache

  5. Verify HPA metrics and current replica count
    kubectl get hpa -n prod order-svc-hpa -o wide

    Expected: Current replicas at max (12), target memory utilization > 80%

Mitigation plan

  • Roll back order-svc to v3.6 (previous stable version) to remove the memory leak

    Risk: A rolling rollback may cause brief capacity loss during pod replacement; ensure traffic drains properly. No data loss expected.

    Rollback: If v3.6 misbehaves, redeploy v3.6 with the same config, or re-apply v3.7 only as a stopgap (the leak returns and pods will OOM again within about 5 hours).

  • If the rollback cannot start immediately, raise HPA max replicas from 12 to 16 to distribute load and reduce per-pod memory pressure

    Risk: Temporary increase in resource usage; may not prevent OOM if leak continues, but buys time.

    Rollback: Revert HPA max replicas to 12 after rollback or fix.

  • Temporarily raise the memory limit from 512Mi to 1Gi to delay OOM kills while investigating (the leak is unbounded, so this postpones kills rather than preventing them)

    Risk: Higher memory usage on nodes; may cause node pressure if many pods run. Monitor node memory.

    Rollback: Revert limit to 512Mi after fix is deployed.
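
To compare how much time the two stopgap mitigations buy, the same linear model can be applied per pod. This is a sketch under loudly stated assumptions: a freshly started v3.7 pod begins at about 180Mi, grows at 64 Mi/h (from the memory trend), traffic is evenly balanced across pods, total traffic stays constant, and cache growth is proportional to per-pod request rate. None of these are confirmed by the incident data, and `hours_until_oom` is a hypothetical helper.

```python
BASE_MIB = 180.0             # assumed memory of a freshly started v3.7 pod
GROWTH_MIB_PER_HOUR = 64.0   # (500 - 180) Mi over 5 h, from the memory trend


def hours_until_oom(limit_mib, growth=GROWTH_MIB_PER_HOUR, base=BASE_MIB):
    """Hours for a fresh pod to grow from `base` to `limit_mib` at a linear rate."""
    if growth <= 0:
        raise ValueError("growth must be positive for a leaking pod")
    return (limit_mib - base) / growth


# Current state: 12 replicas, 512Mi limit.
current = hours_until_oom(512)

# Option A: raise the limit to 1Gi (1024Mi), same replica count.
raised_limit = hours_until_oom(1024)

# Option B: raise HPA max to 16 replicas. Assumption: each pod then sees
# 12/16 of its previous request rate, so its cache grows proportionally slower.
scaled_growth = GROWTH_MIB_PER_HOUR * 12 / 16
more_replicas = hours_until_oom(512, growth=scaled_growth)

print(f"current:        {current:.1f} h between restarts")
print(f"1Gi limit:      {raised_limit:.1f} h between restarts")
print(f"16 replicas:    {more_replicas:.1f} h between restarts")
```

Under these assumptions neither option stops the leak; both only stretch the interval between OOM kills, which is why the rollback stays the primary mitigation.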

Customer impact

Approximately 3.2% of checkout requests are failing (success rate 96.8%, down from 99.7%). Users may see errors during order placement. Service is partially degraded: 3 of 12 pods (25%) are down, and the remaining pods are serving most traffic.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 04:00 — order-svc v3.7 deployed
  • 09:12 — first OOMKilled pod
  • 09:14 — pager triggered
  • 09:16 — service still serving most traffic via remaining pods; checkout success rate at 96.8% (down from 99.7%)
  • [FILL IN] — rollback initiated
  • [FILL IN] — service stable

Impact

  • 3 of 12 pods in CrashLoopBackOff
  • Checkout success rate: 99.7% → 96.8%
  • p99 latency degraded (not present in the raw context; confirm from dashboards)
  • Checkout error rate approximately 3.2%, up from a 0.3% baseline

Root Cause

Memory leak from in-process request cache (Map-based, no eviction) introduced in v3.7. Cache entries accumulate indefinitely, causing pod memory to grow linearly until OOM.

Detection

  • Pod OOMKilled events triggered pager
  • Memory trend showed linear growth from 180Mi to 500Mi over 5 hours
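
The linear growth pattern could in principle have been caught well before the first OOM kill by extrapolating the trend. The sketch below is a plain least-squares extrapolation in the spirit of PromQL's predict_linear, not a reimplementation of it: it predicts from the last sample rather than from the query evaluation time. The three sample points are interpolated from the reported 180Mi to 500Mi line, not measured values, and the 4-hour horizon is an arbitrary choice for illustration.

```python
def predict_linear(samples, horizon_s):
    """Fit value = slope * t + intercept by least squares over (t_seconds, value)
    pairs and return the predicted value horizon_s seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_s) + intercept


LIMIT_MIB = 512.0

# Hourly points at 04:00, 05:00, 06:00 UTC, interpolated on the 64 Mi/h line
# (assumption: the real series was this linear).
samples = [(0, 180.0), (3600, 244.0), (7200, 308.0)]

predicted = predict_linear(samples, horizon_s=4 * 3600)  # value expected at 10:00 UTC
would_alert = predicted > LIMIT_MIB

print(f"predicted memory in 4h: {predicted:.0f} Mi, alert: {would_alert}")
```

With these interpolated points, a check evaluated at 06:00 UTC would already project memory above the 512Mi limit, about three hours before the pager fired.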

Response

  • [FILL IN]

What Went Well

  • [FILL IN]

What Went Poorly

  • [FILL IN]

Action Items

  • [FILL IN]

Follow-ups

  • P1 · Add memory leak detection test for in-process caches in CI/CD pipeline · platform team
  • P1 · Implement bounded cache with TTL or LRU eviction in order-svc · service owner
  • P2 · Set up memory usage alert with linear growth detection (e.g., PromQL predict_linear) · on-call SRE
  • P2 · Review deploy process to include canary or gradual rollout for memory-sensitive changes · platform team
  • P2 · Add heap profile snapshot to pod termination logs for future debugging · service owner
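
To make the bounded-cache follow-up concrete, here is a minimal sketch of a cache that combines a size cap with LRU eviction and a per-entry TTL. It is written in Python purely for illustration; order-svc appears to be a Node.js service, where the same idea would sit on top of a Map's insertion order. `BoundedCache`, its parameters, and the chosen sizes are hypothetical and not part of the service.

```python
import time
from collections import OrderedDict


class BoundedCache:
    """Cache with a hard entry cap (LRU eviction) and a per-entry TTL."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock              # injectable so tests can control time
        self._data = OrderedDict()       # key -> (expires_at, value), oldest first

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]          # expired entries are dropped on access
            return default
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def set(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)   # evict the least recently used entry

    def __len__(self):
        return len(self._data)


# Simulate the leak scenario: one entry per request ID, many more requests than the cap.
cache = BoundedCache(max_entries=1000, ttl_seconds=60)
for request_id in range(100_000):
    cache.set(f"req-{request_id}", {"status": "ok"})

print(len(cache))  # stays at the cap of 1000 instead of growing to 100000
```

The final loop doubles as the shape of a CI leak test: insert far more unique keys than the cap and assert that the entry count never exceeds it.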