
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy

service: order-svc
created: 6/2/2026, 11:52:41 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely
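
The growth rate implied by these two readings can be worked out directly. The sketch below is a rough two-point estimate that assumes growth stays linear; the type and function names are illustrative, and only the 180Mi, 500Mi, and 512Mi figures (plus the 1Gi limit proposed later as a stopgap) come from this incident.

```typescript
// Two-point leak-rate estimate from the memory trend above.
// Assumes linear growth; names are illustrative, not from the service.

interface MemoryReading {
  hourUtc: number; // hours since midnight UTC, e.g. 9 for 09:00
  usedMi: number; // container memory in MiB
}

/** Growth rate in MiB per hour between two readings. */
function leakRateMiPerHour(start: MemoryReading, end: MemoryReading): number {
  return (end.usedMi - start.usedMi) / (end.hourUtc - start.hourUtc);
}

/** Hours after `start` at which usage reaches `limitMi`, given a constant rate. */
function hoursUntilLimit(start: MemoryReading, rateMiPerHour: number, limitMi: number): number {
  return (limitMi - start.usedMi) / rateMiPerHour;
}

const readingAtDeploy: MemoryReading = { hourUtc: 4, usedMi: 180 };
const readingBeforeCrash: MemoryReading = { hourUtc: 9, usedMi: 500 };

const leakRate = leakRateMiPerHour(readingAtDeploy, readingBeforeCrash); // 64 MiB/h
const hoursTo512Mi = hoursUntilLimit(readingAtDeploy, leakRate, 512); // 5.1875 h
const hoursTo1Gi = hoursUntilLimit(readingAtDeploy, leakRate, 1024); // 13.1875 h
```

At 64Mi/hour a pod that started at 04:00 UTC would reach the 512Mi limit at about 09:11 UTC, which lines up with the first OOMKilled pod at 09:12.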

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV1

order-svc pods are being OOMKilled and entering CrashLoopBackOff because of a memory leak introduced in v3.7: an unbounded in-process request cache (a Map with no eviction). 3 of 12 pods are currently OOMKilled, the HPA is already at its maximum of 12 replicas, and checkout success rate has dropped from 99.7% to 96.8%.

Severity reasoning: User-facing degradation on the revenue path: the checkout error rate rose from 0.3% to 3.2%, above the 1% threshold. The dip was first recorded at 09:16 UTC, so the >5 min duration criterion still needs to be confirmed.

deepseek-chat·prompt v2·output: en·9227ms·1756↑ / 1383↓ tok·$0.00200

Root cause hypotheses

  • [high] Unbounded in-process request cache (Map) introduced in v3.7 causes memory exhaustion

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over 5 hours.

  • [low] JSON parser switch (fast-json-stringify) has a memory leak

    Evidence: No direct evidence; memory growth pattern is linear and consistent with cache, not parser. Parser is unlikely to hold state across requests.

  • [medium] Increased traffic volume causing normal memory usage to exceed limits

    Evidence: HPA is at max replicas (12) and per-pod memory climbs to the 512Mi limit. However, memory was stable at 200Mi before the deploy, growth began at the 04:00 UTC rollout, and the incident context does not mention any change in traffic.

Investigation checklist

  1. Check memory usage per pod over time to confirm leak pattern
    kubectl top pod -n prod -l app=order-svc --sort-by=memory --no-headers | head -12

    Expected: Longest-running pods approach the 512Mi limit; recently restarted pods start low and climb steadily

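  2. Inspect a heap snapshot from a running pod to identify the largest allocations
    kubectl exec -n prod deploy/order-svc -- sh -c 'kill -USR2 1'

    Assumes a Node.js service (suggested by fast-json-stringify and Map in the changelog) with node running as PID 1 and started with --heapsnapshot-signal=SIGUSR2; adjust to the heap tooling the service actually exposes. Writing a snapshot needs extra memory, so pick a pod that is not yet close to its limit.

    Expected: The largest retainer is a Map keyed on request ID whose entries are never removed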

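  3. Check deploy history and changelog to confirm cache introduction
    kubectl rollout history deployment/order-svc -n prod

    Expected: The latest revision corresponds to the v3.7 rollout at 04:00 UTC. kubectl only shows the change-cause annotation, if one was recorded; the changelog entry 'in-process request cache (Map-based, no eviction)' comes from the release notes.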

  4. Verify memory limit and request settings
    kubectl get deployment order-svc -n prod -o jsonpath='{.spec.template.spec.containers[0].resources}'

    Expected: limits: memory: 512Mi, requests: memory: 256Mi

  5. Check if HPA is at max replicas and if scaling would help
    kubectl get hpa order-svc -n prod -o wide

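    Expected: Current replicas at the maximum of 12. Further scaling is unlikely to help even if the maximum were raised, because the leak is per pod and every new pod would grow toward the limit too.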
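
To make step 1 repeatable, the output of `kubectl top pod --no-headers` can be filtered for pods close to their limit. This is a minimal sketch that assumes the usual three-column output (name, CPU, memory) with memory printed in Mi; the pod names in the sample are made up, and the 90% threshold is an arbitrary choice.

```typescript
// Flags pods whose memory is at or above a fraction of the container limit.
// Input: text from `kubectl top pod --no-headers` (NAME CPU MEMORY per line).

interface PodMemory {
  pod: string;
  memoryMi: number;
}

function parseTopOutput(output: string): PodMemory[] {
  const pods: PodMemory[] = [];
  for (const line of output.split("\n")) {
    const fields = line.trim().split(/\s+/);
    const pod = fields[0] ?? "";
    const memory = fields[2] ?? "";
    // Skip blank or malformed lines, and any unit other than Mi.
    if (pod === "" || !/^\d+Mi$/.test(memory)) continue;
    pods.push({ pod, memoryMi: parseInt(memory, 10) });
  }
  return pods;
}

function podsNearLimit(pods: PodMemory[], limitMi: number, fraction: number): PodMemory[] {
  return pods.filter((p) => p.memoryMi >= limitMi * fraction);
}

// Hypothetical sample output; real pod names and values will differ.
const sampleTopOutput = [
  "order-svc-aaa 12m 498Mi",
  "order-svc-bbb 9m 205Mi",
  "order-svc-ccc 15m 471Mi",
].join("\n");

// Pods at or above 90% of the 512Mi limit (460.8Mi).
const hotPods = podsNearLimit(parseTopOutput(sampleTopOutput), 512, 0.9);
```

With the sample above, `hotPods` contains order-svc-aaa and order-svc-ccc, the two pods that are likely to be OOMKilled next.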

Mitigation plan

  • Roll back order-svc to the previous version (v3.6) to remove the leaky cache: 'kubectl rollout undo deployment/order-svc -n prod --to-revision=<previous>'

    Risk: Pod replacement may briefly reduce capacity while 3 pods are already down. Ensure traffic drains gracefully.

    Rollback: Re-deploy v3.7. This reintroduces the leak, so expect OOMKills to resume after about 5 hours.

  • Increase memory limit to 1Gi temporarily to buy time for investigation

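    Risk: Higher memory usage may cause node pressure; monitor node memory. This is a band-aid, not a fix: at the observed growth of about 64Mi/hour, doubling the limit delays each pod's OOM by roughly 8 hours.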

    Rollback: Revert memory limit to 512Mi by editing deployment: 'kubectl edit deployment order-svc -n prod'

  • Add TTL-based eviction to the cache, or disable the cache via feature flag if one is available

    Risk: If feature flag exists, disabling may cause performance regression but prevents OOM. If not, code change required.

    Rollback: Re-enable cache via feature flag or revert code change
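
The eviction option could take roughly the following shape. This is a minimal sketch, not the service's actual code: the class name, constructor parameters, and the limits in the usage lines are all hypothetical. It combines a TTL with a hard cap on entry count and evicts in insertion order (oldest first), which is simpler than true LRU.

```typescript
// TTL + max-size cache built on Map. Map iterates in insertion order,
// so the first key is always the oldest entry.

class BoundedTtlCache<K, V> {
  private readonly entries = new Map<K, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so tests can control the clock.
  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazily drop expired entries on read
      return undefined;
    }
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key); // re-insert so the key becomes the newest
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // The size cap bounds memory even if expired entries are never read again.
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next();
      if (oldest.done) break;
      this.entries.delete(oldest.value);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Hypothetical usage: at most 10000 entries, each valid for 60 seconds.
const requestCache = new BoundedTtlCache<string, string>(10000, 60000);
requestCache.set("req-123", "cached response");
```

The size cap is what actually bounds memory; the TTL alone only removes entries that are read again after expiry. If request IDs are unique per request, the cache would rarely hit at all, in which case disabling it may be the simpler fix.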

Customer impact

About 3.2% of checkout requests are failing or timing out (success rate 96.8%, down from 99.7%). 3 of 12 pods (25% of capacity) are down, but the remaining pods are still serving most traffic. Users may see 'Checkout failed' errors or long load times. No ETA yet.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

[FILL IN]

Impact

[FILL IN]

Root Cause

[FILL IN]

Detection

[FILL IN]

Response

[FILL IN]

What Went Well

[FILL IN]

What Went Poorly

[FILL IN]

Action Items

[FILL IN]

Follow-ups

  • [P1] Add memory leak detection test to CI/CD pipeline (e.g., load test with memory profiling) · platform team
  • [P0] Implement cache eviction policy (TTL or LRU) for in-process caches · service owner
  • [P1] Set up memory usage alerts with trend detection (linear growth over hours) · on-call SRE
  • [P2] Review all in-process caches across services for unbounded growth · platform team
  • [P1] Add heap dump collection on OOM to improve debugging · platform team
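
The trend-detection alert could be approximated with a least-squares slope over recent memory samples. The sketch below is one possible shape under stated assumptions: the function names and the 16Mi/hour threshold are hypothetical, and the hourly sample values are interpolated between the 180Mi and 500Mi readings from this incident rather than measured.

```typescript
// Least-squares slope of memory usage over time, used as a simple leak signal.

interface TrendPoint {
  hours: number; // hours since the first sample
  usedMi: number; // container memory in MiB
}

/** Ordinary least-squares slope in MiB per hour; 0 when it cannot be computed. */
function memorySlopeMiPerHour(points: TrendPoint[]): number {
  const n = points.length;
  if (n < 2) return 0;
  let sumX = 0;
  let sumY = 0;
  let sumXY = 0;
  let sumXX = 0;
  for (const p of points) {
    sumX += p.hours;
    sumY += p.usedMi;
    sumXY += p.hours * p.usedMi;
    sumXX += p.hours * p.hours;
  }
  const denominator = n * sumXX - sumX * sumX;
  if (denominator === 0) return 0; // all samples share one timestamp
  return (n * sumXY - sumX * sumY) / denominator;
}

/** True when sustained growth meets or exceeds the alert threshold. */
function looksLikeLeak(points: TrendPoint[], minSlopeMiPerHour: number): boolean {
  return memorySlopeMiPerHour(points) >= minSlopeMiPerHour;
}

// Interpolated v3.7 trend (assumed hourly values) versus the flat 200Mi baseline.
const leakingTrend: TrendPoint[] = [180, 244, 308, 372, 436, 500].map(
  (usedMi, hours) => ({ hours, usedMi }),
);
const baselineTrend: TrendPoint[] = [200, 200, 200, 200, 200, 200].map(
  (usedMi, hours) => ({ hours, usedMi }),
);

const leakAlert = looksLikeLeak(leakingTrend, 16); // slope is about 64 MiB/h
const baselineAlert = looksLikeLeak(baselineTrend, 16); // slope is 0
```

A slope threshold catches this kind of leak hours before the limit is hit, whereas a static alert near 512Mi would fire only minutes before the first OOMKill.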