
[Eval][v3][en] Order service OOM crashloop following v3.7 deploy

service: order-svc · created: 6/2/2026, 11:54:05 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV1

order-svc pods are being OOMKilled at a fleet-wide rate of 18 restarts/hour (baseline 0–1) due to a memory leak introduced in v3.7. A new in-process request cache (Map-based, no eviction) holds ~340Mi of heap in the profiled pod, pushing pod memory to the 512Mi limit roughly five hours after start. 3 of 12 pods are in CrashLoopBackOff, HPA is already at max replicas (12), p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%.

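Severity reasoning: User-facing degradation of the revenue path: checkout success rate dropped from 99.7% to 96.8%, a failure rate of about 3.2% (>1% error rate), ongoing since the first OOMKill at 09:12 UTC. The remaining pods still serve most traffic, so checkout is degraded rather than fully down. This meets SEV1 criteria.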

deepseek-chat·prompt v3·output: en·11132ms·2007↑ / 1708↓ tok·$0.00242

Root cause hypotheses

  • high: Memory leak from in-process request cache introduced in v3.7 (Map-based, no eviction).

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory grows linearly from 180Mi to 500Mi over 5 hours post-deploy, then crashes. Changelog confirms 'added in-process request cache (Map-based, no eviction)'.

  • low: Increased request rate causing cache to grow faster than expected.

    Evidence: No evidence of rate change; memory growth is steady and correlates with deploy time, not traffic spikes.

  • low: JSON parser (fast-json-stringify) memory leak.

    Evidence: Heap profile shows Map as dominant consumer, not JSON parser. No other memory growth pattern.
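The linear-growth evidence for the first hypothesis can be checked with simple arithmetic. The sketch below uses only the figures from the incident context (180Mi at 04:00 UTC, 500Mi at 09:00 UTC, 512Mi limit); the function names are illustrative and it assumes growth stays linear.

```typescript
// Growth rate between two memory readings, in Mi per hour.
function growthRateMiPerHour(startMi: number, endMi: number, hours: number): number {
  return (endMi - startMi) / hours;
}

// Hours until a pod reaches its memory limit, assuming linear growth.
function hoursUntilLimit(currentMi: number, limitMi: number, rateMiPerHour: number): number {
  if (rateMiPerHour <= 0) return Infinity; // no growth: the limit is never reached
  return Math.max(0, (limitMi - currentMi) / rateMiPerHour);
}

const rate = growthRateMiPerHour(180, 500, 5); // 04:00 -> 09:00 UTC
const toLimit = hoursUntilLimit(180, 512, rate); // fresh pod against the 512Mi limit

console.log(`growth: ${rate} Mi/hour, time to limit: ${toLimit} hours`);
```

Under these assumptions the rate is 64Mi/hour and a fresh pod reaches 512Mi after about 5.2 hours, which lines up with the 04:00 deploy and the first OOMKill at 09:12. The same function with a 1024Mi limit gives about 13.2 hours, so a larger limit would only delay the crash.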

Investigation checklist

  1. Confirm memory growth trend in Grafana for order-svc pods.
    curl -sG 'http://grafana.example.com/api/datasources/proxy/1/api/v1/query_range' --data-urlencode 'query=avg(container_memory_working_set_bytes{namespace="prod",pod=~"order-svc.*"})' --data-urlencode 'start=2025-04-01T04:00:00Z' --data-urlencode 'end=2025-04-01T09:30:00Z' --data-urlencode 'step=60s' | jq '.data.result[0].values'

    Expected: Linear increase from ~180Mi to ~500Mi over 5 hours.

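  2. Capture a heap snapshot from a running pod to confirm Map size. fast-json-stringify is a Node.js library, so JVM tools such as jcmd and jhat do not apply; the commands below assume the Node.js process runs as PID 1 and was started with --heapsnapshot-signal=SIGUSR2.
    kubectl exec -n prod order-svc-xxx -- kill -USR2 1
    kubectl cp prod/order-svc-xxx:SNAPSHOT_PATH ./order-svc.heapsnapshot   # SNAPSHOT_PATH: the .heapsnapshot file written to the process working directory; open it in Chrome DevTools (Memory tab)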

    Expected: Map object holding ~340Mi, keyed by request ID.

  3. Verify deploy timestamp and changelog for v3.7.
    kubectl rollout history deployment/order-svc -n prod | grep v3.7

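    Expected: A revision whose CHANGE-CAUSE references v3.7 (only present if the kubernetes.io/change-cause annotation is set). rollout history does not show deploy timestamps or changelogs, so confirm the 04:00 UTC deploy time and the changelog entry ('switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)') in the deploy tooling.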

  4. Check if cache eviction or TTL exists in code.
    kubectl exec -n prod deploy/order-svc -- grep -r 'evict\|TTL\|expire' /app/src

    Expected: No matches found, confirming no eviction logic.
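For step 1, the `values` array returned by a Prometheus range query is a list of `[unixSeconds, "valueAsString"]` pairs in bytes, which is awkward to compare with the Mi figures used in this report. A minimal sketch of the conversion follows; the type and function names are hypothetical and the sample values are made up for illustration, not real output from the cluster.

```typescript
// One entry of a series' `values` array in a Prometheus range-query response.
type PromValue = [number, string];

interface MiPoint {
  hoursSinceStart: number;
  memoryMi: number;
}

const BYTES_PER_MI = 1024 * 1024;

// Convert raw byte samples into hours-since-first-sample and Mi.
function toMiSeries(values: PromValue[]): MiPoint[] {
  const first = values[0];
  if (first === undefined) return [];
  const t0 = first[0];
  return values.map(([ts, raw]) => ({
    hoursSinceStart: (ts - t0) / 3600,
    memoryMi: Number(raw) / BYTES_PER_MI,
  }));
}

// Made-up samples: 180Mi, then 500Mi five hours later.
const sample: PromValue[] = [
  [1000, "188743680"],
  [19000, "524288000"],
];
const points = toMiSeries(sample);
console.log(points);
```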

Mitigation plan

  • Roll back order-svc to v3.6 (previous stable version).

    Risk: Brief traffic disruption during rollback; no data loss.

    Rollback: Re-deploy v3.7 if rollback causes issues (unlikely).

  • If rollback not possible, increase memory limit to 1Gi temporarily.

    Risk: May mask leak; could cause node memory pressure if multiple pods scale up.

    Rollback: Revert limit to 512Mi after fix is deployed.

  • Scale down HPA max replicas to 6 to reduce cluster load.

    Risk: May increase latency under high traffic; could overwhelm remaining pods.

    Rollback: Restore max replicas to 12.

Customer impact

Approximately 3% of checkout requests are failing or timing out. Users may see errors when placing orders. The issue started around 09:12 UTC and is ongoing.
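The "approximately 3%" figure follows from the checkout success rates in the incident context. A short worked calculation, with illustrative function and variable names, rounded to one decimal place to avoid floating-point noise:

```typescript
// Convert a success rate (percent) into a failure rate (percent), one decimal place.
function failureRatePct(successRatePct: number): number {
  return Math.round((100 - successRatePct) * 10) / 10;
}

const baselineFailure = failureRatePct(99.7); // pre-incident
const incidentFailure = failureRatePct(96.8); // during the incident
// Percentage points of failures added by the incident.
const addedFailurePoints = Math.round((incidentFailure - baselineFailure) * 10) / 10;

console.log({ baselineFailure, incidentFailure, addedFailurePoints });
```

This gives a 3.2% failure rate during the incident against a 0.3% baseline, so about 2.9 percentage points of failures are attributable to the incident.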

Postmortem draft

Postmortem: order-svc OOMKilled due to memory leak

Summary

On 2025-04-01, order-svc v3.7 introduced a memory leak via an unbounded in-process request cache, causing pods to be OOMKilled once they reached the 512Mi limit, about five hours after the deploy. Checkout success rate dropped from 99.7% to 96.8%. Mitigated by rollback to v3.6.

Timeline (UTC)

  • 04:00 - v3.7 deployed
  • 09:12 - First OOMKilled pod
  • 09:14 - Pager triggered
  • 09:16 - Incident declared (SEV1)
  • [FILL IN] - Rollback initiated
  • [FILL IN] - All pods stable

Impact

  • 3 of 12 pods in CrashLoopBackOff
  • Checkout success rate: 96.8% (vs 99.7% baseline)
  • p99 latency degraded
  • Duration: [FILL IN] minutes

Root Cause

In-process request cache (Map) added in v3.7 without eviction or TTL. The cache grew unbounded, holding ~340Mi of heap in the profiled pod and pushing total pod memory past the 512Mi limit.
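One possible shape for a bounded replacement is sketched below: a Map-backed cache with both a maximum entry count (least recently used entry evicted first) and a per-entry TTL. This is an assumption about what a fix could look like, not the actual order-svc code; the class name, `maxEntries`, `ttlMs`, and the injectable clock are all hypothetical.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // clock value (ms) after which the entry is stale
}

class BoundedCache<K, V> {
  private readonly entries = new Map<K, CacheEntry<V>>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it on read
      return undefined;
    }
    // Re-insert so the key becomes the most recently used
    // (a Map iterates in insertion order).
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key); // refresh position if the key already exists
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry: the first key in iteration order.
      const oldest = this.entries.keys().next();
      if (!oldest.done) this.entries.delete(oldest.value);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Demo with a fake clock so expiry is deterministic.
let fakeNowMs = 0;
const demoCache = new BoundedCache<string, string>(2, 1000, () => fakeNowMs);
demoCache.set("req-1", "a");
demoCache.set("req-2", "b");
demoCache.get("req-1"); // touch req-1 so req-2 becomes least recently used
demoCache.set("req-3", "c"); // over capacity: evicts req-2
```

Expired entries are only removed when they are read, so the entry cap is what bounds memory; the TTL limits how stale a served value can be.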

Detection

  • Pager from OOMKilled alerts
  • Grafana memory trend showed linear growth

Response

  • Rolled back to v3.6
  • Increased memory limit temporarily

What Went Well

  • Quick detection via OOMKilled alerts
  • Heap dump analysis identified root cause rapidly

What Went Poorly

  • Code review did not catch the missing eviction
  • No load testing with realistic cache growth

Action Items

  • [ ] Add TTL/eviction to cache (P0)
  • [ ] Add memory leak detection in CI (P1)
  • [ ] Review changelog for similar patterns (P2)

Follow-ups

  • P0: Add TTL or LRU eviction to the in-process request cache in order-svc. Owner: order-svc team
  • P1: Add a memory leak detection test (e.g., heap growth over time) to the CI pipeline. Owner: platform team
  • P1: Review all recent deployments for similar unbounded caches or missing eviction. Owner: service owners
  • P2: Set up an alert on memory growth rate for order-svc. The leak in this incident grew at roughly 64Mi/hour (180Mi to 500Mi over 5 hours), so the threshold must sit below that; a >100Mi/hour threshold would not have fired. Owner: on-call SRE
  • P2: Update the postmortem with exact timestamps and customer impact numbers. Owner: incident commander
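The growth-rate alert follow-up can be prototyped offline before it is written as an alerting rule. The sketch below fits a least-squares slope to memory samples and compares it with a threshold. The names are hypothetical, and the samples are synthetic: they are interpolated from the two endpoints in the incident context (180Mi rising to 500Mi over 5 hours), not measured data.

```typescript
interface MemorySample {
  hours: number; // time since the first sample
  memoryMi: number;
}

// Ordinary least-squares slope of memory over time, in Mi per hour.
function growthSlopeMiPerHour(samples: MemorySample[]): number {
  const n = samples.length;
  if (n < 2) return 0;
  const meanT = samples.reduce((sum, p) => sum + p.hours, 0) / n;
  const meanM = samples.reduce((sum, p) => sum + p.memoryMi, 0) / n;
  let numerator = 0;
  let denominator = 0;
  for (const p of samples) {
    numerator += (p.hours - meanT) * (p.memoryMi - meanM);
    denominator += (p.hours - meanT) ** 2;
  }
  return denominator === 0 ? 0 : numerator / denominator;
}

function shouldAlert(samples: MemorySample[], thresholdMiPerHour: number): boolean {
  return growthSlopeMiPerHour(samples) > thresholdMiPerHour;
}

// Synthetic hourly samples: 180Mi at hour 0, rising 64Mi/hour to 500Mi at hour 5.
const incidentSamples: MemorySample[] = [0, 1, 2, 3, 4, 5].map((h) => ({
  hours: h,
  memoryMi: 180 + 64 * h,
}));

const firesAt100 = shouldAlert(incidentSamples, 100);
const firesAt30 = shouldAlert(incidentSamples, 30);
console.log({ firesAt100, firesAt30 });
```

With these samples the fitted slope is 64Mi/hour, so a 100Mi/hour threshold stays silent while a 30Mi/hour threshold fires. The 30Mi/hour value is only an example; a real threshold should be tuned against normal traffic-driven variation.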