
[Eval][v2][en] Order service OOM crashloop following v3.7 deploy

service: order-svc
created: 6/2/2026, 11:37:01 PM

Raw incident context

Time: 09:15 UTC. order-svc pods entering CrashLoopBackOff with OOMKilled exit reason.

Kubernetes:
- Pods: 12 total, 3 currently OOMKilled, 9 running
- Memory limit: 512Mi, requests: 256Mi
- Recent restarts/hour: 18 (up from baseline of 0–1)
- HPA scaling has hit max replicas (12)

Memory trend (last 24h):
- Steady linear growth from 180Mi at 04:00 UTC to 500Mi at 09:00 UTC, then crash
- Pre-deploy baseline: stable at 200Mi indefinitely

Deploy history:
- 04:00 UTC — order-svc v3.7 deployed (rolling, 100% complete by 04:08)
- Changelog: "switched JSON parser to fast-json-stringify; added in-process request cache (Map-based, no eviction)"

Logs:
- No FATAL/ERROR pattern; pods exit silently
- Heap profile from one captured pod shows: ~340Mi held by Map keyed on request ID, never cleared

On-call:
- 09:12 — first OOMKilled pod
- 09:14 — pager
- 09:16 — service still serving most traffic via remaining pods; checkout success rate dipped from 99.7% → 96.8%

Summary

SEV1

order-svc pods are being OOMKilled in a restart loop due to a memory leak introduced in v3.7. The leak is caused by an in-process request cache (Map-based, no eviction) that grows unboundedly. 3 of 12 pods are down, p99 latency is degraded, and checkout success rate dropped from 99.7% to 96.8%.

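Severity reasoning: User-facing degradation on the revenue path: the checkout error rate is 3.2% (success rate 96.8%), which exceeds the 1% threshold. The >5 min duration criterion still needs confirming, since the first OOMKill was at 09:12 and the dip was recorded at 09:16. Meets SEV1 criteria once that duration is confirmed.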

deepseek-chat·prompt v2·output: en·10713ms·1756↑ / 1582↓ tok·$0.00221

Root cause hypotheses

  • high: In-process request cache (Map-based, no eviction) introduced in v3.7 grows unboundedly, consuming all pod memory.

    Evidence: Heap profile shows ~340Mi held by Map keyed on request ID, never cleared. Memory trend shows linear growth from 180Mi to 500Mi over 5 hours, then OOM.

  • medium: Memory limit (512Mi) is too low for the new cache, assuming the cache was intended to be bounded.

    Evidence: Pre-deploy baseline was stable at 200Mi. Cache was added without eviction policy, so it's not a limit issue but a leak.

  • low: External dependency (e.g., database connection pool) is leaking connections, causing memory pressure.

    Evidence: No connection pool errors in logs; heap profile shows Map as primary consumer, not connection objects.

Investigation checklist

  1. Check current pod status and memory usage across all replicas.
    kubectl top pod -n prod -l app=order-svc --sort-by=memory

    Expected: Memory usage near 500Mi on pods that have not yet restarted; recently restarted pods report lower usage, and pods in CrashLoopBackOff may not be listed at all.

  2. Verify the heap profile from a captured pod to confirm Map memory dominance.
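    kubectl exec -n prod deploy/order-svc -- sh -c 'curl -s localhost:6060/debug/pprof/heap?debug=1 | head -100'

    Expected: The Map keyed on request ID dominates retained memory (~340Mi in the captured profile). kubectl exec does not accept a label selector, so the command targets the deployment and runs in one of its pods. The pprof endpoint on port 6060 is an assumption about how order-svc exposes heap profiles.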

  3. Check deploy changelog to confirm the cache addition and absence of eviction.
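    kubectl rollout history deployment/order-svc -n prod --revision=3 | grep -i cache

    Expected: A line such as 'added in-process request cache (Map-based, no eviction)'. This only matches if the changelog is recorded in the kubernetes.io/change-cause annotation. Revision 3 is a placeholder; run the command without --revision first to find the revision that corresponds to v3.7.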

  4. Monitor memory growth rate on a running pod to confirm linear trend.
    kubectl exec -n prod deploy/order-svc -- sh -c 'while true; do cat /sys/fs/cgroup/memory/memory.usage_in_bytes; sleep 60; done'

    Expected: Memory increases by ~1Mi per minute (320Mi over 5 hours), consistent with the leak. On cgroup v2 nodes, read /sys/fs/cgroup/memory.current instead.

  5. Check if any recent traffic spike correlates with faster growth.
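    kubectl logs -n prod -l app=order-svc --since=1h --tail=-1 | grep -c 'request'

    Expected: Total number of log lines containing 'request' across the selected pods in the last hour (without --tail=-1, a label selector returns only the last 10 lines per pod). Divide by 60 for a per-minute rate and compare with the memory growth rate.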
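
The growth rate expected in step 4 can be derived from the memory trend in the raw context. The sketch below assumes growth is strictly linear, which the trend suggests but does not prove; the function names are illustrative.

```typescript
// Figures taken from the memory trend and Kubernetes sections of this report.
const startMi = 180;   // usage at 04:00 UTC
const endMi = 500;     // usage at 09:00 UTC
const windowHours = 5; // 04:00 to 09:00 UTC
const limitMi = 512;   // container memory limit

// Average growth rate, assuming linear growth over the window.
function growthRateMiPerHour(start: number, end: number, hours: number): number {
  return (end - start) / hours;
}

// Hours needed to climb from `start` to `limit` at a constant rate.
function hoursToLimit(start: number, limit: number, rateMiPerHour: number): number {
  return (limit - start) / rateMiPerHour;
}

const rate = growthRateMiPerHour(startMi, endMi, windowHours);
const hours = hoursToLimit(startMi, limitMi, rate);

console.log(`growth: ${rate} Mi/h (${(rate / 60).toFixed(2)} Mi/min)`);
console.log(`time to limit: ${hours.toFixed(2)} h after deploy`);
```

With these inputs the rate is 64 Mi/h (about 1.07 Mi/min) and the limit is reached roughly 5.19 hours after the 04:00 deploy, that is around 09:11 UTC, which lines up with the first OOMKilled pod at 09:12.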

Mitigation plan

  • Rollback order-svc to previous version (v3.6) to remove the leaky cache.

    Risk: A rolling rollback briefly reduces capacity while pods are replaced, and the cache's benefits are lost. Otherwise low risk.

    Rollback: If v3.6 causes issues, re-deploy v3.7 (accepting the leak) or deploy a build with the cache fixed.

  • If rollback not immediately possible, increase memory limit to 1Gi to buy time.

    Risk: Higher memory usage may cause node pressure; temporary fix only.

    Rollback: Revert memory limit to 512Mi after rollback or fix.

  • Scale up HPA max replicas to 16 to spread load and reduce per-pod memory pressure.

    Risk: More pods may increase database connection count; monitor for connection pool exhaustion.

    Rollback: Reduce max replicas back to 12.
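
To gauge how much time the two stopgaps above would buy, the sketch below estimates hours until a freshly restarted pod reaches its limit. It rests on assumptions not confirmed in this report: growth stays linear at the observed 64 Mi/h, a restarted pod starts near 180Mi, per-pod growth scales with per-pod request volume (the cache is keyed on request ID), and traffic is spread evenly across replicas. The helper name is hypothetical.

```typescript
// Hypothetical helper: hours until a freshly restarted pod hits its memory limit.
// Assumes linear growth and that spreading the same traffic over more pods
// slows each pod's growth proportionally.
function hoursToOom(
  limitMi: number,
  replicas: number,
  baselineMi: number = 180,           // post-restart usage, from the memory trend
  observedRateMiPerHour: number = 64, // (500 - 180) / 5 h, observed at 12 replicas
  observedReplicas: number = 12,
): number {
  const perPodRate = observedRateMiPerHour * (observedReplicas / replicas);
  return (limitMi - baselineMi) / perPodRate;
}

const scenarios = [
  { label: "current (512Mi, 12 pods)", limitMi: 512, replicas: 12 },
  { label: "1Gi limit, 12 pods", limitMi: 1024, replicas: 12 },
  { label: "512Mi limit, 16 pods", limitMi: 512, replicas: 16 },
];

for (const s of scenarios) {
  const h = hoursToOom(s.limitMi, s.replicas);
  console.log(`${s.label}: ~${h.toFixed(1)} h per restart cycle`);
}
```

Under these assumptions, either change only lengthens the restart cycle; neither removes the leak, so the rollback remains the primary mitigation.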

Customer impact

Approximately 3% of checkout requests are failing or timing out (success rate down from 99.7% to 96.8%). Users see 'Checkout failed' errors. 3 of 12 pods (25%) are down, but the remaining pods are still serving most traffic. No data loss. ETA for full recovery: 30 minutes after rollback.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 04:00 - order-svc v3.7 deployed with in-process request cache (no eviction)
  • 04:08 - Rolling deploy of v3.7 100% complete
  • 09:12 - First pod OOMKilled
  • 09:14 - Pager triggered
  • 09:16 - Incident declared; checkout success rate drops to 96.8%
  • [FILL IN] - Rollback initiated
  • [FILL IN] - All pods stable, success rate recovers

Impact

  • 3 of 12 pods OOMKilled in restart loop
  • p99 latency degraded
  • Checkout success rate: 99.7% → 96.8%
  • Duration: [FILL IN]

Root Cause

Memory leak in in-process request cache (Map-based, no eviction) introduced in v3.7. Cache entries never removed, causing unbounded memory growth until OOM.
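
For illustration, one way the unbounded Map described above could be bounded is a size cap with least-recently-used eviction plus a per-entry TTL. This is a minimal sketch, not the order-svc implementation: the class name, the limits, and the injectable clock are all hypothetical.

```typescript
// Hypothetical bounded replacement for an unbounded Map-based cache.
// Names and limits are illustrative, not taken from the order-svc codebase.
class BoundedCache<K, V> {
  private entries = new Map<K, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it
      return undefined;
    }
    // Re-insert so this key becomes the most recently used
    // (a Map iterates in insertion order).
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    // Evict least recently used entries until within the size cap.
    while (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next();
      if (oldest.done) break;
      this.entries.delete(oldest.value);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage with a fake clock so the behaviour is deterministic.
let clock = 0;
const cache = new BoundedCache<string, string>(2, 1000, () => clock);
cache.set("req-1", "a");
cache.set("req-2", "b");
cache.get("req-1");      // touch req-1, so req-2 becomes the oldest
cache.set("req-3", "c"); // over the cap: evicts req-2
console.log(cache.size, cache.get("req-2")); // 2 undefined
```

The size cap bounds memory even under a traffic spike, while the TTL stops stale entries from lingering during quiet periods. Expired entries are only removed on access or eviction here; a periodic sweep would be a separate design choice.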

Detection

Pager alert for OOMKilled pods; success rate monitoring showed dip.

Response

  • Identified memory leak via heap profile
  • Rolled back to v3.6 [FILL IN: confirm once executed]
  • Increased memory limit temporarily [FILL IN: confirm once executed]

What Went Well

  • Quick detection via monitoring
  • Heap profile available for diagnosis

What Went Poorly

  • No eviction policy in cache design
  • No memory limit testing before deploy

Action Items

  • [FILL IN]

Follow-ups

  • P0: Add eviction policy (TTL or LRU) to in-process cache before re-enabling. (service owner)
  • P1: Set up memory usage alert at 80% of limit to warn before OOM. (on-call SRE)
  • P1: Add memory leak detection test in CI (e.g., heap profile diff). (platform team)
  • P2: Review deploy process for performance regression testing. (service owner)
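
As a rough check on the 80% alert follow-up, the sketch below estimates how much warning such an alert would have given in this incident. It assumes the same linear 64 Mi/h growth seen in the memory trend; the function name is illustrative.

```typescript
// Warning time between an alert firing at a fraction of the limit
// and the limit itself, assuming a constant growth rate.
function alertLeadTimeHours(
  limitMi: number,
  alertFraction: number,
  rateMiPerHour: number,
): number {
  const thresholdMi = limitMi * alertFraction;
  return (limitMi - thresholdMi) / rateMiPerHour;
}

// 512Mi limit, alert at 80%, observed growth of (500 - 180) / 5 = 64 Mi/h.
const lead = alertLeadTimeHours(512, 0.8, 64);
console.log(`alert lead time: ~${lead.toFixed(1)} h`);
```

With these inputs the alert would fire at about 410Mi and give roughly 1.6 hours of warning before the limit. A faster leak would shrink that window, so a rate-of-change alert may be worth considering alongside the static threshold.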