
[Eval][v3][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc
created: 6/2/2026, 11:59:06 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

Checkout success rate dropped from 99.5% to 22% after an upstream Stripe Connect API latency spike (p99 27s) exhausted the checkout-svc thread pool and pushed requests into the 30s gateway timeout. The Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC. With no circuit breaker or bulkhead on the Stripe client, all 200 worker threads are blocked and the inbound queue depth has reached 4200 (limit 5000). About 3000 checkouts have failed in 5 minutes, with an estimated $180k in lost GMV so far.

Severity reasoning: User-facing outage: success rate dropped to 22% (below 99% threshold) for >5 minutes, revenue path broken (checkout failures causing $180k GMV loss). Matches SEV1 criteria: error rate >1% for >5 min, revenue path broken.

deepseek-chat·prompt v3·output: en·12942ms·2002↑ / 2007↓ tok·$0.00275

Root cause hypotheses

  • high: Stripe Connect API latency spike causes thread pool exhaustion in checkout-svc

    Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC; thread pool utilization 100%, all 200 threads blocked.

  • high: No circuit breaker on Stripe client allows cascading failures

    Evidence: Config review shows no circuit breaker or bulkhead; Stripe calls share main worker thread pool; inbound queue depth 4200 (limit 5000) indicates requests piling up.

  • medium: Stripe call timeout (30s) equals the inbound gateway timeout, so calls never fail fast

    Evidence: Stripe call timeout is 30s, the same as the inbound gateway timeout; with checkout p99 latency at 28s, a thread waiting on Stripe can stay blocked for up to the full 30s before it is released.

  • medium: Regional issue in us-east-1 affecting Stripe Connect

    Evidence: Stripe status page specifically calls out us-east-1; checkout-svc likely uses us-east-1 Stripe endpoints.

  • low: auth-svc or fraud-svc contributing to latency

    Evidence: APM shows auth-svc and fraud-svc latencies normal; ruled out.
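
The thread pool exhaustion hypothesis can be sanity-checked with Little's law: a fully blocking pool of N threads, each held for W seconds per request, completes at most N / W requests per second. The sketch below plugs in the figures from the incident context. Using the Stripe p99 latencies as the per-request hold time is an assumption made for illustration, since mean latency was not reported.

```python
# Little's law for a blocking worker pool:
# max throughput (req/s) = threads / seconds each request holds a thread.

WORKER_THREADS = 200  # from the incident context


def max_throughput(threads: int, hold_seconds: float) -> float:
    """Upper bound on requests per second for a fully blocking thread pool."""
    return threads / hold_seconds


# Assumption: Stripe p99 latency stands in for the per-request hold time.
healthy = max_throughput(WORKER_THREADS, 0.8)    # Stripe p99 before: 800 ms
degraded = max_throughput(WORKER_THREADS, 27.0)  # Stripe p99 during: 27 s

print(f"healthy ceiling:  {healthy:.0f} req/s")
print(f"degraded ceiling: {degraded:.1f} req/s")
print(f"capacity lost:    {1 - degraded / healthy:.1%}")
```

Under that assumption the pool's ceiling falls from about 250 req/s to about 7.4 req/s, which is consistent with requests piling up in the inbound queue while auth-svc and fraud-svc stay healthy.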

Investigation checklist

  1. Confirm Stripe status page for Connect latency
    curl -s https://status.stripe.com/ | grep -i 'connect'

    Expected: Should show 'Investigating elevated latency for Connect endpoints in us-east-1' or similar.

  2. Check thread pool utilization and queue depth in checkout-svc
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threadpool | jq '.threadPool.utilization, .queueDepth'

    Expected: Utilization near 100%, queue depth near 4200.

  3. Verify Stripe call latency in APM
    curl -s 'http://apm-dashboard.internal/d/checkout-svc?var-service=checkout-svc&var-downstream=stripe&from=now-15m' | jq '.series[0].values[-1]'

    Expected: p99 latency for Stripe calls should be ~27s.

  4. Check if circuit breaker exists on Stripe client
    kubectl exec -n prod deploy/checkout-svc -- cat /app/config.yml | grep -A5 'stripe' | grep 'circuitBreaker'

    Expected: No output (circuit breaker not configured).

  5. Check other downstream services latency
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.details["auth-svc"].latency, .details["fraud-svc"].latency'

    Expected: Latency should be normal (<100ms).

Mitigation plan

  • Enable circuit breaker on Stripe client to fail fast when latency exceeds threshold (e.g., 2s).

    Risk: May cause partial checkout failures if circuit opens, but prevents thread pool exhaustion. Safer than disabling Stripe entirely.

    Rollback: Disable circuit breaker via feature flag or config revert.

  • Add bulkhead isolation for Stripe calls (dedicated thread pool of 20 threads) to protect main worker pool.

    Risk: If bulkhead is too small, Stripe calls may queue up; but main pool remains healthy.

    Rollback: Remove bulkhead config or increase pool size.

  • Reduce Stripe call timeout from 30s to 5s to fail fast and free threads sooner.

    Risk: May fail payments that would have succeeded during transient latency spikes (a client-side timeout does not cancel the request at Stripe); acceptable during the incident.

    Rollback: Revert timeout to 30s.

  • If Stripe outage persists, disable Stripe payment path and fall back to alternative payment provider or show error message.

    Risk: Lost revenue from Stripe payments; but prevents further failures. Destructive if no fallback.

    Rollback: Re-enable Stripe path when Stripe recovers.
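
The first mitigation can be illustrated with a minimal circuit breaker sketch. It is not the checkout-svc implementation: the class name, the 20-call window, and the 30s cool-down are assumptions chosen for illustration, while the 2s slow-call threshold and 50% failure ratio echo the values proposed in this report. A call counts as a failure if it raises or takes longer than the threshold. Note that the breaker only measures slow calls and does not interrupt them, so it still needs to be paired with a shorter client timeout.

```python
import threading
import time
from collections import deque
from typing import Callable, Deque, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    """Illustrative breaker: opens when too many recent calls fail or run slow."""

    def __init__(self, slow_call_seconds: float = 2.0, failure_ratio: float = 0.5,
                 window: int = 20, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._slow_call_seconds = slow_call_seconds
        self._failure_ratio = failure_ratio
        self._open_seconds = open_seconds
        self._clock = clock
        self._results: Deque[bool] = deque(maxlen=window)  # True means failure
        self._opened_at: Optional[float] = None
        self._lock = threading.Lock()

    def call(self, fn: Callable[[], T]) -> T:
        with self._lock:
            probing = False
            if self._opened_at is not None:
                if self._clock() - self._opened_at < self._open_seconds:
                    raise CircuitOpenError("circuit open; failing fast")
                probing = True  # cool-down elapsed: let this call through as a probe
        start = self._clock()
        try:
            result = fn()
        except Exception:
            self._record(True, probing)
            raise
        self._record(self._clock() - start > self._slow_call_seconds, probing)
        return result

    def _record(self, failed: bool, probing: bool) -> None:
        with self._lock:
            if probing:
                # Simplification: one probe decides whether to close or re-open.
                self._results.clear()
                self._opened_at = self._clock() if failed else None
                return
            self._results.append(failed)
            window_full = len(self._results) == self._results.maxlen
            if window_full and sum(self._results) >= self._failure_ratio * len(self._results):
                self._opened_at = self._clock()
```

A caller would wrap each Stripe request, for example `breaker.call(lambda: create_payment_intent(order))`, where `create_payment_intent` is a hypothetical client function, and map `CircuitOpenError` to an immediate error response instead of holding a worker thread.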

Customer impact

Approximately 78% of checkout attempts are failing with an HTTP 504 error. Users see a timeout or error page after waiting up to 30 seconds. An estimated 3000 checkouts failed in the last 5 minutes, representing approximately $180,000 in lost GMV. No ETA yet; Stripe is investigating.

Postmortem draft

Summary

Checkout success rate dropped from 99.5% to 22% due to an upstream Stripe Connect API latency spike. With no circuit breaker or bulkhead on the Stripe client, the slow calls exhausted the checkout-svc thread pool.

Timeline (UTC)

  • 18:35 - Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1'
  • 18:38 - Pager triggered for checkout-svc high error rate
  • 18:40 - Success rate drops to 22%, p99 latency 28s
  • 18:41 - Confirmed Stripe is culprit via APM and status page
  • 18:43 - Team debating mitigation options
  • [FILL IN] - Mitigation applied
  • [FILL IN] - Recovery complete

Impact

  • ~3000 failed checkouts in 5 minutes
  • Estimated $180k lost GMV (so far)
  • ~78% of checkout attempts failing with HTTP 504

Root Cause

Stripe Connect API latency spike (p99 27s) combined with lack of circuit breaker and bulkhead caused thread pool exhaustion in checkout-svc.

Detection

The pager alert fired on the error rate threshold. Detection could have been faster with a circuit breaker in place and alerting on its metrics.

Response

[FILL IN]

What Went Well

  • Quick identification of upstream dependency
  • Status page confirmation

What Went Poorly

  • No circuit breaker or bulkhead
  • Stripe call timeout too long (30s, equal to the inbound gateway timeout)
  • No automated fallback

Action Items

  • [ ] Add circuit breaker to Stripe client (P0)
  • [ ] Add bulkhead isolation for Stripe calls (P0)
  • [ ] Reduce Stripe timeout to 5s (P1)
  • [ ] Create runbook for Stripe outage (P1)
  • [ ] Monitor Stripe status page proactively (P2)

Follow-ups

  • P0: Add circuit breaker to Stripe client with 2s timeout and 50% failure threshold (owner: payments-platform)
  • P0: Implement bulkhead isolation for Stripe calls with dedicated thread pool of 20 threads (owner: payments-platform)
  • P1: Reduce Stripe call timeout from 30s to 5s (owner: on-call SRE)
  • P1: Create runbook for Stripe outage with fallback procedures (owner: service owner)
  • P2: Set up alert on Stripe status page changes via webhook (owner: platform team)
  • P2: Review and tune thread pool and queue limits for checkout-svc (owner: service owner)
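
The bulkhead and timeout follow-ups can be sketched together. This is a hedged illustration, not the planned implementation: the `Bulkhead` class and its method names are hypothetical, and only the 20-thread pool size and 5s timeout come from the follow-ups above. Stripe calls run on a dedicated pool, callers wait at most the timeout, and when every slot is busy new calls are rejected immediately rather than queued.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class BulkheadFullError(RuntimeError):
    """Raised when every slot is busy; the caller should fail fast."""


class Bulkhead:
    """Runs calls to one dependency on a dedicated, bounded thread pool."""

    def __init__(self, max_concurrent: int = 20, timeout_seconds: float = 5.0) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent,
                                        thread_name_prefix="stripe")
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._timeout = timeout_seconds

    def call(self, fn, *args):
        # Reject instead of queueing so a slow dependency cannot build a backlog.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("all slots busy; rejecting call")
        try:
            future = self._pool.submit(fn, *args)
        except BaseException:
            self._slots.release()
            raise
        # The slot is freed only when the underlying call really finishes,
        # so calls that are still stuck keep counting against the limit.
        future.add_done_callback(lambda _done: self._slots.release())
        # Raises FutureTimeout if the call outlives the timeout; the caller's
        # thread is released even though the pooled call keeps running.
        return future.result(timeout=self._timeout)
```

With this shape, a Stripe slowdown can tie up at most 20 dedicated threads, and each main worker thread waits no longer than 5s before it can return an error. The trade-off noted in the mitigation plan still applies: a pool that is too small will reject calls during normal traffic peaks.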