
[Eval][v3][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc · created: 6/2/2026, 11:38:59 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and HTTP 504 errors. Stripe reported issues with their Connect endpoints in us-east-1 starting at 18:35 UTC. No circuit breaker or bulkhead on Stripe calls led to cascading failure.

Severity reasoning: User-facing outage: checkout success rate dropped to 22% (below 99% threshold), error rate >1% for >5 minutes, revenue path broken (estimated $180k GMV lost). This meets SEV1 criteria.

deepseek-chat·prompt v3·output: en·12353ms·2002↑ / 2026↓ tok·$0.00277

Root cause hypotheses

  • [high] Stripe Connect API latency spike caused thread pool exhaustion in checkout-svc

    Evidence: APM shows Stripe Connect API p99 latency jumped from 800ms to 27s. Thread pool utilization at 100%, all 200 threads blocked. Stripe status page confirms elevated latency for Connect endpoints in us-east-1.

  • [high] No circuit breaker on the Stripe client, so slow calls kept occupying worker threads instead of failing fast

    Evidence: Configuration review shows no circuit breaker or bulkhead for Stripe calls. Inbound queue depth 4200 (limit 5000) indicates requests piling up.

  • [medium] Stripe call timeout (30s) equals the inbound gateway timeout (30s), so no call fails early

    Evidence: Stripe call timeout is 30s, same as inbound gateway timeout. This means downstream calls can block threads for the full timeout period without fast-failing.

  • [medium] Stripe issue is isolated to us-east-1, and checkout-svc appears to call only that region

    Evidence: Stripe status page mentions us-east-1. No evidence of multi-region Stripe configuration.

  • [low] auth-svc or fraud-svc also affected but not detected

    Evidence: APM shows auth-svc and fraud-svc latencies normal. No evidence of impact.
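A rough capacity check supports the first hypothesis. By Little's law (concurrency = throughput × time in system), a fixed pool of threads can sustain at most `threads / hold_time` requests per second. The sketch below treats the Stripe p99 latency as the per-request hold time, which is a pessimistic simplification because most calls finish faster than p99. The function name is illustrative only.

```python
def max_throughput(worker_threads: int, hold_seconds: float) -> float:
    """Upper bound on requests/second when each request pins one thread for hold_seconds."""
    if hold_seconds <= 0:
        raise ValueError("hold_seconds must be positive")
    return worker_threads / hold_seconds


THREADS = 200  # worker pool size from the incident context

healthy = max_throughput(THREADS, 0.8)    # Stripe p99 before the incident (800ms)
degraded = max_throughput(THREADS, 27.0)  # Stripe p99 during the incident (27s)

print(f"healthy ceiling:  {healthy:.0f} req/s")
print(f"degraded ceiling: {degraded:.1f} req/s")
print(f"capacity lost:    {1 - degraded / healthy:.0%}")
```

Under these assumptions the pool's ceiling falls from about 250 req/s to about 7.4 req/s, so any inbound rate above that fills the queue regardless of how healthy the other dependencies are.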

Investigation checklist

  1. Verify Stripe status page for current incident details
    curl -s https://status.stripe.com/api/v1/incidents | jq '.incidents[] | select(.status != "resolved")'

    Expected: Incident with elevated latency for Connect endpoints in us-east-1, status 'investigating' or 'identified'

  2. Check thread pool utilization and queue depth in checkout-svc
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threadpool | jq '.threadPool.utilization, .queueDepth'

    Expected: utilization: 100%, queueDepth: near 5000

  3. Check Stripe client configuration for circuit breaker and timeout
    kubectl exec -n prod deploy/checkout-svc -- cat /app/config/stripe.yml | grep -E 'timeout|circuit|bulkhead'

    Expected: timeout: 30000, no circuit breaker or bulkhead settings

  4. Check APM for Stripe Connect latency trend
    curl -s 'http://apm-dashboard:8080/api/v1/metrics?service=checkout-svc&metric=stripe.latency&time=18:35-18:45' | jq '.series[] | select(.tags.endpoint=="/v1/payment_intents")'

    Expected: p99 latency spiking from 800ms to 27s, starting around 18:35 (the time of Stripe's status page post)

  5. Check if Stripe has a failover region or alternate endpoint
    kubectl exec -n prod deploy/checkout-svc -- cat /app/config/stripe.yml | grep -E 'region|endpoint'

    Expected: Only us-east-1 endpoint configured

Mitigation plan

  • Enable circuit breaker on Stripe client to fail fast when latency exceeds threshold (e.g., 5s)

    Risk: A threshold set too low trips the breaker on normal latency variance; once open, the breaker keeps rejecting legitimate traffic after Stripe recovers until its reset window elapses

    Rollback: Disable circuit breaker by reverting config change

  • Reduce Stripe call timeout from 30s to 5s to free threads faster

    Risk: Stripe calls that are slow but would eventually succeed will time out, which could increase the error rate

    Rollback: Revert timeout to 30s

  • Implement bulkhead isolation for Stripe calls (dedicated thread pool of 50 threads)

    Risk: May limit throughput to Stripe; if bulkhead pool is too small, Stripe calls may queue up

    Rollback: Remove bulkhead config or increase pool size

  • If Stripe incident persists, disable Stripe payment path and fall back to alternative payment provider or show maintenance page

    Risk: Lost revenue from Stripe payments; customer frustration if payment fails

    Rollback: Re-enable Stripe path after incident resolved
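The fail-fast behavior in the first mitigation can be sketched as a small consecutive-failure circuit breaker. This is a minimal illustration, not checkout-svc's actual client: `CircuitBreaker`, `CircuitOpenError`, the threshold of 5 failures, and the 30s reset window are all hypothetical, and it assumes the wrapped Stripe call raises an exception when it exceeds its timeout. It is not thread-safe as written; a production version would guard the state with a lock or use an established resilience library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock            # injectable so the breaker can be tested without sleeping
        self._failures = 0             # consecutive failures while closed
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"         # reset window elapsed: allow a probe call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("Stripe circuit open; failing fast")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed half-open probe, or reaching the threshold, (re)opens the circuit.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would wrap each Stripe request, for example `breaker.call(lambda: create_payment_intent(order))`, where `create_payment_intent` stands in for whatever function issues the request. While the circuit is open, requests fail in microseconds instead of holding a worker thread for up to 30s.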

Customer impact

Approximately 78% of checkout attempts are failing with an HTTP 504 error. Customers whose payments are processed through Stripe cannot complete purchases. An estimated 3000 checkouts failed in the last 5 minutes, with roughly $180k in lost GMV so far. No ETA for resolution; Stripe is investigating.
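The reported figures imply a burn rate that can inform the "wait it out vs disable Stripe" decision. The sketch below is simple arithmetic on the numbers from the incident context; it assumes every failed checkout is a fully lost order (customers who retry successfully later would lower the real figure) and that the failure rate stays constant, neither of which the source confirms.

```python
FAILED_CHECKOUTS = 3000   # from the incident context
WINDOW_MINUTES = 5
LOST_GMV_USD = 180_000

avg_order_usd = LOST_GMV_USD / FAILED_CHECKOUTS
failures_per_minute = FAILED_CHECKOUTS / WINDOW_MINUTES
gmv_loss_per_minute = LOST_GMV_USD / WINDOW_MINUTES


def projected_gmv_loss(extra_minutes: float) -> float:
    """Linear projection of total lost GMV if the current rate holds for extra_minutes more."""
    return LOST_GMV_USD + gmv_loss_per_minute * extra_minutes


print(f"implied average order value: ${avg_order_usd:.0f}")
print(f"failed checkouts per minute: {failures_per_minute:.0f}")
print(f"GMV lost per minute:         ${gmv_loss_per_minute:,.0f}")
print(f"projected total after 10 more minutes: ${projected_gmv_loss(10):,.0f}")
```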

Postmortem draft

Summary

Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion. No circuit breaker or bulkhead on Stripe calls led to cascading failure.

Timeline (UTC)

  • 18:35 - Stripe status page reports elevated latency for Connect endpoints in us-east-1
  • 18:38 - Pager alert for checkout-svc high error rate
  • 18:40 - Checkout success rate at 22%, p99 latency 28s
  • 18:41 - Confirmed Stripe as culprit via APM and status page
  • 18:43 - Team debating mitigation options
  • [FILL IN] - Mitigation applied
  • [FILL IN] - Incident resolved

Impact

  • 3000 failed checkouts in 5 minutes
  • Estimated $180k lost GMV
  • ~78% of checkout attempts failing with HTTP 504

Root Cause

Stripe Connect API latency spike (p99 from 800ms to 27s) combined with lack of circuit breaker and bulkhead caused thread pool exhaustion in checkout-svc.
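To illustrate the missing bulkhead, here is a minimal sketch that caps how many Stripe calls may be in flight at once, so a slow dependency cannot occupy every worker thread. It uses a semaphore rather than the dedicated thread pool proposed in the mitigation plan; both bound concurrency, but the semaphore variant is shorter to show. `Bulkhead`, `BulkheadFullError`, and the limit of 50 are illustrative assumptions, not existing checkout-svc code.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when the concurrency limit is reached and the call is rejected."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Non-blocking acquire: reject immediately rather than queue behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("Stripe bulkhead full; rejecting call")
        try:
            return fn()
        finally:
            self._slots.release()


# At most 50 of the 200 worker threads may be blocked on Stripe at any moment.
stripe_bulkhead = Bulkhead(max_concurrent=50)
```

With this limit, the remaining 150 threads stay available for requests that do not depend on Stripe, and rejected calls return an error right away instead of waiting for the gateway timeout.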

Detection

Pager alert fired at 18:38 UTC, three minutes after Stripe's status page post ([FILL IN] - alert condition). APM and the Stripe status page confirmed the external dependency issue at 18:41.

Response

  • [FILL IN] - Actions taken

What Went Well

  • Stripe identified as the trigger within 3 minutes of the page (18:38 to 18:41) via APM and the status page
  • [FILL IN] - Team communication (confirm with responders)

What Went Poorly

  • No circuit breaker or bulkhead on Stripe calls
  • Stripe call timeout (30s) equaled the inbound gateway timeout (30s), so blocked threads were held for the full request lifetime
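One way to keep a downstream timeout strictly below the inbound one is to derive it from the remaining request budget. The helper below is a hedged sketch: `downstream_timeout`, the 5s configured cap, and the 1s reserve for building the error response are assumptions for illustration, not values taken from checkout-svc.

```python
def downstream_timeout(inbound_timeout_s: float, elapsed_s: float,
                       configured_timeout_s: float, reserve_s: float = 1.0) -> float:
    """Timeout to use for a downstream call, never exceeding the remaining inbound budget.

    reserve_s is held back so the service can still respond before the gateway gives up.
    Returns 0.0 when no budget is left, meaning the caller should fail without calling.
    """
    remaining = inbound_timeout_s - elapsed_s - reserve_s
    return max(0.0, min(configured_timeout_s, remaining))


# Fresh request: the configured 5s cap applies, well under the 30s gateway timeout.
print(downstream_timeout(30.0, 0.2, 5.0))
# Request already 27s old: only 2s of usable budget remain.
print(downstream_timeout(30.0, 27.0, 5.0))
# Budget exhausted: skip the Stripe call entirely.
print(downstream_timeout(30.0, 29.5, 5.0))
```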

Action Items

  • [ ] Add circuit breaker to Stripe client
  • [ ] Implement bulkhead isolation for downstream calls
  • [ ] Reduce Stripe timeout to 5s
  • [ ] Add monitoring for thread pool utilization and queue depth
  • [ ] Review all external dependency configurations for resilience patterns
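The thread pool monitoring item could start from a check like the one below. It is a sketch under stated assumptions: `PoolSnapshot`, `saturation_alerts`, and the 90% / 80% thresholds are hypothetical and would need tuning against normal traffic. How the snapshot values are collected is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PoolSnapshot:
    active_threads: int
    max_threads: int
    queue_depth: int
    queue_limit: int


def saturation_alerts(s: PoolSnapshot, util_threshold: float = 0.9,
                      queue_threshold: float = 0.8) -> List[str]:
    """Return a message for each saturation signal at or above its threshold."""
    alerts = []
    utilization = s.active_threads / s.max_threads
    queue_fill = s.queue_depth / s.queue_limit
    if utilization >= util_threshold:
        alerts.append(f"thread pool utilization {utilization:.0%}")
    if queue_fill >= queue_threshold:
        alerts.append(f"inbound queue {queue_fill:.0%} full")
    return alerts


# Values observed during this incident: 200/200 threads busy, queue at 4200 of 5000.
incident = PoolSnapshot(active_threads=200, max_threads=200,
                        queue_depth=4200, queue_limit=5000)
print(saturation_alerts(incident))
```

Against the incident's numbers both signals fire, which suggests an alert on either one would have flagged the saturation directly rather than through the resulting errors.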

Follow-ups

  • [P0] Add circuit breaker to Stripe client with appropriate thresholds (checkout-svc team)
  • [P0] Implement bulkhead isolation for Stripe calls (dedicated thread pool) (checkout-svc team)
  • [P1] Reduce Stripe call timeout from 30s to 5s (checkout-svc team)
  • [P1] Add monitoring alerts for thread pool utilization and queue depth (platform team)
  • [P1] Review all external dependency configurations for resilience patterns (circuit breaker, bulkhead, timeouts) (platform team)
  • [P2] Create runbook for Stripe dependency failure with mitigation steps (on-call SRE)
  • [P2] Evaluate multi-region Stripe configuration for failover (checkout-svc team)