
[Eval][v3][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc
created: 6/2/2026, 11:58:33 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and HTTP 504 errors. Approximately 3000 checkouts failed in 5 minutes, with an estimated $180k GMV loss. The incident is ongoing as of 18:43 UTC.

Severity reasoning: User-facing outage: error rate > 1% (78% failure rate) for >5 minutes, revenue path broken (checkout failures causing $180k GMV loss). Matches SEV1 criteria.


Root cause hypotheses

  • [high] Stripe Connect API latency spike causing thread pool exhaustion in checkout-svc

    Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1; thread pool utilization at 100% with all 200 workers blocked.

  • [high] No circuit breaker on Stripe client leads to cascading failures

    Evidence: Configuration review shows no circuit breaker or bulkhead for Stripe calls; all threads block waiting for Stripe response, causing queue buildup.

  • [low] Inbound request surge exacerbates thread pool exhaustion

    Evidence: Inbound queue depth at 4200 (limit 5000), but success rate drop correlates with Stripe latency increase; no evidence of abnormal traffic spike before incident.

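  • [medium] Stripe call timeout (30s) is too high, so blocked threads are held for too long

    Evidence: The 30s Stripe timeout equals the inbound gateway timeout, so a slow Stripe call holds its worker thread until the inbound request has already timed out. A shorter timeout would free threads sooner, but it would not address the Stripe latency itself.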
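
The timeout hypothesis can be checked with rough arithmetic. The sketch below is an illustration, not a measurement: it treats the Stripe p99 as the time each request holds a worker thread (a pessimistic assumption, since p99 is a tail figure) and infers an arrival rate from the reported failure counts.

```python
# Back-of-the-envelope capacity check. The thread count, latencies and failure
# figures come from the incident context; treating p99 as the per-request hold
# time is a simplifying (pessimistic) assumption.

WORKER_THREADS = 200


def max_throughput(threads: int, hold_seconds: float) -> float:
    """Upper bound on requests/second when each request pins one thread."""
    return threads / hold_seconds


healthy_rps = max_throughput(WORKER_THREADS, 0.8)    # Stripe p99 before the incident
degraded_rps = max_throughput(WORKER_THREADS, 27.0)  # Stripe p99 during the incident
capped_rps = max_throughput(WORKER_THREADS, 5.0)     # with the proposed 5s timeout

# Rough arrival rate implied by ~3000 failures in 5 minutes at a 78% failure rate.
arrival_rps = 3000 / 0.78 / 300

if __name__ == "__main__":
    rows = [
        ("healthy capacity", healthy_rps),
        ("degraded capacity", degraded_rps),
        ("capacity with 5s timeout", capped_rps),
        ("implied arrival rate", arrival_rps),
    ]
    for label, rps in rows:
        print(f"{label:>26}: {rps:6.1f} req/s")
```

Under these assumptions the degraded capacity (about 7.4 req/s) falls below the implied arrival rate (about 12.8 req/s), which is consistent with the growing queue. A 5s timeout would lift the bound to 40 req/s, although the Stripe calls that time out would still fail.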

Investigation checklist

  1. Confirm Stripe latency via APM and status page
    curl -s https://status.stripe.com/api/v1/incidents | jq '.incidents[] | select(.status != "resolved")'

    Expected: Active incident for Connect endpoints in us-east-1 with elevated latency

  2. Check thread pool utilization and queue depth in checkout-svc
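    kubectl exec -n prod deploy/checkout-svc -- curl -s -H 'Accept: text/plain' localhost:8080/actuator/threaddump | grep 'java.lang.Thread.State' | sort | uniq -c

    Expected: Close to 200 worker threads parked in the Stripe call. Threads waiting on a socket read usually report RUNNABLE or TIMED_WAITING rather than BLOCKED, so confirm with the stack frames rather than the state alone. (The text/plain dump assumes Spring Boot 2.2 or later; the default response is single-line JSON, which a line-based grep cannot count.)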

  3. Verify downstream dependencies (auth-svc, fraud-svc) are healthy
    for svc in auth-svc fraud-svc; do kubectl logs -n prod -l app=$svc --since=10m --tail=-1 | grep -iE 'error|timeout' | tail -5; done

    Expected: No errors or timeouts

  4. Check Stripe API response times from checkout-svc logs
    kubectl logs -n prod -l app=checkout-svc --since=15m --tail=-1 | grep -oP 'stripe_request_duration_ms=\K\d+' | sort -n | tail -5

    Expected: Values near 27000 (27s) indicating high latency

  5. Review circuit breaker and bulkhead configuration for Stripe client
    kubectl exec -n prod deploy/checkout-svc -- cat /app/config/application.yml | grep -A10 'stripe'

    Expected: No circuit breaker or bulkhead settings; only timeout=30s
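
Step 4 only shows the five slowest calls. If a percentile is wanted instead, the same log field can be summarised with a short script. This is a minimal sketch: the `stripe_request_duration_ms=` field name is taken from the command above, while the sample log lines and helper names are hypothetical.

```python
import math
import re

# Field name taken from the grep in step 4; the surrounding log format is unknown.
DURATION_RE = re.compile(r"stripe_request_duration_ms=(\d+)")


def durations_ms(lines):
    """Extract every Stripe request duration (in ms) found in the log lines."""
    return [int(m.group(1)) for line in lines for m in DURATION_RE.finditer(line)]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "18:39:01 INFO stripe_request_duration_ms=800 status=200",
        "18:40:12 WARN stripe_request_duration_ms=26500 status=200",
        "18:40:15 WARN stripe_request_duration_ms=27000 status=504",
        "18:39:30 INFO stripe_request_duration_ms=900 status=200",
    ]
    samples = durations_ms(sample)
    print("p50:", percentile(samples, 50), "ms")
    print("p99:", percentile(samples, 99), "ms")
```

In practice the input would be the piped `kubectl logs` output (for example via `sys.stdin`) rather than a hard-coded list.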

Mitigation plan

  • Enable a circuit breaker on the Stripe client, with a fallback to an error page or a "retry later" prompt

    Risk: May cause immediate failures for all Stripe-dependent checkouts if circuit opens; but prevents thread exhaustion.

    Rollback: Disable circuit breaker via feature flag or config reload.

  • Reduce Stripe call timeout from 30s to 5s to free threads faster

    Risk: While Stripe p99 latency stays near 27s, many Stripe calls will likely exceed 5s and fail. Those checkouts still fail, but each one releases its thread sooner.

    Rollback: Revert timeout to 30s via config update.

  • Implement bulkhead isolation for Stripe calls (dedicated thread pool of 20 threads)

    Risk: Requires code change and redeploy; not immediate. Could be done as hotfix if deployment pipeline allows.

    Rollback: Revert to shared thread pool via config.

  • If Stripe outage persists, disable Stripe payment path and show 'payment unavailable' message

    Risk: Lost revenue from Stripe payments, but preserves checkout flow for other payment methods.

    Rollback: Re-enable Stripe path when Stripe recovers.
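
For readers unfamiliar with the first mitigation, the circuit breaker idea can be sketched in a few lines. This is an illustrative, single-threaded model and not checkout-svc code: the class name, thresholds and states are assumptions, and a production service would more likely use an existing resilience library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, probes after a cooldown.

    Simplifications (assumptions of this sketch): not thread-safe, and every
    call is allowed through while half-open rather than a single probe.
    """

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            # Fail fast instead of tying up a worker thread on a slow dependency.
            raise CircuitOpenError("dependency circuit is open")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed half-open probe.
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap the Stripe request in `breaker.call(...)`, catch `CircuitOpenError`, and return the fallback response. For the breaker to trip quickly, slow calls have to surface as failures, which is why it pairs naturally with the shorter timeout.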

Customer impact

Approximately 78% of checkout attempts are failing with a 504 error. Customers see a timeout or error page when trying to complete purchases. Estimated 3000 failed checkouts in the last 5 minutes, with $180k in lost GMV. No ETA for resolution yet.

Postmortem draft

Summary

Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and HTTP 504 errors.

Timeline (UTC)

  • 18:35: Stripe status page reports elevated latency for Connect endpoints in us-east-1.
  • 18:38: Pager alerts for checkout-svc high error rate.
  • 18:41: Confirmed Stripe as culprit via APM and status page.
  • 18:43: Mitigation options being evaluated.

Impact

  • 3000 failed checkouts in 5 minutes.
  • $180k estimated lost GMV.
  • p99 latency 28s, success rate 22%.

Root Cause

A Stripe Connect API latency spike (p99 from 800ms to 27s), combined with the lack of a circuit breaker or bulkhead in checkout-svc, caused all 200 worker threads to block on Stripe calls and the inbound queue to fill.
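
To make the missing bulkhead concrete, the sketch below isolates dependency calls in their own small pool and rejects work once that pool is full, so a slow dependency cannot consume the shared workers. The class and error names are hypothetical, and the default of 20 mirrors the pool size proposed in the mitigation plan.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class BulkheadFullError(RuntimeError):
    """Raised when all bulkhead slots are in use."""


class Bulkhead:
    """Runs calls on a dedicated pool and rejects work beyond max_concurrent."""

    def __init__(self, max_concurrent=20, name="stripe"):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(
            max_workers=max_concurrent, thread_name_prefix=name
        )

    def submit(self, fn, *args, **kwargs) -> Future:
        # Non-blocking acquire: when the pool is saturated, fail fast rather
        # than queueing behind calls that may take tens of seconds.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead is full")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the call finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

A request handler would call `bulkhead.submit(...)` and then `future.result(timeout=...)`, mapping both `BulkheadFullError` and a timeout to the fallback response. Note that a timed-out call keeps its slot until the underlying request actually returns, so the client-side timeout still matters.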

Detection

Pager alert based on error rate threshold. Stripe status page confirmed external issue.

Response

  • 18:38: On-call acknowledged.
  • 18:41: Identified Stripe as cause.
  • [FILL IN] Mitigation actions taken.

What Went Well

  • Quick identification of external dependency via APM and status page.
  • Clear communication of impact.

What Went Poorly

  • No circuit breaker or bulkhead for critical dependency.
  • Timeout too long (30s) exacerbating thread exhaustion.

Action Items

  • [FILL IN] Implement circuit breaker for Stripe client.
  • [FILL IN] Add bulkhead isolation for Stripe calls.
  • [FILL IN] Reduce Stripe timeout to 5s.
  • [FILL IN] Add monitoring for thread pool utilization and queue depth.

Follow-ups

  • [P0] Implement circuit breaker for Stripe client with fallback (payments-platform)
  • [P0] Add bulkhead isolation for Stripe calls (dedicated thread pool) (payments-platform)
  • [P1] Reduce Stripe call timeout from 30s to 5s (on-call SRE)
  • [P1] Add monitoring alerts for thread pool utilization >80% and queue depth >80% (platform team)
  • [P2] Review all external dependency configurations for circuit breakers and bulkheads (service owner)
  • [P2] Update runbook for Stripe dependency failure with mitigation steps (on-call SRE)
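
The P1 monitoring item can be illustrated with a small check. This is a sketch of the threshold logic only: the snapshot type, field names and alert names are hypothetical, and a real deployment would express the same rule in the alerting system rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSnapshot:
    busy_threads: int
    max_threads: int
    queue_depth: int
    queue_limit: int


def saturation_alerts(snapshot: PoolSnapshot, threshold: float = 0.80) -> list:
    """Return the names of saturation alerts that should fire for this snapshot."""
    alerts = []
    if snapshot.busy_threads / snapshot.max_threads > threshold:
        alerts.append("thread_pool_utilization")
    if snapshot.queue_depth / snapshot.queue_limit > threshold:
        alerts.append("queue_depth")
    return alerts


if __name__ == "__main__":
    # Values reported during this incident: 200/200 threads busy, queue at 4200/5000.
    incident = PoolSnapshot(busy_threads=200, max_threads=200,
                            queue_depth=4200, queue_limit=5000)
    print(saturation_alerts(incident))
```

With the figures from this incident (100% thread utilization, queue at 84% of its limit), both conditions exceed the 80% threshold.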