[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc
created: 6/2/2026, 11:57:12 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and 504 timeouts. ~3000 failed checkouts in 5 minutes with estimated $180k GMV loss. Incident is ongoing.

Severity reasoning: User-facing outage on the revenue path: checkout success rate fell to 22% (from 99.5%) for more than 5 minutes because Stripe payment processing is stalling, and the elevated Stripe Connect latency in us-east-1 affects all users. Meets SEV1 criteria.

deepseek-chat · prompt v2 · output: en · 10517ms · 1751↑ / 1669↓ tok · $0.00231

Root cause hypotheses

  • high · Stripe Connect API latency spike due to upstream issue in us-east-1

    Evidence: Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1' (18:35 UTC). APM shows p99 latency for /v1/payment_intents jumped from 800ms to 27s.

  • high · No circuit breaker on the Stripe client, so slow Stripe calls exhaust the shared thread pool and the failure cascades to all checkouts

    Evidence: Thread pool utilization at 100% (200 workers blocked), inbound queue depth 4200. Stripe calls share main worker pool with no bulkhead.

  • medium · Stripe timeout (30s) matches inbound gateway timeout, preventing fast failure

    Evidence: p99 latency 28s, hitting 30s timeout. No shorter timeout or circuit breaker to fail fast.

  • low · Recent deployment or config change to checkout-svc increased Stripe call volume

    Evidence: No direct evidence in context; low likelihood as Stripe status page indicates upstream issue.
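
The thread pool exhaustion in the second hypothesis can be sanity-checked with Little's law (threads in use ≈ arrival rate × time each request holds a thread). The sketch below is a rough estimate only: it uses the figures from the raw incident context, back-computes the arrival rate from ~3000 failures at a 78% failure rate over 5 minutes, and assumes the p99 Stripe latency as the per-request hold time, which overstates the typical request.

```python
# Back-of-the-envelope capacity check using Little's law: L = lambda * W.
# Figures come from the incident context. Using p99 as the hold time W is a
# simplifying assumption (a pessimistic bound, not the mean).

WORKER_THREADS = 200      # size of the shared worker pool
FAILED_CHECKOUTS = 3000   # approximate failures observed
FAILURE_RATE = 0.78       # share of attempts returning 504
WINDOW_SECONDS = 5 * 60   # observation window


def arrival_rate_per_second() -> float:
    """Estimated checkout attempts per second during the incident."""
    total_attempts = FAILED_CHECKOUTS / FAILURE_RATE
    return total_attempts / WINDOW_SECONDS


def threads_needed(hold_seconds: float) -> float:
    """Threads busy at steady state if each request holds one for hold_seconds."""
    return arrival_rate_per_second() * hold_seconds


def max_throughput(hold_seconds: float) -> float:
    """Requests per second the pool can sustain at the given hold time."""
    return WORKER_THREADS / hold_seconds


if __name__ == "__main__":
    for label, hold in (("normal (0.8s)", 0.8), ("degraded (27s)", 27.0)):
        print(f"{label}: need {threads_needed(hold):.0f} threads, "
              f"pool sustains {max_throughput(hold):.1f} req/s")
```

Under these assumptions the pool needs roughly 346 threads at 27s per Stripe call against 200 available, while at 800ms about 10 threads suffice. That is consistent with the observed 100% utilization and the growing inbound queue.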

Investigation checklist

  1. Check Stripe status page for updates
    curl -s https://status.stripe.com/ | grep -i 'connect'

    Expected: Status should show 'Investigating' or 'Resolved' for Connect endpoints.

  2. Verify Stripe latency from checkout-svc pods
    kubectl exec -n prod deploy/checkout-svc -- curl -o /dev/null -s -w '%{time_total}\n' https://api.stripe.com/v1/payment_intents -H 'Authorization: Bearer <redacted>'

    Expected: Response time should be <1s normally; if >10s, confirms Stripe latency.

  3. Check thread pool and queue depth in checkout-svc
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threadpool | jq '.threadPool.utilization, .queueDepth'

    Expected: Utilization should be <80%; queue depth <1000. Currently 100% and 4200.

  4. Check for recent deployments to checkout-svc
    kubectl rollout history deployment -n prod checkout-svc

    Expected: No recent changes; if recent, check diff.

  5. Check downstream dependencies (auth-svc, fraud-svc) latency
    kubectl exec -n prod deploy/checkout-svc -- curl -o /dev/null -s -w '%{time_total}\n' http://auth-svc:8080/health

    Expected: Response time <100ms. Repeat the check against fraud-svc; if either is elevated, this indicates a broader issue.

Mitigation plan

  • Enable circuit breaker on Stripe client with fallback to fail fast (return 503 immediately) to protect thread pool

    Risk: All Stripe payments will fail fast, but thread pool will recover. No data loss.

    Rollback: Disable circuit breaker via feature flag or config reload.

  • Reduce Stripe timeout from 30s to 5s to fail fast and free threads

    Risk: Payments that Stripe would have completed in 5-30s will now time out, so some legitimate checkouts fail. Safer than current state.

    Rollback: Revert timeout to 30s via config push.

  • Implement bulkhead for Stripe calls (dedicated thread pool of 20 threads) to isolate from main pool

    Risk: Requires code change; cannot be done quickly. Not recommended for immediate mitigation.

    Rollback: Revert to shared pool via config.
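
The fail-fast behaviour described in the first mitigation can be sketched as a small consecutive-failure circuit breaker. This is an illustration in Python, not the checkout-svc implementation: the class name, the thresholds, and the `charge_with_fallback` helper are all hypothetical, the sketch is not thread-safe (a production version would guard its state with a lock or use an established resilience library), and `create_payment_intent` stands in for whatever function performs the Stripe call.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Minimal consecutive-failure breaker (hypothetical; not thread-safe)."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: reject immediately so no worker thread is tied up.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed (half-open): let a trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def charge_with_fallback(breaker: CircuitBreaker,
                         create_payment_intent: Callable[[], T]) -> Tuple[int, Optional[T]]:
    """Map breaker outcomes to HTTP statuses; 503 is the fail-fast fallback."""
    try:
        return 200, breaker.call(create_payment_intent)
    except CircuitOpenError:
        return 503, None
    except Exception:
        return 502, None
```

While the circuit is open, rejected requests return in microseconds instead of holding a worker thread for up to 30s, which is what lets the shared pool drain.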

Customer impact

Approximately 78% of checkout attempts are failing with HTTP 504 errors. ~3000 failed checkouts in 5 minutes, estimated $180k lost GMV. Customers cannot complete purchases. No ETA for resolution; dependent on Stripe recovery.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 18:35: Stripe status page reports elevated latency for Connect endpoints in us-east-1
  • 18:38: Pager alert for checkout-svc high error rate
  • 18:41: Confirmed Stripe is the culprit via status page and APM
  • 18:43: Debating mitigation options

Impact

  • Checkout success rate dropped from 99.5% to 22%
  • p99 latency 28s
  • ~3000 failed checkouts, $180k GMV loss

Root Cause

Stripe Connect API latency spike caused thread pool exhaustion in checkout-svc due to lack of circuit breaker and bulkhead.

Detection

Pager alert for error rate threshold breach.

Response

[FILL IN]

What Went Well

  • Quick identification of Stripe as culprit via status page
  • APM data clearly showed downstream latency

What Went Poorly

  • No circuit breaker on Stripe client
  • No bulkhead isolation
  • Stripe timeout too long (30s, equal to the inbound gateway timeout)

Action Items

  • [ ] Add circuit breaker to Stripe client
  • [ ] Implement bulkhead for Stripe calls
  • [ ] Reduce Stripe timeout to 5s
  • [ ] Add alert on thread pool utilization
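
The bulkhead and shorter-timeout action items could look roughly like the Python sketch below. It assumes a dedicated pool of 20 threads and a 5s wait, the values proposed in the mitigation plan; the `Bulkhead` class and `charge` helper are hypothetical names, and `create_payment_intent` stands in for the real Stripe call. Note that the timeout here only bounds how long the caller waits. It does not cancel the in-flight call, so the underlying HTTP client would still need its own timeout.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when every bulkhead slot is busy; the caller should fail fast."""


class Bulkhead:
    """Runs calls on a dedicated pool so they cannot starve the main workers."""

    def __init__(self, max_concurrent: int = 20, timeout_s: float = 5.0) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent,
                                        thread_name_prefix="stripe")
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._timeout_s = timeout_s

    def call(self, fn: Callable[[], T]) -> T:
        # Reject immediately instead of queueing when all slots are taken.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead full")

        def run() -> T:
            try:
                return fn()
            finally:
                # Free the slot only when the work really ends, even if the
                # caller stopped waiting earlier.
                self._slots.release()

        try:
            future = self._pool.submit(run)
        except BaseException:
            self._slots.release()
            raise
        # Raises concurrent.futures.TimeoutError after timeout_s.
        return future.result(timeout=self._timeout_s)


def charge(bulkhead: Bulkhead,
           create_payment_intent: Callable[[], T]) -> Tuple[int, Optional[T]]:
    """Return 503 quickly when the bulkhead is saturated or the call is slow."""
    try:
        return 200, bulkhead.call(create_payment_intent)
    except (BulkheadFullError, FutureTimeout):
        return 503, None
```

With this shape, a Stripe slowdown can occupy at most 20 threads, and each main worker waits at most 5s before returning, rather than all 200 workers blocking for up to 30s.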

Follow-ups

  • P0 · Add circuit breaker to Stripe HTTP client with fallback · checkout-svc service owner
  • P0 · Implement bulkhead (dedicated thread pool) for Stripe calls · checkout-svc service owner
  • P1 · Reduce Stripe API timeout from 30s to 5s · checkout-svc service owner
  • P1 · Add alert on thread pool utilization >80% · platform team
  • P2 · Review and improve incident response runbook for Stripe dependencies · on-call SRE
  • P2 · Evaluate adding Stripe status page as a monitoring source · platform team
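
For the thread pool utilization alert, one possible rule is to fire only when utilization stays above 80% for several consecutive samples, which avoids paging on a single spike. The Python sketch below illustrates that logic; the function name and the three-sample window are assumptions, and in practice the rule would live in whatever alerting system the platform team uses.

```python
from typing import Iterable


def should_alert(utilization_samples: Iterable[float],
                 threshold: float = 0.80,
                 sustained: int = 3) -> bool:
    """True if utilization exceeded threshold for `sustained` samples in a row."""
    streak = 0
    for utilization in utilization_samples:
        # A sample at or below the threshold resets the streak.
        streak = streak + 1 if utilization > threshold else 0
        if streak >= sustained:
            return True
    return False
```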