
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc
created: 6/2/2026, 11:37:58 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

Checkout success rate dropped to 22% due to Stripe Connect API latency spike (p99 27s) exhausting the shared worker thread pool, causing HTTP 504 timeouts for 78% of requests. Estimated $180k GMV lost in 5 minutes.

Severity reasoning: User-facing outage with revenue impact (an estimated $180k GMV lost in 5 minutes), affecting 78% of checkout attempts. Severe degradation of a critical payment path.

deepseek-chat·prompt v1·output: en·11813ms·1370↑ / 1823↓ tok·$0.00238

Root cause hypotheses

  • high · Stripe Connect API latency spike in us-east-1 causing downstream timeouts

    Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC.

  • high · Thread pool exhaustion due to long-running Stripe calls blocking all worker threads

    Evidence: Thread pool utilization at 100% (200 threads blocked), inbound queue depth 4200 (limit 5000). No circuit breaker or bulkhead on Stripe client.

  • medium · Stripe timeout (30s) matching inbound gateway timeout causing cascading failures

    Evidence: Stripe call timeout set to 30s, same as inbound gateway timeout. No timeout differentiation or retry budget.

  • low · Recent deployment or config change to checkout-svc increased Stripe call volume or changed behavior

    Evidence: None in the incident context; deployment history and recent config changes should be checked to rule this out.
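
The thread pool exhaustion hypothesis can be sanity-checked with Little's law (throughput = concurrency / latency). The sketch below is a rough bound, not a measurement: it assumes each checkout holds one worker thread for the duration of one Stripe call, and it uses the p99 figures from the incident context as if every call took that long, which overstates the typical case.

```python
# Rough capacity bound via Little's law: throughput = concurrency / latency.
# Assumption: one Stripe call per checkout, and the call dominates request time.
WORKER_THREADS = 200        # worker threads, from the incident context
NORMAL_LATENCY_S = 0.8      # Stripe p99 before the incident
DEGRADED_LATENCY_S = 27.0   # Stripe p99 during the incident


def max_throughput(threads: int, latency_s: float) -> float:
    """Upper bound on requests/second when each request holds a thread for latency_s."""
    return threads / latency_s


normal_rps = max_throughput(WORKER_THREADS, NORMAL_LATENCY_S)
degraded_rps = max_throughput(WORKER_THREADS, DEGRADED_LATENCY_S)

print(f"normal:   {normal_rps:.1f} req/s")
print(f"degraded: {degraded_rps:.1f} req/s")
print(f"capacity retained: {degraded_rps / normal_rps:.1%}")
```

Under these assumptions the pool can sustain only a few percent of its normal throughput, which is consistent with the inbound queue filling while every worker thread sits blocked.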

Investigation checklist

  1. Confirm Stripe status page and APM data for ongoing latency
    curl -s https://status.stripe.com | grep -i 'connect' ; check APM dashboard for Stripe p99 latency over last 10 minutes

    Expected: Stripe status shows 'Investigating' or 'Resolved'; APM shows Stripe p99 > 20s or returning to normal

  2. Check thread pool utilization and queue depth in checkout-svc
    kubectl exec deploy/checkout-svc -- curl -s localhost:8080/actuator/metrics/tomcat.threads.busy | jq '.measurements[0].value' ; compare with 'tomcat.threads.config.max' from the same metrics endpoint (assumes the Actuator metrics endpoint is exposed)

    Expected: Busy threads = 200 (max); inbound queue depth ~4200 and approaching the 5000 limit

  3. Verify no other downstream dependencies are degraded
    kubectl exec deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.components' ; check APM for auth-svc and fraud-svc p99 latency

    Expected: auth-svc and fraud-svc p99 at their normal baselines, health checks passing

  4. Check recent deployments or config changes to checkout-svc
    kubectl rollout history deploy/checkout-svc ; kubectl get configmap checkout-svc -o yaml | grep -i stripe

    Expected: No recent changes; Stripe timeout still 30s, no circuit breaker config

  5. Check if Stripe calls are idempotent and can be retried safely
    grep -r 'idempotency' /etc/checkout-svc/config ; check Stripe API docs for idempotency keys

    Expected: Idempotency keys are used; retries are safe
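
To illustrate why step 5 matters: a retry is only safe if every attempt carries the same idempotency key, so the payment provider can deduplicate them. The sketch below is a minimal illustration, not the checkout-svc implementation. The `send` callable, the attempt count, and the delays are hypothetical; `Idempotency-Key` is the request header Stripe uses for idempotent requests.

```python
import random
import time
import uuid
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def call_with_retries(
    send: Callable[[Dict[str, str]], T],   # hypothetical stand-in for the Stripe client call
    max_attempts: int = 3,                 # assumption: small, bounded retry budget
    base_delay_s: float = 0.2,             # assumption: illustrative backoff base
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry send() on timeout, reusing one idempotency key across all attempts."""
    headers = {"Idempotency-Key": str(uuid.uuid4())}  # generated once, not per attempt
    for attempt in range(1, max_attempts + 1):
        try:
            return send(headers)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure to the caller
            # Exponential backoff with full jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Retries add load to a dependency that is already slow, so during an incident like this one the attempt count should stay low and sit behind whatever circuit breaker is in place.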

Mitigation plan

  • Enable circuit breaker on Stripe client with timeout 5s, failure threshold 50% in 60s window, half-open after 30s

    Risk: May cause partial failures if circuit opens prematurely; but prevents thread pool exhaustion

    Rollback: Disable circuit breaker by reverting config change or setting threshold to 100%

  • Add bulkhead for Stripe calls: limit concurrent Stripe calls to 20 threads, queue 50

    Risk: May throttle legitimate traffic if bulkhead too small; monitor queue drops

    Rollback: Increase bulkhead size or remove bulkhead config

  • Reduce Stripe call timeout from 30s to 5s to fail fast and free threads

    Risk: May cause more failures if Stripe latency is transient; but prevents thread exhaustion

    Rollback: Revert timeout to 30s

  • If Stripe outage persists >10 min, disable Stripe payment path and fall back to 'payment method unavailable' message

    Risk: Lost revenue from Stripe payments; but prevents further failures

    Rollback: Re-enable Stripe path after confirming Stripe recovery
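
The circuit breaker in the first mitigation item can be sketched as follows, using the thresholds proposed there (50% failures in a 60s window, half-open after 30s). This is a single-threaded illustration of the state machine only; a production service would more likely use an existing resilience library. The class name, the `min_calls` guard, and the injectable `clock` are assumptions added for the example.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_ratio: float = 0.5, window_s: float = 60.0,
                 open_s: float = 30.0, min_calls: int = 10,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_ratio = failure_ratio
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls  # assumption: do not trip on a tiny sample
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.results: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_s:
                raise CircuitOpenError("dependency circuit is open")
            self.state = "half_open"  # cool-down elapsed: allow a probe call
        try:
            result = fn()
        except Exception:
            self._record(False)
            raise
        self._record(True)
        return result

    def _record(self, ok: bool) -> None:
        now = self.clock()
        if self.state == "half_open":
            # The probe decides: success closes the circuit, failure reopens it.
            if ok:
                self.state = "closed"
                self.results.clear()
            else:
                self.state = "open"
                self.opened_at = now
            return
        self.results.append((now, ok))
        while self.results and now - self.results[0][0] > self.window_s:
            self.results.popleft()  # drop results older than the window
        failures = sum(1 for _, succeeded in self.results if not succeeded)
        if (len(self.results) >= self.min_calls
                and failures / len(self.results) >= self.failure_ratio):
            self.state = "open"
            self.opened_at = now
```

While the circuit is open, callers fail in microseconds instead of holding a worker thread for the full timeout, which is the property the mitigation relies on. This sketch does not limit the half-open state to a single concurrent probe; a thread-safe version would need a lock around the state transitions.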

Customer impact

Approximately 78% of checkout attempts are failing with HTTP 504 errors. Users see a timeout or error page after waiting up to 30 seconds. Estimated 3000 failed checkouts in 5 minutes, with $180k lost GMV. No ETA for full recovery; Stripe is investigating.

Postmortem draft

Postmortem: checkout-svc Outage ([Date])

Summary

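A latency spike in the Stripe Connect API (p99 from 800ms to 27s) exhausted the shared worker thread pool in checkout-svc. Checkout success rate dropped from 99.5% to 22%, with ~78% of attempts returning HTTP 504.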

Timeline

  • 18:35 UTC: Stripe status page reports elevated latency for Connect endpoints in us-east-1.
  • 18:38 UTC: On-call paged.
  • 18:40 UTC: checkout-svc success rate drops to 22%, p99 latency 28s.
  • 18:41 UTC: On-call confirms Stripe is the culprit (status page + APM).
  • 18:43 UTC: On-call debating whether to wait it out or disable the Stripe path entirely.
  • [Time] UTC: Recovery confirmed.

Impact

  • ~3000 failed checkouts in 5 minutes; an estimated $180k lost GMV so far.
  • Users received HTTP 504 errors on ~78% of checkout attempts.

Root Cause

Stripe Connect API latency spike caused thread pool exhaustion in checkout-svc due to lack of circuit breaker and bulkhead.
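
For illustration, the missing bulkhead could look roughly like the sketch below, which caps how many callers may be inside the Stripe call at once so that a slow dependency cannot occupy every worker thread. The limits of 20 concurrent calls and 50 waiting come from the mitigation plan; the class name and the `max_wait_s` parameter are assumptions made for this example.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when the call is shed instead of being allowed to wait."""


class Bulkhead:
    def __init__(self, max_concurrent: int = 20, max_waiting: int = 50,
                 max_wait_s: float = 1.0) -> None:
        # Slots for calls that are actually executing against the dependency.
        self._slots = threading.Semaphore(max_concurrent)
        # Total admitted callers: executing plus waiting for a slot.
        self._admitted = threading.Semaphore(max_concurrent + max_waiting)
        self._max_wait_s = max_wait_s  # assumption: bound time spent queued

    def run(self, fn: Callable[[], T]) -> T:
        # Shed load immediately if both the slots and the wait queue are full.
        if not self._admitted.acquire(blocking=False):
            raise BulkheadFullError("bulkhead and its queue are full")
        try:
            if not self._slots.acquire(timeout=self._max_wait_s):
                raise BulkheadFullError("timed out waiting for a slot")
            try:
                return fn()
            finally:
                self._slots.release()
        finally:
            self._admitted.release()
```

With a bulkhead like this, at most 70 of the 200 worker threads could be tied up in or waiting on Stripe at any moment; the rest stay available for requests that do not touch Stripe or can fail fast.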

What Went Well

  • Quick detection via APM and Stripe status page.
  • On-call identified root cause within 3 minutes.

What Went Poorly

  • No circuit breaker or bulkhead on Stripe client.
  • Stripe timeout (30s) matched inbound timeout, causing cascading failures.

Action Items

  • [ ] Add circuit breaker to Stripe client (P0)
  • [ ] Add bulkhead for Stripe calls (P0)
  • [ ] Reduce Stripe timeout to 5s (P1)
  • [ ] Implement fallback payment path (P1)
  • [ ] Review all downstream dependencies for similar issues (P2)

Follow-ups

  • P0 · Add circuit breaker to Stripe client with proper thresholds · service owner
  • P0 · Add bulkhead for Stripe calls to isolate thread pool · service owner
  • P1 · Reduce Stripe call timeout to 5s and implement retry with backoff · service owner
  • P1 · Implement fallback payment method when Stripe is degraded · platform team
  • P2 · Review all downstream dependencies for circuit breaker and bulkhead patterns · platform team
  • P1 · Add alerting on thread pool utilization >80% and queue depth >1000 · on-call SRE
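
The alerting follow-up can be expressed as a simple threshold check. The sketch below treats the two conditions independently (either one produces a reason), which is an assumption: the follow-up does not say whether both must hold. The function and constant names are hypothetical, and a real rule would live in the monitoring system rather than in application code.

```python
from typing import List

UTILIZATION_THRESHOLD = 0.80   # from the follow-up: thread pool utilization >80%
QUEUE_DEPTH_THRESHOLD = 1000   # from the follow-up: queue depth >1000


def alert_reasons(busy_threads: int, max_threads: int, queue_depth: int) -> List[str]:
    """Return a human-readable reason for each threshold that is exceeded."""
    reasons = []
    utilization = busy_threads / max_threads
    if utilization > UTILIZATION_THRESHOLD:
        reasons.append(
            f"thread pool utilization {utilization:.0%} > {UTILIZATION_THRESHOLD:.0%}"
        )
    if queue_depth > QUEUE_DEPTH_THRESHOLD:
        reasons.append(f"queue depth {queue_depth} > {QUEUE_DEPTH_THRESHOLD}")
    return reasons


# Values observed during this incident: both thresholds are exceeded.
print(alert_reasons(busy_threads=200, max_threads=200, queue_depth=4200))
```

Applied to the figures from this incident (200 of 200 threads busy, queue depth 4200), both conditions fire, so thresholds at these levels would have been crossed before the pool was fully saturated.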