
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc
created: 6/2/2026, 11:57:29 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

The checkout service is failing ~78% of requests because of a latency spike in the downstream Stripe Connect API (p99 27s vs. normal 800ms). All 200 worker threads are blocked waiting on Stripe responses, which exhausts the thread pool and produces 504 timeouts. Estimated $180k GMV lost in 5 minutes.

Severity reasoning: User-facing outage: success rate dropped from 99.5% to 22% (>1% error rate for >5 min). Revenue path broken (checkout failures). Stripe reports elevated latency on Connect endpoints in us-east-1, which is a degradation rather than a confirmed outage. Meets SEV1 criteria.

deepseek-chat·prompt v2·output: en·11633ms·1751↑ / 1796↓ tok·$0.00245

Root cause hypotheses

  • high: Stripe Connect API latency spike in us-east-1 due to internal Stripe issue

    Evidence: Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1' (18:35 UTC). APM shows p99 latency for /v1/payment_intents jumped from 800ms to 27s.

  • high: No circuit breaker on Stripe client causes cascading thread pool exhaustion

    Evidence: Thread pool utilization 100%, all 200 threads blocked. No circuit breaker or bulkhead configured for Stripe calls.

  • medium: Stripe call timeout (30s) matches inbound gateway timeout, preventing fast failure

    Evidence: Inbound timeout is 30s, Stripe timeout is 30s. Requests wait full timeout before returning 504, accumulating queue depth.

  • low: Queue limit (5000) too high, allowing excessive backlog before rejection

    Evidence: Inbound queue depth 4200, approaching limit 5000. No early rejection mechanism.
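
The hypotheses above can be checked with some rough capacity arithmetic. The sketch below uses the figures from the raw incident context; the model is an assumption (each request holds one worker thread for the full Stripe p99, and the queue is served in order), so the results are worst-case bounds rather than measurements.

```python
# Figures from the incident context. Treating p99 as the per-request hold
# time is a simplifying, worst-case assumption.
WORKER_THREADS = 200
STRIPE_NORMAL_SECONDS = 0.8      # normal Stripe p99
STRIPE_DEGRADED_SECONDS = 27.0   # Stripe p99 during the incident
QUEUE_DEPTH = 4200
GATEWAY_TIMEOUT_SECONDS = 30.0


def max_throughput(threads: int, hold_seconds: float) -> float:
    """Upper bound on requests/second when each request holds one thread."""
    return threads / hold_seconds


def wait_at_back_of_queue(queue_depth: int, throughput: float) -> float:
    """Seconds before the last queued request reaches a worker thread."""
    return queue_depth / throughput


normal = max_throughput(WORKER_THREADS, STRIPE_NORMAL_SECONDS)
degraded = max_throughput(WORKER_THREADS, STRIPE_DEGRADED_SECONDS)
wait = wait_at_back_of_queue(QUEUE_DEPTH, degraded)

print(f"normal capacity:   {normal:.0f} req/s")    # 250 req/s
print(f"degraded capacity: {degraded:.1f} req/s")  # 7.4 req/s
print(f"wait at back of queue: {wait:.0f}s vs {GATEWAY_TIMEOUT_SECONDS:.0f}s gateway timeout")  # 567s
```

Under these assumptions, a request at the back of a 4200-deep queue would wait far longer than the 30s gateway timeout, which is consistent with the low-confidence hypothesis that the queue accepts work it can no longer serve in time.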

Investigation checklist

  1. Check Stripe status page for updates
    curl -s https://status.stripe.com/api/v2/status.json | jq '.status.description'

    Expected: Should show 'Investigating elevated latency for Connect endpoints in us-east-1' or resolved status

  2. Verify Stripe Connect latency from checkout-svc pods
    kubectl exec -n prod deploy/checkout-svc -- curl -o /dev/null -s -w '%{time_total}\n' --max-time 5 https://api.stripe.com/v1/payment_intents -H 'Authorization: Bearer <redacted>'

    Expected: A value at or near the 5s cap (curl exits with code 28 on timeout) indicates an ongoing issue; <1s indicates recovery

  3. Check thread pool and queue depth in checkout-svc metrics
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'checkout_thread_pool_utilization|checkout_queue_depth'

    Expected: Thread pool utilization ~100%, queue depth >4000

  4. Check if circuit breaker is configured for Stripe client
    kubectl exec -n prod deploy/checkout-svc -- cat /app/config.yaml | grep -A5 'stripe'

    Expected: No circuit breaker or bulkhead configuration present

  5. Check other downstream services (auth-svc, fraud-svc) for latency
    kubectl logs -n prod -l app=checkout-svc --since=10m | grep -E 'auth-svc|fraud-svc' | awk '{print $NF}' | sort | uniq -c | sort -rn

    Expected: Latency normal (<100ms) for auth-svc and fraud-svc

Mitigation plan

  • Enable circuit breaker for Stripe client with failure threshold of 5 consecutive failures and half-open after 30s

    Risk: May temporarily reject valid requests while the circuit is open, but prevents thread pool exhaustion. Safer than disabling Stripe entirely.

    Rollback: Revert the config change to remove the circuit breaker

  • Reduce Stripe call timeout from 30s to 5s to fail fast and free threads

    Risk: Some legitimate slow requests may fail, but 5s is still well above the normal p99 (800ms). Reduces the time each thread is held.

    Rollback: Revert timeout to 30s

  • Implement bulkhead for Stripe calls: limit to 50 concurrent threads, reserve 150 for other deps

    Risk: May throttle Stripe calls once the limit is reached, but protects the main thread pool. Requires a code change or config reload.

    Rollback: Remove bulkhead limit or increase to 200
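
As an illustration of the first mitigation, here is a minimal circuit breaker sketch using the parameters named above (open after 5 consecutive failures, half-open after 30s). The class and method names are hypothetical and not taken from checkout-svc; the sketch is not thread-safe and lets every caller through while half-open, so a production service would more likely use an established resilience library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock  # injectable so the breaker can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast: the worker thread is released immediately.
            raise CircuitOpenError("stripe circuit open; failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed half-open probe, or too many consecutive failures,
            # (re)opens the circuit and restarts the reset timer.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap the Stripe request, for example `breaker.call(lambda: create_payment_intent(order))`, where `create_payment_intent` stands in for whatever client function the service actually uses, and map `CircuitOpenError` to an immediate error response instead of a 30s wait.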

Customer impact

Approximately 78% of checkout attempts are failing with HTTP 504 errors. Estimated 3000 failed checkouts in 5 minutes, resulting in ~$180k lost GMV. Customers see 'Something went wrong, please try again' after a long timeout (up to 30s). No data loss is expected; whether any timed-out attempts were nevertheless charged has not been confirmed and should be reconciled against Stripe.

Postmortem draft

Summary

[FILL IN: 2-3 sentence summary]

Timeline (UTC)

  • 18:35 - Stripe status page reports elevated latency for Connect endpoints in us-east-1
  • 18:38 - Pager received for checkout-svc high error rate
  • 18:41 - Confirmed Stripe is the culprit via status page and APM
  • 18:43 - Team debating mitigation options
  • [FILL IN: resolution time]

Impact

  • Checkout success rate dropped from 99.5% to 22%
  • p99 latency: 28s
  • ~3000 failed checkouts in 5 minutes
  • Estimated $180k GMV lost

Root Cause

Stripe Connect API latency spike in us-east-1 caused all checkout-svc worker threads to block waiting for responses, leading to thread pool exhaustion and 504 timeouts. Lack of circuit breaker and bulkhead allowed the failure to cascade.
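
To make the bulkhead idea concrete, the following is a minimal sketch that caps concurrent Stripe calls with a semaphore so a slow dependency cannot occupy every worker thread. The limit of 50 comes from the mitigation plan; the class and error names are hypothetical, and the sketch assumes a thread-per-request model like the one described in the incident context.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when the dependency's concurrency budget is exhausted."""


class Bulkhead:
    def __init__(self, max_concurrent: int = 50) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Non-blocking acquire: reject immediately rather than queue
        # another thread behind calls that are already slow.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("stripe bulkhead full; rejecting call")
        try:
            return fn()
        finally:
            self._slots.release()
```

With a limit of 50, at most 50 of the 200 worker threads can be blocked on Stripe at once; further Stripe calls fail immediately and the remaining threads stay available for other work.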

Detection

Pager triggered by error rate alert at 18:38. Stripe status page confirmed external cause.

Response

  • [FILL IN: actions taken]

What Went Well

  • Quick identification of Stripe as root cause via status page
  • APM data clearly showed downstream latency

What Went Poorly

  • No circuit breaker on Stripe client
  • No bulkhead isolation for downstream calls
  • Timeout configuration matched inbound timeout, preventing fast failure

Action Items

  • [FILL IN: action items from follow_ups]

Follow-ups

  • P0: Add circuit breaker to Stripe client with configurable thresholds (owner: service owner)
  • P0: Implement bulkhead isolation for all downstream dependencies (owner: platform team)
  • P1: Reduce Stripe call timeout to 5s and ensure it's less than inbound timeout (owner: service owner)
  • P1: Add alert on thread pool utilization >80% and queue depth >80% of limit (owner: on-call SRE)
  • P2: Review queue limit configuration; consider lower limit or early rejection (owner: service owner)
  • P2: Document Stripe dependency failure modes and run chaos experiment (owner: platform team)
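
The saturation alert in the follow-ups can be sketched as a simple threshold check. The 80% threshold comes from the follow-up item and the sample values from the incident context; the function name and message format are hypothetical, and a real alert would normally be defined in the monitoring system rather than in application code.

```python
from typing import List

ALERT_THRESHOLD = 0.80  # from the follow-up: alert above 80%


def saturation_alerts(busy_threads: int, total_threads: int,
                      queue_depth: int, queue_limit: int,
                      threshold: float = ALERT_THRESHOLD) -> List[str]:
    """Return one message per saturation signal above the threshold."""
    alerts = []
    pool_utilization = busy_threads / total_threads
    queue_fill = queue_depth / queue_limit
    if pool_utilization > threshold:
        alerts.append(f"thread pool utilization {pool_utilization:.0%}")
    if queue_fill > threshold:
        alerts.append(f"queue depth {queue_fill:.0%} of limit")
    return alerts


# Values observed during this incident: both signals exceed the threshold.
print(saturation_alerts(busy_threads=200, total_threads=200,
                        queue_depth=4200, queue_limit=5000))
# ['thread pool utilization 100%', 'queue depth 84% of limit']
```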