
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc
created: 6/2/2026, 11:56:00 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

Checkout success rate dropped to 22% because a Stripe Connect API latency spike (p99 up from 800ms to 27s) exhausted the shared worker thread pool. All 200 threads were blocked waiting on Stripe responses, causing HTTP 504s for ~78% of checkout attempts. Estimated $180k GMV lost so far (~3000 failed checkouts in 5 minutes).

Severity reasoning: User-facing outage with direct revenue impact (estimated $180k GMV lost so far). Checkout is failing for ~78% of attempts, which meets SEV1 criteria.


Root cause hypotheses

  • high: Stripe Connect API latency spike in us-east-1 causes thread pool exhaustion

    Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1; thread pool utilization at 100%

  • high: No circuit breaker on Stripe client, so requests keep waiting on a degraded Stripe instead of failing fast

    Evidence: Config review shows no circuit breaker; thread pool exhaustion consistent with all threads blocking on Stripe calls

  • medium: Stripe call timeout (30s) equals the inbound gateway timeout (30s), so a slow Stripe call can hold a worker thread for the entire life of the request

    Evidence: p99 latency 28s hits 30s timeout; inbound queue depth 4200/5000 indicates requests piling up

  • high: Shared thread pool without bulkhead allows Stripe latency to block all checkout requests

    Evidence: No bulkhead; Stripe calls share main worker pool; all 200 threads blocked
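
The hypotheses above can be sanity-checked with rough capacity arithmetic. The sketch below applies Little's law to the figures from the raw context (200 worker threads, Stripe p99 of 800ms before and 27s during the spike). It assumes each request holds a worker for roughly the duration of the Stripe call, and it uses p99 as a stand-in for typical hold time, so the results are order-of-magnitude estimates and not measurements.

```python
# Little's law sketch: a fixed pool can complete at most
# workers / hold_time requests per second.
# Assumption: a request holds its worker for about as long as the Stripe call takes.

def max_throughput(workers: int, hold_seconds: float) -> float:
    """Upper bound on requests/second that a fixed-size pool can complete."""
    return workers / hold_seconds

WORKERS = 200            # from the incident: 200 worker threads
NORMAL_HOLD_S = 0.8      # Stripe p99 before the spike
DEGRADED_HOLD_S = 27.0   # Stripe p99 during the spike

normal = max_throughput(WORKERS, NORMAL_HOLD_S)
degraded = max_throughput(WORKERS, DEGRADED_HOLD_S)
print(f"normal: {normal:.0f} req/s, degraded: {degraded:.1f} req/s")
```

Under these assumptions the pool's ceiling falls from about 250 req/s to about 7.4 req/s, which is consistent with all threads being busy and the inbound queue filling up.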

Investigation checklist

  1. Confirm Stripe latency via APM
    curl -s 'http://apm-dashboard:8080/api/v1/services/checkout-svc/transactions?filter=stripe' | jq '.latency.p99'

    Expected: Value > 20s, confirming Stripe is the bottleneck

  2. Check thread pool metrics
    kubectl exec deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'thread_pool_active|thread_pool_queue'

    Expected: active threads = 200, queue depth > 4000

  3. Verify Stripe status page
    curl -s 'https://status.stripe.com/api/v1/incidents?filter=us-east-1' | jq '.incidents[0].status'

    Expected: "investigating" or "identified" for Connect endpoints

  4. Check if circuit breaker exists in Stripe client config
    kubectl exec deploy/checkout-svc -- cat /app/config.yaml | grep -A5 'stripe'

    Expected: No circuit_breaker or bulkhead settings present

  5. Check inbound queue depth trend
    kubectl exec deploy/checkout-svc -- curl -s localhost:8080/metrics | grep 'inbound_queue_depth'

    Expected: Value near 5000 (queue limit), indicating backpressure

Mitigation plan

  • Enable circuit breaker on Stripe client with failure threshold 5, timeout 10s, half-open after 30s

    Risk: May temporarily reject valid payments if Stripe recovers slowly; no data loss

    Rollback: Revert config change and restart checkout-svc pods

  • Reduce Stripe call timeout from 30s to 5s to fail fast and free threads

    Risk: Some legitimate payments may fail if Stripe latency >5s; no data loss

    Rollback: Revert timeout to 30s and restart pods

  • If Stripe outage persists, disable Stripe payment path and fall back to 'payment failed' page

    Risk: All Stripe payments will fail; no data loss; users can retry later

    Rollback: Re-enable Stripe path via feature flag
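
The circuit breaker in the first mitigation could look roughly like the sketch below. It uses the failure threshold (5) and half-open delay (30s) from the plan. The class name, method names, and the injectable clock are hypothetical and not taken from checkout-svc. The per-call timeout is assumed to be enforced by the HTTP client and to surface as an exception. The sketch is not thread-safe, and in the half-open state it lets every caller through, where a production breaker would usually admit a single trial call.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast so the worker thread is released immediately.
            raise CircuitOpenError("Stripe circuit open; failing fast")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success (including a half-open trial) closes the circuit.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed half-open trial, or reaching the threshold, (re)opens the circuit.
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

A caller would wrap the Stripe request, for example `breaker.call(lambda: create_payment_intent(order))`, where `create_payment_intent` is a placeholder for the real client call, and would map `CircuitOpenError` to the 'payment failed' page.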

Customer impact

Approximately 78% of checkout attempts are failing with HTTP 504 errors. Most users cannot complete purchases with payment methods that go through Stripe. An estimated 3000 checkouts failed in 5 minutes, with an estimated $180k GMV lost so far. No ETA for resolution; Stripe is investigating.

Postmortem draft

Postmortem: Checkout Outage due to Stripe Latency

Summary:

  • Date: [Date]
  • Duration: [Start] - [End]
  • Severity: SEV1

Timeline:

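  • 18:35 UTC - Stripe reports elevated latency for Connect endpoints in us-east-1
  • 18:38 UTC - On-call paged
  • 18:40 UTC - checkout-svc success rate drops to 22% (HTTP 504 on ~78% of checkout attempts)
  • 18:41 UTC - On-call confirms Stripe is culprit (status page + APM)
  • 18:43 UTC - On-call debates waiting it out vs disabling the Stripe path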
  • [Mitigation time] - Circuit breaker enabled
  • [Recovery time] - Success rate returns to normal

Impact:

  • ~3000 failed checkouts in 5 minutes, estimated $180k GMV lost (so far)
  • p99 latency 28s against a 30s gateway timeout; all 200 worker threads blocked

Root Cause:

  • Stripe Connect API latency spike caused all worker threads to block, exhausting thread pool
  • No circuit breaker or bulkhead on Stripe client

What Went Well:

  • Quick detection via APM and Stripe status page
  • On-call identified root cause within 3 minutes of the page (18:38 to 18:41)

What Went Poorly:

  • No circuit breaker or bulkhead in place
  • Stripe timeout matched inbound timeout, causing cascading failures

Action Items:

  • [ ] Add circuit breaker to Stripe client (P0)
  • [ ] Add bulkhead for Stripe calls (P1)
  • [ ] Reduce Stripe timeout to 5s (P1)
  • [ ] Add alert on thread pool utilization >80% (P1)
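
The timeout action item follows from a simple budget rule: the Stripe call should get strictly less time than the inbound request has left, so the service can still answer before the gateway gives up. A minimal sketch, assuming the 30s inbound timeout and the proposed 5s cap from this report; the function name and the 1s response reserve are illustrative assumptions.

```python
def downstream_timeout(inbound_timeout_s: float, elapsed_s: float,
                       cap_s: float = 5.0, reserve_s: float = 1.0) -> float:
    """Seconds the downstream call may take, leaving reserve_s to build a response.

    Returns 0.0 when no budget is left, in which case the caller should skip
    the downstream call and fail fast.
    """
    remaining = inbound_timeout_s - elapsed_s - reserve_s
    return max(0.0, min(cap_s, remaining))


# Fresh request: the 5s cap applies.
print(downstream_timeout(30.0, elapsed_s=0.2))
# Request that already waited 27s in the queue: only 2s of budget remains.
print(downstream_timeout(30.0, elapsed_s=27.0))
```

With the original configuration (Stripe timeout equal to the 30s inbound timeout) the budget is never smaller than the caller's own deadline, so a worker can stay blocked until after the caller has already received a 504.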

Follow-ups

  • P0: Add circuit breaker to Stripe client with proper thresholds (owner: service owner)
  • P1: Implement bulkhead isolation for Stripe calls (owner: service owner)
  • P1: Reduce Stripe call timeout to 5s (owner: service owner)
  • P1: Add alert on thread pool utilization >80% (owner: on-call SRE)
  • P2: Review all downstream dependencies for circuit breaker and bulkhead patterns (owner: platform team)
  • P2: Update runbook for Stripe outage scenarios (owner: on-call SRE)
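
For the bulkhead follow-up, one common approach is to cap how many workers may be inside a Stripe call at once and to reject the overflow instead of blocking. The sketch below does this with a semaphore. The class and exception names are hypothetical, and the concurrency limit is a tuning assumption that would need to be sized against real traffic.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when the concurrency limit for the dependency is already reached."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Non-blocking acquire: overflow is rejected at once, so threads outside
        # the limit stay free for work that does not need Stripe.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("Stripe bulkhead full; rejecting instead of blocking")
        try:
            return fn()
        finally:
            self._slots.release()


# Illustrative limit: at most 50 of the 200 workers may wait on Stripe.
stripe_bulkhead = Bulkhead(max_concurrent=50)
```

Rejecting instead of queueing is the point of the pattern here: during a latency spike, the remaining workers keep serving requests rather than joining the pile of blocked threads.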