
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc
created: 6/2/2026, 11:56:58 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool
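
Because the Stripe call timeout equals the 30s inbound timeout, a single slow Stripe call can consume the entire request budget. One common remedy is to derive the downstream timeout from the time left on the inbound request. The Python sketch below only illustrates that arithmetic. The implementation language of checkout-svc is not stated in this report, and `downstream_timeout`, the 2s margin, and the 5s cap are hypothetical names and values chosen for illustration.

```python
def downstream_timeout(
    elapsed_s: float,
    inbound_timeout_s: float = 30.0,  # inbound gateway timeout from the config above
    margin_s: float = 2.0,            # hypothetical headroom to build an error response
    cap_s: float = 5.0,               # hypothetical upper bound for any Stripe call
) -> float:
    """Return how long a downstream call may run, in seconds.

    The result never exceeds cap_s and never exceeds the time left on the
    inbound request minus the safety margin. It is 0.0 when no budget remains.
    """
    remaining = inbound_timeout_s - elapsed_s - margin_s
    return max(0.0, min(cap_s, remaining))


if __name__ == "__main__":
    for elapsed in (1.0, 25.0, 29.0):
        print(f"elapsed={elapsed:>4}s -> Stripe timeout {downstream_timeout(elapsed)}s")
```

A result of 0.0 means the caller should skip the Stripe call and return an error immediately rather than start work it cannot finish.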

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

Checkout service is experiencing a severe outage due to elevated latency from the Stripe Connect API, causing thread pool exhaustion and 78% of checkout attempts to fail with HTTP 504. The incident is ongoing with an estimated $180k GMV loss so far.

Severity reasoning: User-facing outage: checkout success rate dropped from 99.5% to 22% (>1% error rate for >5 min). Revenue path broken: estimated $180k GMV lost. Regional unavailability: Stripe Connect latency spike affects us-east-1. This meets SEV1 criteria.

deepseek-chat·prompt v2·output: en·13320ms·1751↑ / 2078↓ tok·$0.00276

Root cause hypotheses

  • high: Stripe Connect API latency spike due to upstream issue in us-east-1

    Evidence: Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1' posted 18:35 UTC. APM shows p99 latency for /v1/payment_intents jumped from 800ms to 27s.

  • high: Thread pool exhaustion caused by all worker threads blocking on Stripe calls

    Evidence: Thread pool utilization at 100% (200 threads), inbound queue depth 4200 (limit 5000). No circuit breaker or bulkhead on Stripe client.

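  • medium: Stripe call timeout (30s) equals the inbound gateway timeout (30s), so slow Stripe calls hold threads for the full request budget

    Evidence: p99 latency is 28s, just under the 30s timeout. Because the downstream timeout is no shorter than the inbound one, a stalled Stripe call occupies a worker thread until the inbound request itself times out, leaving no time to return a controlled error.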

  • low: Resource leak in Stripe HTTP client (e.g., connection pool exhaustion)

    Evidence: No direct evidence yet, but thread pool exhaustion could be exacerbated by connection pool starvation. Check connection pool metrics.

  • low: Network issue between our cluster and Stripe API (e.g., DNS, firewall, TLS)

    Evidence: No evidence of network errors in logs; latency is high but connections are succeeding. Stripe status page confirms their issue.
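
The thread pool exhaustion hypothesis can be sanity-checked with back-of-the-envelope arithmetic. The Python sketch below uses only figures from the incident context (200 threads, 800ms and 27s Stripe p99, queue depth 4200). It assumes, as a worst-case simplification, that every request holds one worker thread for the full p99 duration. `max_throughput` is a hypothetical helper name.

```python
def max_throughput(threads: int, hold_time_s: float) -> float:
    """Upper bound on requests/second when each request holds one thread
    for hold_time_s seconds (threads / hold time)."""
    return threads / hold_time_s


WORKER_THREADS = 200      # from the symptoms
HEALTHY_P99_S = 0.8       # Stripe p99 before the incident
DEGRADED_P99_S = 27.0     # Stripe p99 during the incident
QUEUE_DEPTH = 4200        # inbound queue depth during the incident

healthy_rps = max_throughput(WORKER_THREADS, HEALTHY_P99_S)
degraded_rps = max_throughput(WORKER_THREADS, DEGRADED_P99_S)
# Time for the current backlog to drain at the degraded rate.
drain_s = QUEUE_DEPTH / degraded_rps

if __name__ == "__main__":
    print(f"healthy ceiling:  {healthy_rps:.1f} req/s")
    print(f"degraded ceiling: {degraded_rps:.1f} req/s")
    print(f"backlog drain:    {drain_s:.0f} s")
```

Under these assumptions the pool's ceiling falls from roughly 250 req/s to roughly 7 req/s, and the existing backlog would need several minutes to drain. That is far longer than the 30s gateway timeout, so most queued requests are likely to time out before a thread picks them up.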

Investigation checklist

  1. Check Stripe status page for updates
    curl -s https://status.stripe.com/api/v2/status.json | jq '.status.description'

    Expected: Should show 'Investigating' or 'Identified' for Connect endpoints

  2. Verify Stripe API latency from our cluster
    kubectl exec -n prod deploy/checkout-svc -- curl -o /dev/null -s -w 'time_total: %{time_total}\n' -X POST https://api.stripe.com/v1/payment_intents -H 'Authorization: Bearer <redacted>' -d 'amount=100&currency=usd' --connect-timeout 5 --max-time 30

    Expected: time_total should be <1s normally; if >5s, confirms Stripe latency

  3. Check thread pool and queue depth metrics
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'thread_pool|queue_depth'

    Expected: thread_pool_active == 200 (max), queue_depth near 5000

  4. Check Stripe HTTP client connection pool metrics
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'stripe_connections|http_client'

    Expected: If connection pool exhausted, active connections == max

  5. Check for any recent deployments or config changes
    kubectl rollout history deploy/checkout-svc -n prod

    Expected: No recent changes; if changes exist, check for timeout or thread pool config modifications

  6. Check downstream dependencies (auth-svc, fraud-svc) for any anomalies
    kubectl logs -n prod -l app=auth-svc --since=10m | grep -i error | tail -5; kubectl logs -n prod -l app=fraud-svc --since=10m | grep -i error | tail -5

    Expected: No errors; these services are normal

Mitigation plan

  • Enable circuit breaker on Stripe client to fail fast when latency exceeds threshold (e.g., 5s). This will protect thread pool from being blocked.

    Risk: Circuit breaker may trip prematurely if threshold too low, causing false positives. Safer than disabling Stripe entirely.

    Rollback: Disable circuit breaker by reverting config change or setting threshold to 30s (original timeout).

  • Implement bulkhead isolation for Stripe calls: dedicate a separate thread pool (e.g., 20 threads) for Stripe calls to prevent exhausting main worker pool.

    Risk: If Stripe calls exceed bulkhead capacity, they will be rejected, but main pool remains healthy. Requires code change; may not be immediate.

    Rollback: Revert bulkhead config to use shared pool.

  • Reduce Stripe call timeout from 30s to 5s to fail fast and free threads sooner. This reduces thread blocking duration.

    Risk: While Stripe latency is elevated, calls that would eventually have succeeded will time out at 5s. This trades a lower per-call success rate for freeing threads sooner.

    Rollback: Revert timeout to 30s.

  • If Stripe issue persists and impact is unacceptable, disable Stripe payment path and fall back to an alternative payment method (e.g., PayPal) or show maintenance page.

    Risk: Lost revenue from Stripe payments; customer frustration. Destructive operation: must ensure fallback is tested.

    Rollback: Re-enable Stripe path by reverting feature flag or config.
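
The circuit breaker in the first mitigation could look roughly like the Python sketch below. It is a minimal illustration, not checkout-svc's actual client code. `CircuitBreaker`, `CircuitOpenError`, and the threshold values are hypothetical. The breaker counts exceptions, so it relies on the Stripe call having its own short timeout (e.g., the 5s discussed above) to turn slow calls into failures.

```python
import threading
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_after_s = reset_after_s          # how long to stay open before probing
        self._clock = clock
        self._lock = threading.Lock()
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"  # probe calls are allowed through
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("Stripe circuit is open; failing fast")
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        with self._lock:  # any success closes the circuit
            self._failures = 0
            self._opened_at = None
        return result

    def _record_failure(self) -> None:
        with self._lock:
            self._failures += 1
            # Trip on reaching the threshold, or re-open after a failed probe.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
```

As a simplification, this sketch lets every caller through while half-open. A production breaker would usually limit the number of concurrent probe calls.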

Customer impact

Approximately 78% of checkout attempts are failing with HTTP 504. About 3,000 checkout attempts failed in a 5-minute window. Estimated lost GMV is $180,000 so far and growing. Affected customers see a timeout error page after about 30 seconds.

Postmortem draft

Summary

[FILL IN: 2-3 sentence summary of incident]

Timeline (UTC)

  • 18:35: Stripe status page reports elevated latency for Connect endpoints in us-east-1
  • 18:38: On-call paged
  • 18:40: checkout-svc starts returning 504 errors; success rate drops to 22%
  • 18:41: Confirmed Stripe is the culprit via status page and APM
  • [FILL IN: mitigation actions and resolution time]

Impact

  • Checkout success rate: 22% (down from 99.5%)
  • p99 latency: 28s
  • Failed checkouts: ~3000 in 5min
  • Estimated GMV loss: $180k

Root Cause

Elevated latency from Stripe Connect API caused all worker threads to block on Stripe calls, exhausting the thread pool and causing cascading timeouts.

Detection

  • Automated alert: success rate dropped below threshold
  • On-call paged at 18:38
  • Stripe status page confirmed upstream issue

Response

  • 18:38: Pager acknowledged
  • 18:41: Identified Stripe as root cause
  • [FILL IN: mitigation steps taken]

What Went Well

  • Quick identification of Stripe as culprit via status page and APM
  • [FILL IN]

What Went Poorly

  • No circuit breaker or bulkhead on Stripe client
  • Stripe call timeout (30s) matched the inbound gateway timeout (30s), so slow Stripe calls held worker threads for the full request budget
  • [FILL IN]

Action Items

  • [FILL IN: specific action items with owners and tickets]

Follow-ups

  • P0: Implement circuit breaker for Stripe client with 5s timeout threshold (owner: service owner)
  • P0: Implement bulkhead isolation for Stripe calls with dedicated thread pool (owner: service owner)
  • P1: Reduce Stripe call timeout from 30s to 5s (owner: service owner)
  • P1: Add monitoring for thread pool utilization and queue depth with alerts (owner: platform team)
  • P2: Review and update incident response runbook for Stripe-related outages (owner: on-call SRE)
  • P2: Evaluate fallback payment methods for Stripe unavailability (owner: product team)
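
For the bulkhead follow-up, a minimal Python sketch of a dedicated, bounded pool for Stripe calls is shown below. It is an illustration under stated assumptions, not the service's real code. `Bulkhead` and `BulkheadFullError` are hypothetical names, and the 20-thread and 5s defaults echo the example values from the mitigation plan. Callers wait at most `call_timeout_s`, and calls beyond the pool's capacity are rejected immediately instead of queuing.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when every slot in the dedicated pool is already in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int = 20, call_timeout_s: float = 5.0) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent,
                                        thread_name_prefix="stripe")
        # One slot per pool thread, so accepted calls never queue behind stuck ones.
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._timeout = call_timeout_s

    def run(self, fn: Callable[[], T]) -> T:
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("Stripe bulkhead is full; rejecting call")
        try:
            future = self._pool.submit(fn)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot only when the call really finishes, even if the
        # caller below has already given up waiting.
        future.add_done_callback(lambda _f: self._slots.release())
        # Raises FutureTimeout if the call is still running after the timeout.
        return future.result(timeout=self._timeout)
```

Note that `future.result(timeout=...)` frees the calling worker but does not stop the stuck call, which keeps its slot until it returns. The underlying HTTP client therefore still needs its own request timeout so that slots are eventually released.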