[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:56:58 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
Checkout service is experiencing a severe outage due to elevated latency from the Stripe Connect API, causing thread pool exhaustion and 78% of checkout attempts to fail with HTTP 504. The incident is ongoing with an estimated $180k GMV loss so far.
Severity reasoning: User-facing outage: checkout success rate dropped from 99.5% to 22% (>1% error rate for >5 min). Revenue path broken: estimated $180k GMV lost. Regional unavailability: Stripe Connect latency spike affects us-east-1. This meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·13320ms·1751↑ / 2078↓ tok·$0.00276
Root cause hypotheses
- highStripe Connect API latency spike due to upstream issue in us-east-1
Evidence: Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1' posted 18:35 UTC. APM shows p99 latency for /v1/payment_intents jumped from 800ms to 27s.
- highThread pool exhaustion caused by all worker threads blocking on Stripe calls
Evidence: Thread pool utilization at 100% (200 threads), inbound queue depth 4200 (limit 5000). No circuit breaker or bulkhead on Stripe client.
- mediumStripe call timeout (30s) matches inbound gateway timeout, causing cascading failures
Evidence: p99 latency 28s, hitting 30s timeout. Timeout configuration creates a tight coupling between downstream and upstream timeouts.
- lowResource leak in Stripe HTTP client (e.g., connection pool exhaustion)
Evidence: No direct evidence yet, but thread pool exhaustion could be exacerbated by connection pool starvation. Check connection pool metrics.
- lowNetwork issue between our cluster and Stripe API (e.g., DNS, firewall, TLS)
Evidence: No evidence of network errors in logs; latency is high but connections are succeeding. Stripe status page confirms their issue.
Investigation checklist
- Check Stripe status page for updates
curl -s https://status.stripe.com/api/v2/status.json | jq '.status.description'Expected: Should show 'Investigating' or 'Identified' for Connect endpoints
- Verify Stripe API latency from our cluster
kubectl exec -n prod deploy/checkout-svc -- curl -o /dev/null -s -w 'time_total: %{time_total}\n' -X POST https://api.stripe.com/v1/payment_intents -H 'Authorization: Bearer <redacted>' -d 'amount=100¤cy=usd' --connect-timeout 5 --max-time 30Expected: time_total should be <1s normally; if >5s, confirms Stripe latency
- Check thread pool and queue depth metrics
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'thread_pool|queue_depth'Expected: thread_pool_active == 200 (max), queue_depth near 5000
- Check Stripe HTTP client connection pool metrics
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'stripe_connections|http_client'Expected: If connection pool exhausted, active connections == max
- Check for any recent deployments or config changes
kubectl rollout history deploy/checkout-svc -n prodExpected: No recent changes; if changes exist, check for timeout or thread pool config modifications
- Check downstream dependencies (auth-svc, fraud-svc) for any anomalies
kubectl logs -n prod -l app=auth-svc --since=10m | grep -i error | tail -5; kubectl logs -n prod -l app=fraud-svc --since=10m | grep -i error | tail -5Expected: No errors; these services are normal
Mitigation plan
Enable circuit breaker on Stripe client to fail fast when latency exceeds threshold (e.g., 5s). This will protect thread pool from being blocked.
Risk: Circuit breaker may trip prematurely if threshold too low, causing false positives. Safer than disabling Stripe entirely.
Rollback: Disable circuit breaker by reverting config change or setting threshold to 30s (original timeout).
Implement bulkhead isolation for Stripe calls: dedicate a separate thread pool (e.g., 20 threads) for Stripe calls to prevent exhausting main worker pool.
Risk: If Stripe calls exceed bulkhead capacity, they will be rejected, but main pool remains healthy. Requires code change; may not be immediate.
Rollback: Revert bulkhead config to use shared pool.
Reduce Stripe call timeout from 30s to 5s to fail fast and free threads sooner. This reduces thread blocking duration.
Risk: Some legitimate Stripe calls may timeout if latency is high but not failing. Trade-off between availability and success rate.
Rollback: Revert timeout to 30s.
If Stripe issue persists and impact is unacceptable, disable Stripe payment path and fall back to an alternative payment method (e.g., PayPal) or show maintenance page.
Risk: Lost revenue from Stripe payments; customer frustration. Destructive operation: must ensure fallback is tested.
Rollback: Re-enable Stripe path by reverting feature flag or config.
Customer impact
Approximately 78% of checkout attempts are failing with a 504 error. In the last 5 minutes, about 3,000 customers were unable to complete purchases. Estimated lost revenue is $180,000 and growing. Customers see a timeout error page after 30 seconds.
Postmortem draft
Summary
[FILL IN: 2-3 sentence summary of incident]
Timeline (UTC)
- 18:35: Stripe status page reports elevated latency for Connect endpoints in us-east-1
- 18:40: checkout-svc starts returning 504 errors; success rate drops to 22%
- 18:38: On-call paged
- 18:41: Confirmed Stripe is the culprit via status page and APM
- [FILL IN: mitigation actions and resolution time]
Impact
- Checkout success rate: 22% (down from 99.5%)
- p99 latency: 28s
- Failed checkouts: ~3000 in 5min
- Estimated GMV loss: $180k
Root Cause
Elevated latency from Stripe Connect API caused all worker threads to block on Stripe calls, exhausting the thread pool and causing cascading timeouts.
Detection
- Automated alert: success rate dropped below threshold
- On-call paged at 18:38
- Stripe status page confirmed upstream issue
Response
- 18:38: Pager acknowledged
- 18:41: Identified Stripe as root cause
- [FILL IN: mitigation steps taken]
What Went Well
- Quick identification of Stripe as culprit via status page and APM
- [FILL IN]
What Went Poorly
- No circuit breaker or bulkhead on Stripe client
- Timeout configuration matched gateway timeout, causing cascading failures
- [FILL IN]
Action Items
- [FILL IN: specific action items with owners and tickets]
Follow-ups
- P0Implement circuit breaker for Stripe client with 5s timeout threshold— service owner
- P0Implement bulkhead isolation for Stripe calls with dedicated thread pool— service owner
- P1Reduce Stripe call timeout from 30s to 5s— service owner
- P1Add monitoring for thread pool utilization and queue depth with alerts— platform team
- P2Review and update incident response runbook for Stripe-related outages— on-call SRE
- P2Evaluate fallback payment methods for Stripe unavailability— product team
Similar past incidents
lexical match (pg_trgm)
- 61%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 59%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 52%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 49%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 34%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.