[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:38:28 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
Checkout service is failing for ~78% of users due to Stripe Connect API latency spike (p99 27s vs 800ms baseline). All 200 worker threads are blocked waiting on Stripe responses, causing thread pool exhaustion and 504 timeouts. Estimated $180k GMV lost in 5 minutes.
Severity reasoning: User-facing outage: checkout success rate dropped from 99.5% to 22% (>1% error rate for >5 min). Revenue path broken: estimated $180k GMV lost. Regional unavailability: Stripe us-east-1 outage affecting all users.
deepseek-chat·prompt v2·output: en·12984ms·1751↑ / 2078↓ tok·$0.00276
Root cause hypotheses
- highStripe Connect API latency spike due to upstream outage in us-east-1
Evidence: Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1' posted 18:35 UTC. APM shows p99 latency for /v1/payment_intents jumped from 800ms to 27s at 18:40 UTC.
- highNo circuit breaker on Stripe client causes cascading thread pool exhaustion
Evidence: Thread pool utilization at 100% (200 threads blocked). Inbound queue depth 4200. No circuit breaker or bulkhead configured for Stripe calls.
- mediumStripe timeout (30s) matches inbound gateway timeout, preventing fast failure
Evidence: p99 latency 28s, hitting 30s timeout. No shorter timeout for Stripe calls to fail fast.
- lowRecent code change increased Stripe call volume or changed retry behavior
Evidence: No direct evidence in context. Could be investigated via recent deployments or config changes.
- lowNetwork issue between our cluster and Stripe (e.g., DNS, TLS, routing)
Evidence: No evidence of network errors; Stripe status page confirms their issue. Low likelihood given external confirmation.
Investigation checklist
- Check Stripe status page for updates
curl -s https://status.stripe.com/api/v2/status.json | jq '.status.description'Expected: Should show 'Investigating elevated latency for Connect endpoints in us-east-1' or updated status
- Check thread pool and queue depth in checkout-svc
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.details.threadPool'Expected: threadPool.activeCount == 200, queueDepth > 4000
- Check APM for Stripe call latency breakdown
curl -s 'http://apm-dashboard.internal/d/checkout?var-service=checkout-svc&var-downstream=stripe-connect&from=now-30m' | jq '.series[0].values[-1][1]'Expected: p99 latency > 25s
- Verify other downstream services are healthy
kubectl logs -n prod -l app=auth-svc --since=15m | grep -c 'error'Expected: error count < 10 (normal baseline)
- Check recent deployments to checkout-svc
kubectl rollout history -n prod deploy/checkout-svcExpected: Last change > 1 hour ago (no recent deploy)
- Check Stripe API key and connectivity
kubectl exec -n prod deploy/checkout-svc -- curl -s -o /dev/null -w '%{http_code}' --connect-timeout 5 https://api.stripe.com/v1/payment_intentsExpected: 200 or 401 (not timeout)
Mitigation plan
Enable circuit breaker for Stripe calls with fast-fail timeout (e.g., 5s) and fallback to error response
Risk: May cause partial checkout failures if circuit opens, but prevents thread pool exhaustion. Safer than disabling Stripe entirely.
Rollback: Disable circuit breaker via feature flag: kubectl set env deploy/checkout-svc CIRCUIT_BREAKER_ENABLED=false
Reduce Stripe call timeout from 30s to 5s to fail fast and free threads
Risk: May cause more timeouts if Stripe latency is transient, but threads will be released quickly. Low risk.
Rollback: Revert timeout: kubectl set env deploy/checkout-svc STRIPE_TIMEOUT=30
Implement bulkhead for Stripe calls (max 20 concurrent threads) to protect main thread pool
Risk: May limit throughput to Stripe, but prevents total exhaustion. Requires code change or config reload.
Rollback: Remove bulkhead: kubectl set env deploy/checkout-svc STRIPE_BULKHEAD_MAX=200
If Stripe outage persists >15 min, disable Stripe payment path and fall back to alternative payment provider or show maintenance page
Risk: Lost revenue from Stripe payments; customers may be redirected. Destructive if not coordinated with business.
Rollback: Re-enable Stripe: kubectl set env deploy/checkout-svc STRIPE_ENABLED=true
Customer impact
Approximately 78% of checkout attempts are failing with a 504 error. In the last 5 minutes, about 3,000 checkouts have failed, resulting in an estimated $180,000 in lost gross merchandise value. Users see a timeout error page after 30 seconds. No ETA for resolution; Stripe is investigating their own issue.
Postmortem draft
Summary
[FILL IN: 2-3 sentence summary of incident]
Timeline (UTC)
- 18:35 - Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1'
- 18:40 - checkout-svc success rate drops to 22%, p99 latency 28s
- 18:38 - On-call paged
- 18:41 - Confirmed Stripe is culprit via status page and APM
- 18:43 - Debate: wait vs disable Stripe
- [FILL IN: mitigation actions and resolution time]
Impact
- Error rate: ~78% of checkout requests failed (504)
- Duration: [FILL IN: start to end time]
- Lost GMV: ~$180k (estimated)
- Affected users: all users attempting checkout
Root Cause
Stripe Connect API latency spike in us-east-1 caused all checkout-svc worker threads to block waiting for responses, leading to thread pool exhaustion and cascading timeouts.
Detection
- Automated alert: success rate < 95% triggered pager at 18:38
- Stripe status page confirmed external issue
Response
- [FILL IN: actions taken, e.g., enabled circuit breaker, reduced timeout]
What Went Well
- Quick identification of Stripe as root cause via APM and status page
- On-call engineer paged promptly
What Went Poorly
- No circuit breaker or bulkhead for Stripe calls
- Timeout configuration matched gateway timeout, preventing fast failure
- No runbook for Stripe outage scenario
Action Items
- [FILL IN: action items from follow_ups]
Follow-ups
- P0Add circuit breaker to Stripe client with 5s timeout and fallback— service owner
- P0Implement bulkhead isolation for Stripe calls (max 20 concurrent threads)— service owner
- P1Reduce Stripe call timeout to 5s (configurable via env var)— service owner
- P1Create runbook for Stripe outage scenario with step-by-step mitigation— on-call SRE
- P1Add alert on thread pool utilization > 80%— platform team
- P2Review and test fallback payment provider integration— service owner
- P1Postmortem meeting within 48 hours— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 91%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 81%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 49%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 46%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 37%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts