[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svc
created: 6/2/2026, 11:57:12 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.
Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)
Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal
Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead: Stripe calls share the main worker thread pool
Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)
On-call:
- 18:38: pager
- 18:41: confirmed Stripe is the culprit (status page + APM)
- 18:43: debating: wait it out vs disable Stripe path entirely
Summary
Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and 504 timeouts. ~3000 failed checkouts in 5 minutes with estimated $180k GMV loss. Incident is ongoing.
Severity reasoning: User-facing outage: checkout success rate at 22% (down from 99.5%, ~78% of attempts failing) for at least 5 min, revenue path broken (Stripe payment processing), and elevated Stripe Connect latency in us-east-1 affecting most checkout attempts. Meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·10517ms·1751↑ / 1669↓ tok·$0.00231
Root cause hypotheses
- high: Stripe Connect API latency spike due to upstream issue in us-east-1
Evidence: Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1' (18:35 UTC). APM shows p99 latency for /v1/payment_intents jumped from 800ms to 27s.
- high: No circuit breaker on Stripe client causing cascading thread pool exhaustion
Evidence: Thread pool utilization at 100% (200 workers blocked), inbound queue depth 4200. Stripe calls share main worker pool with no bulkhead.
- medium: Stripe timeout (30s) matches inbound gateway timeout, preventing fast failure
Evidence: p99 latency 28s, hitting 30s timeout. No shorter timeout or circuit breaker to fail fast.
- low: Recent deployment or config change to checkout-svc increased Stripe call volume
Evidence: No direct evidence in context; low likelihood as Stripe status page indicates upstream issue.
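The thread pool exhaustion hypothesis can be sanity-checked with Little's law (concurrency = throughput x latency). The sketch below is a back-of-the-envelope estimate only: it assumes every worker is held for roughly the Stripe p99 latency, which is pessimistic for the healthy case, and `max_throughput_rps` is a name introduced here for illustration.

```python
def max_throughput_rps(workers: int, seconds_per_request: float) -> float:
    """Upper bound on requests/second when each request holds one worker
    for `seconds_per_request` (Little's law rearranged)."""
    return workers / seconds_per_request


WORKERS = 200  # worker threads, from the incident context

# Latencies are the Stripe p99 values from APM, used as a stand-in
# for per-request hold time (an assumption, not a measurement).
for label, latency_s in [("normal, 0.8s", 0.8), ("degraded, 27s", 27.0)]:
    rps = max_throughput_rps(WORKERS, latency_s)
    print(f"{label}: ~{rps:.1f} req/s sustainable")
```

Under these assumptions the pool drops from roughly 250 req/s to about 7 req/s of sustainable throughput, which is consistent with the queue filling toward its 5000-request limit.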
Investigation checklist
- Check Stripe status page for updates
curl -s https://status.stripe.com/ | grep -i 'connect'
Expected: Status should show 'Investigating' or 'Resolved' for Connect endpoints.
- Verify Stripe latency from checkout-svc pods
kubectl exec -n prod deploy/checkout-svc -- curl -o /dev/null -s -w '%{time_total}\n' https://api.stripe.com/v1/payment_intents -H 'Authorization: Bearer <redacted>'
Expected: Response time should be <1s normally; if >10s, confirms Stripe latency.
- Check thread pool and queue depth in checkout-svc
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threadpool | jq '.threadPool.utilization, .queueDepth'
Expected: Utilization should be <80%; queue depth <1000. Currently 100% and 4200.
- Check for recent deployments to checkout-svc
kubectl rollout history deployment -n prod checkout-svc
Expected: No recent changes; if recent, check diff.
- Check downstream dependencies (auth-svc, fraud-svc) latency
kubectl exec -n prod deploy/checkout-svc -- curl -o /dev/null -s -w '%{time_total}\n' http://auth-svc:8080/health
Expected: Response time <100ms; if elevated, indicates broader issue.
Mitigation plan
Enable circuit breaker on Stripe client with fallback to fail fast (return 503 immediately) to protect thread pool
Risk: All Stripe payments will fail fast, but thread pool will recover. No data loss.
Rollback: Disable circuit breaker via feature flag or config reload.
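For reference, the fail-fast behavior described above can be sketched as follows. This is a minimal illustration, not checkout-svc's actual client code: `CircuitBreaker`, `CircuitOpenError`, the threshold of 5 failures, and the 30s cool-down are all assumptions chosen for the example, and a production service would more likely use an established resilience library.

```python
import threading
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the upstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after_s = reset_after_s          # cool-down before a trial call
        self._clock = clock
        self._lock = threading.Lock()
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        with self._lock:
            if (self._opened_at is not None
                    and self._clock() - self._opened_at < self.reset_after_s):
                # Open: fail immediately so the worker thread is freed.
                raise CircuitOpenError("upstream circuit is open")
        # Closed, or cool-down elapsed (half-open): let the call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            with self._lock:
                self._failures += 1
                if self._opened_at is not None or self._failures >= self.failure_threshold:
                    self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        with self._lock:
            self._failures = 0
            self._opened_at = None                   # success closes the circuit
        return result
```

A caller would wrap the Stripe request, for example `breaker.call(create_payment_intent, order)` with a hypothetical `create_payment_intent`, and map `CircuitOpenError` to the immediate 503 response.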
Reduce Stripe timeout from 30s to 5s to fail fast and free threads
Risk: Payments that Stripe would have completed in 5-30s will now time out and fail. Still safer than the current state.
Rollback: Revert timeout to 30s via config push.
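One way to keep the Stripe timeout strictly below the inbound 30s limit is to derive it from the time remaining on each request. The helper below is a sketch under assumptions: `outbound_timeout_s`, the 5s cap, and the 1s margin reserved for building the error response are illustrative values, and the result would still need to be passed to whatever timeout parameter the HTTP client exposes.

```python
def outbound_timeout_s(inbound_deadline_s: float,
                       elapsed_s: float,
                       cap_s: float = 5.0,
                       margin_s: float = 1.0) -> float:
    """Timeout to use for the upstream call.

    Never exceeds `cap_s`, and always leaves `margin_s` of the inbound
    deadline unused so the caller can still return a clean error.
    """
    remaining = inbound_deadline_s - elapsed_s - margin_s
    if remaining <= 0:
        # No budget left: fail before calling the upstream at all.
        raise TimeoutError("no time budget left for the upstream call")
    return min(cap_s, remaining)


# A request that has used 0.2s of its 30s deadline gets the full 5s cap.
print(outbound_timeout_s(30.0, 0.2))
# A request that has already spent 27s only gets the 2s that remain.
print(outbound_timeout_s(30.0, 27.0))
```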
Implement bulkhead for Stripe calls (dedicated thread pool of 20 threads) to isolate from main pool
Risk: Requires code change; cannot be done quickly. Not recommended for immediate mitigation.
Rollback: Revert to shared pool via config.
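The bulkhead idea can be illustrated with a dedicated, bounded executor that rejects work once all of its slots are busy. This is a sketch only: `Bulkhead` and `BulkheadFullError` are names invented for the example, the pool size of 20 comes from the plan above, and nothing here reflects how checkout-svc is actually implemented.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when every slot in the dedicated pool is already in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int = 20):
        # The semaphore caps in-flight calls; the pool has one thread per slot,
        # so accepted work never waits in the executor's queue.
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent,
                                        thread_name_prefix="stripe")

    def submit(self, fn, *args, **kwargs) -> Future:
        if not self._slots.acquire(blocking=False):
            # Reject immediately instead of tying up another caller.
            raise BulkheadFullError("stripe bulkhead is full")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the call finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future
```

With this shape, at most 20 callers can be waiting on Stripe at once; the remaining main-pool workers stay available for requests that do not touch Stripe, and excess Stripe calls fail fast with `BulkheadFullError`.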
Customer impact
Approximately 78% of checkout attempts are failing with HTTP 504 errors. ~3000 failed checkouts in 5 minutes, estimated $180k lost GMV. Customers cannot complete purchases. No ETA for resolution; dependent on Stripe recovery.
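The figures above imply a loss run rate that may help weigh waiting against disabling the Stripe path. The arithmetic below is a rough sketch that assumes failures and GMV loss accrue at a constant rate, which the source does not confirm.

```python
failed_checkouts = 3000   # failed checkouts observed so far
window_min = 5            # observation window in minutes
lost_gmv_usd = 180_000    # estimated lost GMV so far

failed_per_min = failed_checkouts / window_min
gmv_per_failed_usd = lost_gmv_usd / failed_checkouts
gmv_per_min_usd = lost_gmv_usd / window_min

print(f"failed checkouts per minute: {failed_per_min:,.0f}")
print(f"average GMV per failed checkout: ${gmv_per_failed_usd:,.0f}")
print(f"GMV lost per minute: ${gmv_per_min_usd:,.0f}")
```

At this rate, each additional minute of waiting costs roughly 600 failed checkouts and $36,000 in GMV.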
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 18:35: Stripe status page reports elevated latency for Connect endpoints in us-east-1
- 18:38: Pager alert for checkout-svc high error rate
- 18:41: Confirmed Stripe is the culprit via status page and APM
- 18:43: Debating mitigation options
Impact
- Checkout success rate dropped from 99.5% to 22%
- p99 latency 28s
- ~3000 failed checkouts, $180k GMV loss
Root Cause
Stripe Connect API latency spike caused thread pool exhaustion in checkout-svc due to lack of circuit breaker and bulkhead.
Detection
Pager alert for error rate threshold breach.
Response
[FILL IN]
What Went Well
- Quick identification of Stripe as culprit via status page
- APM data clearly showed downstream latency
What Went Poorly
- No circuit breaker on Stripe client
- No bulkhead isolation
- Timeout too long (30s)
Action Items
- [ ] Add circuit breaker to Stripe client
- [ ] Implement bulkhead for Stripe calls
- [ ] Reduce Stripe timeout to 5s
- [ ] Add alert on thread pool utilization
Follow-ups
- P0: Add circuit breaker to Stripe HTTP client with fallback (owner: checkout-svc service owner)
- P0: Implement bulkhead (dedicated thread pool) for Stripe calls (owner: checkout-svc service owner)
- P1: Reduce Stripe API timeout from 30s to 5s (owner: checkout-svc service owner)
- P1: Add alert on thread pool utilization >80% (owner: platform team)
- P2: Review and improve incident response runbook for Stripe dependencies (owner: on-call SRE)
- P2: Evaluate adding Stripe status page as a monitoring source (owner: platform team)
Similar past incidents
lexical match (pg_trgm)
- 66%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 63%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 57%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 54%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 33%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts