[Eval][v3][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svc
created: 6/2/2026, 11:58:33 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.
Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)
Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal
Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead: Stripe calls share the main worker thread pool
Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)
On-call:
- 18:38: pager
- 18:41: confirmed Stripe is the culprit (status page + APM)
- 18:43: debating whether to wait it out or disable the Stripe path entirely
Summary
Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and HTTP 504 errors. Approximately 3000 checkouts failed in 5 minutes, with an estimated $180k GMV loss. The incident is ongoing as of 18:43 UTC.
Severity reasoning: User-facing outage: error rate > 1% (78% failure rate) for >5 minutes, revenue path broken (checkout failures causing $180k GMV loss). Matches SEV1 criteria.
deepseek-chat·prompt v3·output: en·12428ms·2002↑ / 1869↓ tok·$0.00260
Root cause hypotheses
- high: Stripe Connect API latency spike causing thread pool exhaustion in checkout-svc
Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1; thread pool utilization at 100% with all 200 workers blocked.
- high: No circuit breaker on Stripe client leads to cascading failures
Evidence: Configuration review shows no circuit breaker or bulkhead for Stripe calls; all threads block waiting for Stripe response, causing queue buildup.
- low: Inbound request surge exacerbates thread pool exhaustion
Evidence: Inbound queue depth at 4200 (limit 5000), but success rate drop correlates with Stripe latency increase; no evidence of abnormal traffic spike before incident.
- medium: Stripe call timeout (30s) is set too high, so each slow call holds a worker thread for up to 30s
Evidence: The timeout matches the inbound gateway timeout; reducing it would free threads sooner, but it is a contributing factor, not the root cause of the Stripe latency.
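To make the thread-exhaustion hypotheses concrete, here is a rough capacity estimate based on Little's law (throughput is at most workers divided by the time each request holds a worker). It is a sketch, not a measurement: it assumes every request holds a worker for roughly the observed Stripe p99, which is pessimistic for the healthy case, and the function name is illustrative.

```python
def max_throughput_rps(workers: int, hold_seconds: float) -> float:
    """Upper bound on requests completed per second when all workers are busy.

    Little's law: concurrency = throughput * time_in_system,
    so throughput <= workers / hold_seconds.
    """
    return workers / hold_seconds


WORKERS = 200  # worker threads, from the incident context

# Assumption: hold time is approximated by the Stripe p99 (or the timeout).
healthy = max_throughput_rps(WORKERS, 0.8)          # p99 before the incident
degraded = max_throughput_rps(WORKERS, 27.0)        # p99 during the incident
with_5s_timeout = max_throughput_rps(WORKERS, 5.0)  # proposed 5s timeout

print(f"healthy:    ~{healthy:.0f} rps")
print(f"degraded:   ~{degraded:.1f} rps")
print(f"5s timeout: ~{with_5s_timeout:.0f} rps")
```

Under these assumptions the pool's ceiling falls from about 250 requests/s to about 7.4 requests/s, which is consistent with the queue filling without any traffic surge. A 5s timeout raises the ceiling to about 40 requests/s, but those faster completions are mostly fast failures rather than successful payments while Stripe remains slow.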
Investigation checklist
- Confirm Stripe latency via APM and status page
curl -s https://status.stripe.com/api/v1/incidents | jq '.incidents[] | select(.status != "resolved")'
Expected: Active incident for Connect endpoints in us-east-1 with elevated latency
- Check thread pool utilization and queue depth in checkout-svc
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threaddump | grep -E 'pool-.*thread.*state=BLOCKED' | wc -l
Expected: 200 blocked threads (all workers)
- Verify downstream dependencies (auth-svc, fraud-svc) are healthy
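kubectl logs -n prod -l 'app in (auth-svc,fraud-svc)' --since=10m --tail=-1 | grep -iE 'error|timeout' | tail -5
Expected: No errors or timeouts (--tail=-1 is needed because kubectl returns only the last 10 lines per pod when a label selector is used)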
- Check Stripe API response times from checkout-svc logs
kubectl logs -n prod -l app=checkout-svc --since=15m --tail=-1 | grep -oP 'stripe_request_duration_ms=\K\d+' | sort -n | tail -5
Expected: Values near 27000 (27s) indicating high latency
- Review circuit breaker and bulkhead configuration for Stripe client
kubectl exec -n prod deploy/checkout-svc -- cat /app/config/application.yml | grep -A10 'stripe'
Expected: No circuit breaker or bulkhead settings; only timeout=30s
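The `sort -n | tail -5` step in the checklist shows only the slowest few calls. If a percentile is more useful, the same log field can be summarized with a short script. This is a sketch under assumptions: only the `stripe_request_duration_ms` field name comes from the checklist, the key=value log format is assumed, and the sample lines are synthetic placeholders, not real checkout-svc logs.

```python
import math
import re

# Field name taken from the checklist command; the surrounding format is assumed.
DURATION_RE = re.compile(r"stripe_request_duration_ms=(\d+)")


def extract_durations(lines):
    """Return every Stripe call duration (ms) found in the given log lines."""
    return [int(m.group(1)) for line in lines for m in DURATION_RE.finditer(line)]


def percentile(values, pct):
    """Nearest-rank percentile of values, for pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Synthetic sample lines for illustration only.
sample_lines = [
    "level=info stripe_request_duration_ms=790 status=200",
    "level=info stripe_request_duration_ms=805 status=200",
    "level=warn stripe_request_duration_ms=26950 status=200",
    "level=warn stripe_request_duration_ms=27010 status=200",
]

durations = extract_durations(sample_lines)
print("p50:", percentile(durations, 50), "ms")
print("p99:", percentile(durations, 99), "ms")
```

In practice the lines would come from the `kubectl logs` output rather than a hard-coded list.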
Mitigation plan
Enable a circuit breaker for the Stripe client, with a fallback that fails fast to an error page or a retry-later prompt
Risk: While the circuit is open, all Stripe-dependent checkouts fail immediately; in exchange, worker threads are no longer held waiting on Stripe.
Rollback: Disable circuit breaker via feature flag or config reload.
Reduce Stripe call timeout from 30s to 5s to free threads faster
Risk: May cause more timeouts if Stripe latency is high, but reduces thread hold time.
Rollback: Revert timeout to 30s via config update.
Implement bulkhead isolation for Stripe calls (dedicated thread pool of 20 threads)
Risk: Requires code change and redeploy; not immediate. Could be done as hotfix if deployment pipeline allows.
Rollback: Revert to shared thread pool via config.
If Stripe outage persists, disable Stripe payment path and show 'payment unavailable' message
Risk: Lost revenue from Stripe payments, but preserves checkout flow for other payment methods.
Rollback: Re-enable Stripe path when Stripe recovers.
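The circuit breaker option in the plan above can be sketched as a small state machine: count consecutive failures, fail fast once a threshold is reached, and allow a trial call after a cool-down. This is a minimal, single-threaded illustration, not checkout-svc's implementation. The class name, thresholds, and injectable clock are all assumptions, and a production service would more likely use an established resilience library than hand-rolled code.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    # Assumed defaults; real values would need tuning against traffic.
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast so no worker waits on the slow dependency.
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down elapsed: half-open, let this call through as a trial.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap the Stripe request, for example `breaker.call(create_payment_intent, order)` where `create_payment_intent` is a hypothetical client function, and map `CircuitOpenError` to the "payment unavailable" fallback. The sketch is not thread-safe, so under concurrency several trial calls could pass in the half-open state.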
Customer impact
Approximately 78% of checkout attempts are failing with a 504 error. Customers see a timeout or error page when trying to complete purchases. Estimated 3000 failed checkouts in the last 5 minutes, with $180k in lost GMV. No ETA for resolution yet.
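For the wait-it-out versus disable-Stripe decision, the impact figures can be turned into per-minute rates. The arithmetic below uses only numbers from the incident context. The projection assumes the loss rate stays constant and that every failed checkout is lost revenue (no customer retries), so treat it as an upper-bound sketch.

```python
FAILED_CHECKOUTS = 3000   # failed checkouts in the window
WINDOW_MINUTES = 5
LOST_GMV_USD = 180_000    # estimated lost GMV so far
FAILURE_RATE = 0.78       # share of checkout attempts failing

avg_order_usd = LOST_GMV_USD / FAILED_CHECKOUTS         # value per failed checkout
failed_per_minute = FAILED_CHECKOUTS / WINDOW_MINUTES   # failed checkouts per minute
loss_per_minute_usd = LOST_GMV_USD / WINDOW_MINUTES     # GMV lost per minute
attempts_per_minute = failed_per_minute / FAILURE_RATE  # implied total attempts per minute


def projected_loss_usd(extra_minutes: float) -> float:
    """Additional GMV lost if the current rate continues (assumption: constant rate)."""
    return loss_per_minute_usd * extra_minutes


print(f"average order value: ${avg_order_usd:.0f}")
print(f"failed checkouts:    {failed_per_minute:.0f}/min")
print(f"GMV loss rate:       ${loss_per_minute_usd:,.0f}/min")
print(f"implied attempts:    ~{attempts_per_minute:.0f}/min")
print(f"next 10 min at this rate: ${projected_loss_usd(10):,.0f}")
```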
Postmortem draft
Summary
Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and HTTP 504 errors.
Timeline (UTC)
- 18:35: Stripe status page reports elevated latency for Connect endpoints in us-east-1.
- 18:38: Pager alerts for checkout-svc high error rate.
- 18:41: Confirmed Stripe as culprit via APM and status page.
- 18:43: Mitigation options being evaluated.
Impact
- 3000 failed checkouts in 5 minutes.
- $180k estimated lost GMV.
- p99 latency 28s, success rate 22%.
Root Cause
Stripe Connect API latency spike (p99 from 800ms to 27s) combined with lack of circuit breaker or bulkhead in checkout-svc, causing all worker threads to block and queue to fill.
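The missing bulkhead named in the root cause can be illustrated with a concurrency cap on Stripe calls. Note that this sketch uses a semaphore-based bulkhead, a simpler variant of the dedicated thread pool proposed in the mitigation plan: it limits how many workers may be inside a Stripe call at once and rejects the rest immediately, so the other workers stay free. The class name is hypothetical, and the limit of 20 mirrors the pool size suggested in the plan.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when the concurrency cap for the dependency is already reached."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Non-blocking acquire: reject instead of queueing behind slow calls.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError("bulkhead is full; rejecting call")
        try:
            yield
        finally:
            self._permits.release()


# Assumption: at most 20 of the 200 workers may wait on Stripe at any time.
stripe_bulkhead = Bulkhead(max_concurrent=20)
```

A request handler would run the Stripe call inside `with stripe_bulkhead.slot():` and translate `BulkheadFullError` into a fast failure. With a cap of 20, a Stripe slowdown could occupy at most 10% of the 200 workers instead of all of them.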
Detection
Pager alert based on error rate threshold. Stripe status page confirmed external issue.
Response
- 18:38: On-call acknowledged.
- 18:41: Identified Stripe as cause.
- [FILL IN] Mitigation actions taken.
What Went Well
- Quick identification of external dependency via APM and status page.
- Clear communication of impact.
What Went Poorly
- No circuit breaker or bulkhead for critical dependency.
- Timeout too long (30s) exacerbating thread exhaustion.
Action Items
- [FILL IN] Implement circuit breaker for Stripe client.
- [FILL IN] Add bulkhead isolation for Stripe calls.
- [FILL IN] Reduce Stripe timeout to 5s.
- [FILL IN] Add monitoring for thread pool utilization and queue depth.
Follow-ups
- P0: Implement circuit breaker for Stripe client with fallback (owner: payments-platform)
- P0: Add bulkhead isolation for Stripe calls (dedicated thread pool) (owner: payments-platform)
- P1: Reduce Stripe call timeout from 30s to 5s (owner: on-call SRE)
- P1: Add monitoring alerts for thread pool utilization >80% and queue depth >80% (owner: platform team)
- P2: Review all external dependency configurations for circuit breakers and bulkheads (owner: service owner)
- P2: Update runbook for Stripe dependency failure with mitigation steps (owner: on-call SRE)
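As a quick check on the proposed 80% alert thresholds, the values observed during this incident can be run through them. This sketch assumes "queue depth >80%" means 80% of the 5000-request queue limit, and the function names are illustrative.

```python
ALERT_THRESHOLD = 0.80  # proposed threshold from the follow-ups


def utilization(current: int, limit: int) -> float:
    """Fraction of a bounded resource currently in use."""
    return current / limit


def breaches(current: int, limit: int, threshold: float = ALERT_THRESHOLD) -> bool:
    """True when utilization is strictly above the alert threshold."""
    return utilization(current, limit) > threshold


# Values from the incident context.
thread_alert = breaches(current=200, limit=200)    # 100% of worker threads
queue_alert = breaches(current=4200, limit=5000)   # 84% of the queue limit

print("thread pool alert:", thread_alert)
print("queue depth alert:", queue_alert)
```

Both alerts would have fired on the values observed here: the thread pool was at 100%, and the queue crosses the threshold at 4000 requests, below the 4200 seen.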
Similar past incidents
lexical match (pg_trgm)
- 63%: [Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
  Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 59%: [Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
  Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 52%: [Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
  Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 48%: [Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
  Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 35%: [Eval][v2][en] Regional 5xx spike after DNS TTL change
  us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.