[Eval][v3][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:59:06 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
Checkout success rate dropped from 99.5% to 22% due to upstream Stripe Connect API latency spike (p99 27s), causing thread pool exhaustion and 30s timeouts. Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC. No circuit breaker or bulkhead on Stripe client, so all 200 worker threads blocked, queue depth 4200. Estimated 3000 failed checkouts and $180k lost GMV in 5 minutes.
Severity reasoning: User-facing outage: success rate dropped to 22% (below 99% threshold) for >5 minutes, revenue path broken (checkout failures causing $180k GMV loss). Matches SEV1 criteria: error rate >1% for >5 min, revenue path broken.
deepseek-chat·prompt v3·output: en·12942ms·2002↑ / 2007↓ tok·$0.00275
Root cause hypotheses
- highStripe Connect API latency spike causes thread pool exhaustion in checkout-svc
Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC; thread pool utilization 100%, all 200 threads blocked.
- highNo circuit breaker on Stripe client allows cascading failures
Evidence: Config review shows no circuit breaker or bulkhead; Stripe calls share main worker thread pool; inbound queue depth 4200 (limit 5000) indicates requests piling up.
- mediumStripe timeout (30s) matches inbound gateway timeout, causing no fast failure
Evidence: Stripe call timeout is 30s, same as inbound gateway timeout; p99 latency 28s hits timeout, but threads block for full 30s before releasing.
- mediumRegional issue in us-east-1 affecting Stripe Connect
Evidence: Stripe status page specifically calls out us-east-1; checkout-svc likely uses us-east-1 Stripe endpoints.
- lowAuth-svc or fraud-svc contributing to latency
Evidence: APM shows auth-svc and fraud-svc latencies normal; ruled out.
Investigation checklist
- Confirm Stripe status page for Connect latency
curl -s https://status.stripe.com/ | grep -i 'connect'Expected: Should show 'Investigating elevated latency for Connect endpoints in us-east-1' or similar.
- Check thread pool utilization and queue depth in checkout-svc
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threadpool | jq '.threadPool.utilization, .queueDepth'Expected: Utilization near 100%, queue depth near 4200.
- Verify Stripe call latency in APM
curl -s 'http://apm-dashboard.internal/d/checkout-svc?var-service=checkout-svc&var-downstream=stripe&from=now-15m' | jq '.series[0].values[-1]'Expected: p99 latency for Stripe calls should be ~27s.
- Check if circuit breaker exists on Stripe client
kubectl exec -n prod deploy/checkout-svc -- cat /app/config.yml | grep -A5 'stripe' | grep 'circuitBreaker'Expected: No output (circuit breaker not configured).
- Check other downstream services latency
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.details.auth-svc.latency, .details.fraud-svc.latency'Expected: Latency should be normal (<100ms).
Mitigation plan
Enable circuit breaker on Stripe client to fail fast when latency exceeds threshold (e.g., 2s).
Risk: May cause partial checkout failures if circuit opens, but prevents thread pool exhaustion. Safer than disabling Stripe entirely.
Rollback: Disable circuit breaker via feature flag or config revert.
Add bulkhead isolation for Stripe calls (dedicated thread pool of 20 threads) to protect main worker pool.
Risk: If bulkhead is too small, Stripe calls may queue up; but main pool remains healthy.
Rollback: Remove bulkhead config or increase pool size.
Reduce Stripe call timeout from 30s to 5s to fail fast and free threads sooner.
Risk: May cause false positives during transient latency spikes; but acceptable during incident.
Rollback: Revert timeout to 30s.
If Stripe outage persists, disable Stripe payment path and fall back to alternative payment provider or show error message.
Risk: Lost revenue from Stripe payments; but prevents further failures. Destructive if no fallback.
Rollback: Re-enable Stripe path when Stripe recovers.
Customer impact
Approximately 78% of checkout attempts are failing with a 504 error. Users see a timeout or error page after waiting up to 30 seconds. Estimated 3000 failed checkouts in the last 5 minutes, resulting in approximately $180,000 in lost revenue. No ETA yet; Stripe is investigating.
Postmortem draft
Summary
Checkout success rate dropped from 99.5% to 22% due to upstream Stripe Connect API latency spike. No circuit breaker or bulkhead caused thread pool exhaustion.
Timeline (UTC)
- 18:35 - Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1'
- 18:38 - Pager triggered for checkout-svc high error rate
- 18:40 - Success rate drops to 22%, p99 latency 28s
- 18:41 - Confirmed Stripe is culprit via APM and status page
- 18:43 - Team debating mitigation options
- [FILL IN] - Mitigation applied
- [FILL IN] - Recovery complete
Impact
- 3000 failed checkouts in 5 minutes
- $180k lost GMV
- 78% of users affected
Root Cause
Stripe Connect API latency spike (p99 27s) combined with lack of circuit breaker and bulkhead caused thread pool exhaustion in checkout-svc.
Detection
Pager alert based on error rate threshold. Could have been faster if circuit breaker metrics were monitored.
Response
[FILL IN]
What Went Well
- Quick identification of upstream dependency
- Status page confirmation
What Went Poorly
- No circuit breaker or bulkhead
- Timeout too long (30s)
- No automated fallback
Action Items
- [ ] Add circuit breaker to Stripe client (P0)
- [ ] Add bulkhead isolation for Stripe calls (P0)
- [ ] Reduce Stripe timeout to 5s (P1)
- [ ] Create runbook for Stripe outage (P1)
- [ ] Monitor Stripe status page proactively (P2)
Follow-ups
- P0Add circuit breaker to Stripe client with 2s timeout and 50% failure threshold— payments-platform
- P0Implement bulkhead isolation for Stripe calls with dedicated thread pool of 20 threads— payments-platform
- P1Reduce Stripe call timeout from 30s to 5s— on-call SRE
- P1Create runbook for Stripe outage with fallback procedures— service owner
- P2Set up alert on Stripe status page changes via webhook— platform team
- P2Review and tune thread pool and queue limits for checkout-svc— service owner
Similar past incidents
lexical match (pg_trgm)
- 64%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 63%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 47%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 44%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 32%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts