[Eval][v3][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svc
created: 6/2/2026, 11:38:59 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.
Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)
Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal
Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead: Stripe calls share the main worker thread pool
Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)
On-call:
- 18:38: pager
- 18:41: confirmed Stripe is the culprit (status page + APM)
- 18:43: debating: wait it out vs disable Stripe path entirely
Summary
Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and HTTP 504 errors. Stripe reported issues with their Connect endpoints in us-east-1 starting at 18:35 UTC. No circuit breaker or bulkhead on Stripe calls led to cascading failure.
Severity reasoning: User-facing outage: checkout success rate dropped to 22% (below 99% threshold), error rate >1% for >5 minutes, revenue path broken (estimated $180k GMV lost). This meets SEV1 criteria.
deepseek-chat·prompt v3·output: en·12353ms·2002↑ / 2026↓ tok·$0.00277
Root cause hypotheses
- high: Stripe Connect API latency spike caused thread pool exhaustion in checkout-svc
Evidence: APM shows Stripe Connect API p99 latency jumped from 800ms to 27s. Thread pool utilization is at 100%, with all 200 threads blocked. Stripe status page confirms elevated latency for Connect endpoints in us-east-1.
- high: No circuit breaker on the Stripe client, so every request waited out the slow Stripe call instead of failing fast
Evidence: Configuration review shows no circuit breaker or bulkhead for Stripe calls. Inbound queue depth of 4200 (limit 5000) indicates requests piling up.
- medium: Stripe call timeout (30s) matches the inbound gateway timeout, so no call fails early
Evidence: Stripe call timeout is 30s, the same as the inbound gateway timeout. A downstream call can therefore hold a worker thread for the entire lifetime of the inbound request without ever fast-failing.
- medium: Stripe issue is isolated to us-east-1, and checkout-svc appears to call only that region
Evidence: Stripe status page mentions us-east-1. No evidence of multi-region Stripe configuration.
- low: auth-svc or fraud-svc also affected but not detected
Evidence: APM shows auth-svc and fraud-svc latencies normal. No evidence of impact.
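The thread-exhaustion mechanism behind the first hypotheses can be sanity-checked with rough arithmetic based on Little's law (sustainable throughput is at most concurrency divided by time per request). The sketch below is in Python purely for illustration and uses the figures from the incident context. It treats the p99 latency as if every call took that long, so the results are worst-case bounds, not measurements. The function name is hypothetical.

```python
def max_throughput(worker_threads: int, seconds_per_request: float) -> float:
    """Upper bound on requests/second when each request holds one worker
    thread for `seconds_per_request` (Little's law: L = lambda * W)."""
    return worker_threads / seconds_per_request


WORKERS = 200  # worker threads, from the incident context

normal = max_throughput(WORKERS, 0.8)     # Stripe p99 before the incident (800ms)
degraded = max_throughput(WORKERS, 27.0)  # Stripe p99 during the incident (27s)
capped = max_throughput(WORKERS, 5.0)     # if calls were cut off at a 5s timeout

print(f"normal:   {normal:.1f} req/s")    # normal:   250.0 req/s
print(f"degraded: {degraded:.1f} req/s")  # degraded: 7.4 req/s
print(f"capped:   {capped:.1f} req/s")    # capped:   40.0 req/s
```

Under these assumptions the pool's ceiling falls by a factor of roughly 34 once each Stripe call takes 27s, which is consistent with the inbound queue filling up while all 200 threads stay blocked.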
Investigation checklist
- Verify Stripe status page for current incident details
curl -s https://status.stripe.com/api/v1/incidents | jq '.incidents[] | select(.status != "resolved")'
Expected: Incident with elevated latency for Connect endpoints in us-east-1, status 'investigating' or 'identified'
- Check thread pool utilization and queue depth in checkout-svc
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threadpool | jq '.threadPool.utilization, .queueDepth'
Expected: utilization: 100%, queueDepth: approaching the 5000 limit (4200 at 18:40)
- Check Stripe client configuration for circuit breaker and timeout
kubectl exec -n prod deploy/checkout-svc -- cat /app/config/stripe.yml | grep -E 'timeout|circuit|bulkhead'
Expected: timeout: 30000, no circuit breaker or bulkhead settings
- Check APM for Stripe Connect latency trend
curl -s 'http://apm-dashboard:8080/api/v1/metrics?service=checkout-svc&metric=stripe.latency&time=18:35-18:45' | jq '.series[] | select(.tags.endpoint=="/v1/payment_intents")'
Expected: p99 latency spiking from 800ms to 27s starting at 18:35
- Check if Stripe has a failover region or alternate endpoint
kubectl exec -n prod deploy/checkout-svc -- cat /app/config/stripe.yml | grep -E 'region|endpoint'
Expected: Only us-east-1 endpoint configured
Mitigation plan
Enable circuit breaker on Stripe client to fail fast when latency exceeds threshold (e.g., 5s)
Risk: May cause false positives if threshold too low; could temporarily block legitimate traffic if Stripe recovers quickly
Rollback: Disable circuit breaker by reverting config change
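A minimal, single-threaded sketch of the circuit-breaker idea in the item above, written in Python for illustration only (the source does not say which language or resilience library checkout-svc uses). The names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_after` are hypothetical. It assumes slow Stripe calls surface as exceptions, for example through a client-side timeout, rather than measuring latency directly.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to stay open before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of tying up a worker thread.
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise half-open: let this call through as a trial.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

This sketch is not thread-safe and lets every caller through during the half-open window; a production breaker would need locking, would limit trial calls, and would more likely come from an established resilience library than from hand-rolled code.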
Reduce Stripe call timeout from 30s to 5s to free threads faster
Risk: Some legitimate Stripe calls may time out when Stripe is slow but still succeeding, which could increase the error rate
Rollback: Revert timeout to 30s
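One way to keep the downstream timeout strictly below the inbound one is to derive it from the time remaining on the inbound request. The Python sketch below illustrates that idea only; the function name `downstream_timeout` and the `cap` and `reserve` parameters are hypothetical, and the 5s cap is simply the value proposed above.

```python
import time
from typing import Optional


def downstream_timeout(deadline: float, cap: float = 5.0, reserve: float = 0.5,
                       now: Optional[float] = None) -> float:
    """Return how many seconds a downstream call may take.

    deadline: monotonic-clock time at which the inbound request times out.
    cap:      hard upper bound for any single downstream call.
    reserve:  time held back so an error response can still be sent.
    Raises TimeoutError if no budget remains.
    """
    if now is None:
        now = time.monotonic()
    remaining = deadline - now - reserve
    if remaining <= 0:
        raise TimeoutError("request deadline already exhausted")
    return min(cap, remaining)


# A request that arrived at t=0 with a 30s inbound timeout:
print(downstream_timeout(deadline=30.0, now=0.0))   # 5.0  (capped)
print(downstream_timeout(deadline=30.0, now=27.0))  # 2.5  (only 2.5s of budget left)
```

The returned value would be passed to whatever timeout setting the Stripe HTTP client exposes, so a worker thread is always released before the inbound gateway gives up on the request.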
Implement bulkhead isolation for Stripe calls (dedicated thread pool of 50 threads)
Risk: May limit throughput to Stripe; if bulkhead pool is too small, Stripe calls may queue up
Rollback: Remove bulkhead config or increase pool size
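The bulkhead item above can be sketched as a dedicated pool plus a semaphore that rejects work once every slot is taken, so Stripe calls cannot queue without bound. This is an illustrative Python sketch under assumptions, not the service's actual implementation; `Bulkhead`, `BulkheadFullError`, and `max_concurrent` are hypothetical names, and 50 is the pool size proposed above.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when all bulkhead slots are in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int = 50):
        # Semaphore bounds in-flight calls; the pool runs them off the main workers.
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent,
                                        thread_name_prefix="stripe")

    def submit(self, fn, *args, **kwargs) -> Future:
        # Reject immediately instead of queueing when the bulkhead is full.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("stripe bulkhead full; rejecting call")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the call finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future
```

A caller would typically wait on the result with a bound, for example `bulkhead.submit(call_stripe, order).result(timeout=5)`, where `call_stripe` and `order` are placeholders. Note that `result(timeout=...)` only stops the caller from waiting; the underlying call still needs its own timeout so the slot is eventually released. With this shape, at most `max_concurrent` calls are in flight to Stripe at once and the rest are rejected immediately.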
If Stripe incident persists, disable Stripe payment path and fall back to alternative payment provider or show maintenance page
Risk: Lost revenue from Stripe payments; customer frustration if payment fails
Rollback: Re-enable Stripe path after incident resolved
Customer impact
Approximately 78% of checkout attempts are failing with an HTTP 504 error. Customers cannot complete purchases using credit/debit cards processed through Stripe. An estimated 3000 checkouts failed in the last 5 minutes, representing roughly $180k in lost GMV so far. There is no ETA for resolution; Stripe is still investigating.
Postmortem draft
Summary
Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion. No circuit breaker or bulkhead on Stripe calls led to cascading failure.
Timeline (UTC)
- 18:35 - Stripe status page reports elevated latency for Connect endpoints in us-east-1
- 18:38 - Pager alert for checkout-svc high error rate
- 18:40 - Checkout success rate at 22%, p99 latency 28s
- 18:41 - Confirmed Stripe as culprit via APM and status page
- 18:43 - Team debating mitigation options
- [FILL IN] - Mitigation applied
- [FILL IN] - Incident resolved
Impact
- 3000 failed checkouts in 5 minutes
- Estimated $180k lost GMV
- ~78% of checkout attempts failing with HTTP 504
Root Cause
Stripe Connect API latency spike (p99 from 800ms to 27s) combined with lack of circuit breaker and bulkhead caused thread pool exhaustion in checkout-svc.
Detection
Pager alert triggered by error rate threshold. Stripe status page confirmed external dependency issue.
Response
- [FILL IN] - Actions taken
What Went Well
- Quick identification of Stripe as root cause via APM and status page
- Team communication clear
What Went Poorly
- No circuit breaker or bulkhead on Stripe calls
- Timeout configuration matched inbound timeout, exacerbating thread exhaustion
Action Items
- [ ] Add circuit breaker to Stripe client
- [ ] Implement bulkhead isolation for downstream calls
- [ ] Reduce Stripe timeout to 5s
- [ ] Add monitoring for thread pool utilization and queue depth
- [ ] Review all external dependency configurations for resilience patterns
Follow-ups
- P0: Add circuit breaker to Stripe client with appropriate thresholds (owner: checkout-svc team)
- P0: Implement bulkhead isolation for Stripe calls (dedicated thread pool) (owner: checkout-svc team)
- P1: Reduce Stripe call timeout from 30s to 5s (owner: checkout-svc team)
- P1: Add monitoring alerts for thread pool utilization and queue depth (owner: platform team)
- P1: Review all external dependency configurations for resilience patterns (circuit breaker, bulkhead, timeouts) (owner: platform team)
- P2: Create runbook for Stripe dependency failure with mitigation steps (owner: on-call SRE)
- P2: Evaluate multi-region Stripe configuration for failover (owner: checkout-svc team)
Similar past incidents
lexical match (pg_trgm)
- 54%: [Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 54%: [Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 49%: [Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 48%: [Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 35%: [Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.