[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svc
created: 6/2/2026, 11:37:58 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.
Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)
Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal
Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead: Stripe calls share the main worker thread pool
Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)
On-call:
- 18:38: pager
- 18:41: confirmed Stripe is the culprit (status page + APM)
- 18:43: debating: wait it out vs disable Stripe path entirely
Summary
Checkout success rate dropped from 99.5% to 22% after a Stripe Connect API latency spike (p99 from 800ms to 27s) exhausted the shared 200-thread worker pool, causing HTTP 504 timeouts for ~78% of checkout attempts. Estimated $180k GMV lost in the first 5 minutes.
Severity reasoning: User-facing outage with direct revenue impact (~$180k GMV lost in 5 minutes), affecting ~78% of checkout attempts. Severe degradation of a critical payment path.
deepseek-chat·prompt v1·output: en·11813ms·1370↑ / 1823↓ tok·$0.00238
Root cause hypotheses
- high: Stripe Connect API latency spike in us-east-1 causing downstream timeouts
Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC.
- high: Thread pool exhaustion due to long-running Stripe calls blocking all worker threads
Evidence: Thread pool utilization at 100% (200 threads blocked), inbound queue depth 4200 (limit 5000). No circuit breaker or bulkhead on Stripe client.
- medium: Stripe timeout (30s) matching the inbound gateway timeout, so a slow Stripe call holds a worker thread for the full life of the request
Evidence: Stripe call timeout set to 30s, same as inbound gateway timeout. No timeout differentiation or retry budget.
- low: Recent deployment or config change to checkout-svc increased Stripe call volume or changed behavior
Evidence: None in the incident context; check deployment history and recent config changes to rule this out.
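A rough back-of-the-envelope check of the thread pool exhaustion hypothesis, using Little's law (busy threads ≈ arrival rate × time each request holds a thread). This is a sketch under stated assumptions, not a measurement: the arrival rate is inferred from "~3000 failed checkouts in 5 min at ~78% failure", and p99 latency is used as the hold time, which overstates the average.

```python
# Little's law: busy threads ~= arrival rate x time each request holds a thread.
FAILED_CHECKOUTS = 3000      # from the incident context (~3000 in 5 min)
WINDOW_SECONDS = 5 * 60
FAILURE_RATIO = 0.78         # ~78% of attempts returned 504
WORKER_THREADS = 200         # size of the shared worker pool


def threads_needed(arrival_rate_per_s: float, hold_time_s: float) -> float:
    """Average number of busy threads at steady state (Little's law)."""
    return arrival_rate_per_s * hold_time_s


# Assumption: total attempts = failed attempts / failure ratio.
arrival_rate = FAILED_CHECKOUTS / FAILURE_RATIO / WINDOW_SECONDS

# Assumption: p99 latency stands in for hold time (an upper bound on the mean).
before = threads_needed(arrival_rate, 0.8)   # Stripe p99 before the spike
during = threads_needed(arrival_rate, 27.0)  # Stripe p99 during the spike

print(f"arrival rate ~{arrival_rate:.1f} req/s")
print(f"threads needed before ~{before:.0f}, during ~{during:.0f}, pool {WORKER_THREADS}")
```

Under these assumptions the pool needs on the order of 10 threads at normal latency and well over 200 during the spike, which is consistent with 100% utilization and a growing inbound queue.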
Investigation checklist
- Confirm Stripe status page and APM data for ongoing latency
curl -s https://status.stripe.com | grep -i 'connect'
Then check the APM dashboard for Stripe p99 latency over the last 10 minutes.
Expected: Stripe status shows 'Investigating' or 'Resolved'; APM shows Stripe p99 > 20s or returning to normal
- Check thread pool utilization and queue depth in checkout-svc
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.threadPool'
Or check the metrics endpoint for 'tomcat.threads.busy' and 'tomcat.threads.config.max'.
Expected: Busy threads = 200 (max), queue depth near 5000
- Verify no other downstream dependencies are degraded
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.components'
Then check APM for auth-svc and fraud-svc p99 latency.
Expected: auth-svc and fraud-svc p99 at their normal baseline, health checks passing
- Check recent deployments or config changes to checkout-svc
kubectl rollout history deploy/checkout-svc
kubectl get configmap checkout-svc -o yaml | grep -i stripe
Expected: No recent changes; Stripe timeout still 30s, no circuit breaker config
- Check if Stripe calls are idempotent and can be retried safely
grep -r 'idempotency' /etc/checkout-svc/config
Then check the Stripe API docs for idempotency keys.
Expected: Idempotency keys are used; retries are safe
Mitigation plan
Enable circuit breaker on Stripe client with timeout 5s, failure threshold 50% in 60s window, half-open after 30s
Risk: May cause partial failures if the circuit opens prematurely, but prevents thread pool exhaustion
Rollback: Disable circuit breaker by reverting config change or setting threshold to 100%
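To make the proposed thresholds concrete, here is a minimal circuit breaker sketch using the values above (50% failures over a 60s window, half-open after 30s). It is illustrative only and not the service's actual client code: the class and parameter names are hypothetical, `min_calls` is an added assumption to avoid tripping on tiny samples, and the sketch is not thread-safe. A production service would more likely use an existing resilience library.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window_s=60.0, open_s=30.0,
                 min_calls=10, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls  # assumption: ignore very small samples
        self.clock = clock
        self.calls = deque()        # (timestamp, succeeded) pairs in the window
        self.opened_at = None       # None means the circuit is closed

    def _record(self, now, succeeded):
        self.calls.append((now, succeeded))
        # Drop observations that fell out of the sliding window.
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()
        failures = sum(1 for _, ok in self.calls if not ok)
        if (len(self.calls) >= self.min_calls
                and failures / len(self.calls) >= self.failure_ratio):
            self.opened_at = now    # trip the breaker

    def call(self, fn, *args, **kwargs):
        half_open = False
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_s:
                raise CircuitOpenError("circuit open; failing fast")
            half_open = True        # cool-down elapsed: allow a probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            if half_open:
                self.opened_at = self.clock()  # probe failed: re-open
            else:
                self._record(self.clock(), False)
            raise
        if half_open:
            self.opened_at = None   # probe succeeded: close and start fresh
            self.calls.clear()
        else:
            self._record(self.clock(), True)
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately instead of holding a worker thread until the timeout, which is what keeps the pool from filling up.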
Add bulkhead for Stripe calls: limit concurrent Stripe calls to 20 threads, queue 50
Risk: May throttle legitimate traffic if bulkhead too small; monitor queue drops
Rollback: Increase bulkhead size or remove bulkhead config
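The bulkhead idea can be sketched as a dedicated executor with bounded admission: 20 worker threads for Stripe calls plus room for 50 waiting calls, and immediate rejection beyond that. The `Bulkhead` and `BulkheadFullError` names are hypothetical, and this is a sketch of the pattern rather than the service's implementation.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when both the worker slots and the wait queue are full."""


class Bulkhead:
    def __init__(self, max_concurrent=20, max_queued=50):
        # Dedicated threads so Stripe calls cannot consume the main worker pool.
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent,
                                        thread_name_prefix="stripe")
        # One permit per call that is either running or waiting in the queue.
        self._permits = threading.BoundedSemaphore(max_concurrent + max_queued)

    def submit(self, fn, *args, **kwargs) -> Future:
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError("Stripe bulkhead full; rejecting call")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._permits.release()
            raise
        # Return the permit once the call finishes, whether it succeeded or not.
        future.add_done_callback(lambda _f: self._permits.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

The request thread still has to wait on the returned future, so this only helps when paired with a short wait (for example `future.result(timeout=...)`) so that a slow Stripe call cannot pin the caller for the full inbound timeout.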
Reduce Stripe call timeout from 30s to 5s to fail fast and free threads
Risk: May fail requests that would have succeeded slowly if the Stripe latency is transient, but prevents thread exhaustion
Rollback: Revert timeout to 30s
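One way to keep the downstream timeout strictly below the inbound one is to derive it from the time left on the inbound request. The helper below is a hypothetical sketch: the 5s cap comes from the plan above, while `reserve_s` (time kept back to build an error response) is an assumed parameter.

```python
def stripe_timeout(remaining_inbound_s: float, cap_s: float = 5.0,
                   reserve_s: float = 1.0) -> float:
    """Per-call Stripe timeout: capped, and always leaving reserve_s spare.

    Returns 0.0 when no budget is left, meaning the call should be skipped.
    """
    return max(0.0, min(cap_s, remaining_inbound_s - reserve_s))


# A fresh request with the full 30s inbound budget gets the 5s cap;
# a request that already spent most of its budget gets less.
fresh = stripe_timeout(30.0)
late = stripe_timeout(3.0)
```

With this shape, the Stripe call can never run as long as the inbound timeout, so a slow dependency produces a fast, explicit error instead of a 504 at the gateway.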
If the Stripe outage persists for more than 10 minutes, disable the Stripe payment path and fall back to a 'payment method unavailable' message
Risk: Lost revenue from Stripe payments, but prevents further failures on the rest of checkout
Rollback: Re-enable Stripe path after confirming Stripe recovery
Customer impact
Approximately 78% of checkout attempts are failing with HTTP 504 errors. Users see a timeout or error page after waiting up to 30 seconds. Estimated 3000 failed checkouts in 5 minutes, with $180k lost GMV. No ETA for full recovery; Stripe is investigating.
Postmortem draft
Postmortem: checkout-svc Outage ([Date])
Summary
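A Stripe Connect API latency spike (p99 from 800ms to 27s) exhausted the shared worker thread pool in checkout-svc. Checkout success rate dropped from 99.5% to 22%, with ~78% of attempts returning HTTP 504.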
Timeline
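- 18:35 UTC: Stripe status page reports elevated latency for Connect endpoints in us-east-1.
- 18:38 UTC: On-call paged.
- 18:40 UTC: checkout-svc success rate drops to 22%, p99 latency 28s.
- 18:41 UTC: On-call confirms Stripe is the culprit (status page + APM).
- 18:43 UTC: On-call debates waiting it out vs disabling the Stripe path entirely.
- [Time] UTC: Mitigation actions started (circuit breaker, bulkhead, timeout reduction).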
- [Time] UTC: Recovery confirmed.
Impact
- 3000 failed checkouts, $180k lost GMV.
- Users experienced 504 errors for 78% of attempts.
Root Cause
Stripe Connect API latency spike caused thread pool exhaustion in checkout-svc due to lack of circuit breaker and bulkhead.
What Went Well
- Quick detection via APM and Stripe status page.
- On-call identified the root cause within 3 minutes of the page (18:38 to 18:41).
What Went Poorly
- No circuit breaker or bulkhead on Stripe client.
- Stripe timeout (30s) matched inbound timeout, causing cascading failures.
Action Items
- [ ] Add circuit breaker to Stripe client (P0)
- [ ] Add bulkhead for Stripe calls (P0)
- [ ] Reduce Stripe timeout to 5s (P1)
- [ ] Implement fallback payment path (P1)
- [ ] Review all downstream dependencies for similar issues (P2)
Follow-ups
- P0: Add circuit breaker to Stripe client with proper thresholds (owner: service owner)
- P0: Add bulkhead for Stripe calls to isolate thread pool (owner: service owner)
- P1: Reduce Stripe call timeout to 5s and implement retry with backoff (owner: service owner)
- P1: Implement fallback payment method when Stripe is degraded (owner: platform team)
- P2: Review all downstream dependencies for circuit breaker and bulkhead patterns (owner: platform team)
- P1: Add alerting on thread pool utilization >80% and queue depth >1000 (owner: on-call SRE)
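The "retry with backoff" follow-up could look roughly like the sketch below: exponential backoff with full jitter, reusing one idempotency key across attempts so a retried payment request cannot create a second charge. The `send` callable is a hypothetical stand-in for the real Stripe client call, and the attempt count and delays are assumptions chosen to fit comfortably inside a 30s inbound budget with a 5s per-call timeout.

```python
import random
import time
import uuid


def call_with_retry(send, max_attempts=3, base_delay_s=0.2, max_delay_s=2.0,
                    sleep=time.sleep, rng=random.random):
    """Call send(idempotency_key), retrying timeouts with jittered backoff."""
    # One key for all attempts, so the provider can deduplicate retries.
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send(key)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: uniform in [0, ceiling)
```

Retries multiply load on an already slow dependency, so this is best combined with the circuit breaker follow-up rather than used on its own.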
Similar past incidents
lexical match (pg_trgm)
- 89%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 68%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 57%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 53%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 30%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts