[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:56:00 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
Checkout success rate dropped to 22% due to Stripe Connect API latency spike (p99 27s) exhausting the shared worker thread pool. All 200 threads blocked waiting on Stripe responses, causing HTTP 504s for 78% of requests. Estimated $180k GMV lost in 5 minutes.
Severity reasoning: User-facing outage with revenue impact (estimated $180k GMV lost). Full degradation of checkout for 78% of users, meeting SEV1 criteria.
deepseek-chat·prompt v1·output: en·10404ms·1370↑ / 1638↓ tok·$0.00217
Root cause hypotheses
- highStripe Connect API latency spike in us-east-1 causes thread pool exhaustion
Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1; thread pool utilization at 100%
- highNo circuit breaker on Stripe client allows unbounded retries or queuing
Evidence: Config review shows no circuit breaker; thread pool exhaustion consistent with all threads blocking on Stripe calls
- mediumStripe call timeout (30s) matches inbound timeout, causing cascading failures
Evidence: p99 latency 28s hits 30s timeout; inbound queue depth 4200/5000 indicates requests piling up
- highShared thread pool without bulkhead allows Stripe latency to block all checkout requests
Evidence: No bulkhead; Stripe calls share main worker pool; all 200 threads blocked
Investigation checklist
- Confirm Stripe latency via APM
curl -s 'http://apm-dashboard:8080/api/v1/services/checkout-svc/transactions?filter=stripe' | jq '.latency.p99'Expected: Value > 20s, confirming Stripe is the bottleneck
- Check thread pool metrics
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'thread_pool_active|thread_pool_queue'Expected: active threads = 200, queue depth > 4000
- Verify Stripe status page
curl -s 'https://status.stripe.com/api/v1/incidents?filter=us-east-1' | jq '.incidents[0].status'Expected: "investigating" or "identified" for Connect endpoints
- Check if circuit breaker exists in Stripe client config
kubectl exec deploy/checkout-svc -- cat /app/config.yaml | grep -A5 'stripe'Expected: No circuit_breaker or bulkhead settings present
- Check inbound queue depth trend
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/metrics | grep 'inbound_queue_depth'Expected: Value near 5000 (queue limit), indicating backpressure
Mitigation plan
Enable circuit breaker on Stripe client with failure threshold 5, timeout 10s, half-open after 30s
Risk: May temporarily reject valid payments if Stripe recovers slowly; no data loss
Rollback: Revert config change and restart checkout-svc pods
Reduce Stripe call timeout from 30s to 5s to fail fast and free threads
Risk: Some legitimate payments may fail if Stripe latency >5s; no data loss
Rollback: Revert timeout to 30s and restart pods
If Stripe outage persists, disable Stripe payment path and fall back to 'payment failed' page
Risk: All Stripe payments will fail; no data loss; users can retry later
Rollback: Re-enable Stripe path via feature flag
Customer impact
Approximately 78% of checkout attempts are failing with HTTP 504 errors. Users cannot complete purchases using any payment method that goes through Stripe. Estimated 3000 failed checkouts in 5 minutes, with $180k GMV lost. No ETA for resolution; Stripe is investigating.
Postmortem draft
Postmortem: Checkout Outage due to Stripe Latency
Summary:
- Date: [Date]
- Duration: [Start] - [End]
- Severity: SEV1
Timeline:
- 18:35 UTC - Stripe reports elevated latency for Connect endpoints
- 18:40 UTC - checkout-svc success rate drops to 22%
- 18:41 UTC - On-call confirms Stripe is culprit
- [Mitigation time] - Circuit breaker enabled
- [Recovery time] - Success rate returns to normal
Impact:
- 3000 failed checkouts, $180k GMV lost
- p99 latency 28s, thread pool exhausted
Root Cause:
- Stripe Connect API latency spike caused all worker threads to block, exhausting thread pool
- No circuit breaker or bulkhead on Stripe client
What Went Well:
- Quick detection via APM and Stripe status page
- On-call identified root cause within 3 minutes
What Went Poorly:
- No circuit breaker or bulkhead in place
- Stripe timeout matched inbound timeout, causing cascading failures
Action Items:
- [ ] Add circuit breaker to Stripe client (P0)
- [ ] Add bulkhead for Stripe calls (P1)
- [ ] Reduce Stripe timeout to 5s (P1)
- [ ] Add alert on thread pool utilization >80% (P1)
Follow-ups
- P0Add circuit breaker to Stripe client with proper thresholds— service owner
- P1Implement bulkhead isolation for Stripe calls— service owner
- P1Reduce Stripe call timeout to 5s— service owner
- P1Add alert on thread pool utilization >80%— on-call SRE
- P2Review all downstream dependencies for circuit breaker and bulkhead patterns— platform team
- P2Update runbook for Stripe outage scenarios— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 100%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 75%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 53%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 48%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 33%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts