[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svc
created: 5/25/2026, 10:21:26 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.
Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)
Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal
Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead: Stripe calls share the main worker thread pool
Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)
On-call:
- 18:38 - pager
- 18:41 - confirmed Stripe is the culprit (status page + APM)
- 18:43 - debating: wait it out vs disable Stripe path entirely
Summary
Checkout success rate dropped from 99.5% to 22% because a Stripe Connect API latency spike (p99 27s) exhausted the shared worker thread pool. All 200 threads were blocked waiting on Stripe responses, causing HTTP 504s for ~78% of checkout attempts. An estimated $180k GMV was lost in the first 5 minutes.
Severity reasoning: User-facing outage with direct revenue impact (estimated $180k GMV lost so far). Checkout is failing for ~78% of attempts, meeting SEV1 criteria.
deepseek-chat·prompt v1·output: en·9682ms
Root cause hypotheses
- high: Stripe Connect API experiencing elevated latency in us-east-1, causing downstream timeouts
Evidence: Stripe status page confirms 'Investigating elevated latency for Connect endpoints in us-east-1' posted 18:35 UTC. APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s.
- high: No circuit breaker on Stripe client allows all threads to block, leading to thread pool exhaustion
Evidence: Thread pool utilization at 100% with all 200 threads blocked. Inbound queue depth 4200. No circuit breaker configured.
- medium: Stripe call timeout (30s) matches the inbound gateway timeout (30s), so a slow Stripe call holds a worker thread until the request itself times out
Evidence: p99 latency 28s hitting 30s timeout. No bulkhead isolation for Stripe calls.
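The thread pool hypothesis can be sanity-checked with a rough capacity estimate based on Little's law. The sketch below uses only figures from the incident context; treating the p99 latency as the per-call hold time and deriving the arrival rate from the failed-checkout count are simplifying assumptions, so the results are order-of-magnitude only.

```python
# Back-of-the-envelope capacity check using Little's law: L = lambda * W.
# Inputs come from the incident context. Using p99 as the per-call hold time
# is a pessimistic assumption, not a measurement of the mean.

WORKER_THREADS = 200        # shared worker pool size
NORMAL_LATENCY_S = 0.8      # Stripe p99 before the incident
DEGRADED_LATENCY_S = 27.0   # Stripe p99 during the incident
FAILED_CHECKOUTS = 3000     # approximate failures in the window
FAILURE_RATE = 0.78         # share of attempts returning 504
WINDOW_S = 5 * 60           # 5-minute window


def max_throughput(threads: int, latency_s: float) -> float:
    """Requests/s the pool sustains if each request holds a thread for latency_s."""
    return threads / latency_s


def threads_needed(arrival_rate: float, latency_s: float) -> float:
    """Little's law: average number of busy threads at a given arrival rate."""
    return arrival_rate * latency_s


# Total attempts = failures / failure rate, spread over the window.
arrival_rate = FAILED_CHECKOUTS / FAILURE_RATE / WINDOW_S
normal_capacity = max_throughput(WORKER_THREADS, NORMAL_LATENCY_S)
degraded_capacity = max_throughput(WORKER_THREADS, DEGRADED_LATENCY_S)
needed = threads_needed(arrival_rate, DEGRADED_LATENCY_S)

print(f"estimated arrival rate: {arrival_rate:.1f} req/s")
print(f"capacity at 800ms:      {normal_capacity:.1f} req/s")
print(f"capacity at 27s:        {degraded_capacity:.1f} req/s")
print(f"threads needed at 27s:  {needed:.0f} (pool has {WORKER_THREADS})")
```

Under these assumptions the pool can serve roughly 7 req/s at 27s per call against an estimated demand of about 13 req/s, which is consistent with 100% thread utilization and a growing inbound queue.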
Investigation checklist
- Confirm Stripe status page for ongoing incident
curl -s https://status.stripe.com/api/v1/incidents | jq '.incidents[] | select(.status != "resolved")'
Expected: Active incident for Connect endpoints in us-east-1
- Check thread pool metrics in checkout-svc
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'tomcat_threads|jvm_threads'
Expected: tomcat_threads_busy == 200, tomcat_threads_max == 200
- Verify Stripe dependency health from inside checkout-svc (cross-check latencies in APM)
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.components.stripe'
Expected: Stripe health check failing or latency > 20s
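- Check inbound request backlog
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/metrics | grep 'http_server_requests_seconds_max'
Expected: Max request duration near the 30s timeout. This metric reports request latency, not queue depth; confirm the queue depth (near the 5000 limit) from the service's own queue or connector metrics.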
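The thread pool check above can be turned into a single utilization number. The following is a minimal sketch that parses Prometheus text-format output and divides busy threads by the maximum. The metric names are the ones used in the checklist's Expected line and are an assumption; the names actually exported depend on the metrics library and version. The sample payload is hypothetical, not captured from checkout-svc.

```python
from typing import Dict

# Hypothetical sample in Prometheus text format (not real incident output).
SAMPLE = """\
# HELP tomcat_threads_busy Busy Tomcat threads (hypothetical sample)
# TYPE tomcat_threads_busy gauge
tomcat_threads_busy{name="http-nio-8080"} 200.0
tomcat_threads_max{name="http-nio-8080"} 200.0
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comment lines.

    Assumes lines have no trailing timestamp, which is typical for an
    application's scrape endpoint.
    """
    values: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token; labels may contain spaces.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = values.get(name, 0.0) + float(value)
    return values


def utilization(metrics: Dict[str, float],
                busy_name: str = "tomcat_threads_busy",
                max_name: str = "tomcat_threads_max") -> float:
    """Fraction of worker threads currently busy (1.0 means fully saturated)."""
    return metrics[busy_name] / metrics[max_name]


print(f"thread pool utilization: {utilization(parse_metrics(SAMPLE)):.0%}")
```

In practice the text would come from the `/metrics` endpoint queried in the checklist, with `busy_name` and `max_name` adjusted to whatever the service really exports.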
Mitigation plan
- Enable a circuit breaker on the Stripe client that fails fast (returns 503 immediately) to protect the thread pool
Risk: All Stripe-dependent checkouts will fail fast instead of timing out; may increase error rate but prevents thread exhaustion
Rollback: Disable circuit breaker via feature flag or config revert
- If the Stripe incident persists, temporarily disable the Stripe payment path and fall back to an alternative payment provider or show a maintenance page
Risk: Loss of Stripe payment capability; may impact revenue but prevents total checkout failure
Rollback: Re-enable Stripe path once status page shows resolved
- Scale up checkout-svc replicas to increase thread pool capacity while Stripe is degraded
Risk: May increase load on Stripe; the added threads will also block while Stripe latency stays high, so this is a temporary measure only
Rollback: Scale down to original replicas after incident
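The fail-fast behavior in the first mitigation can be illustrated with a small circuit breaker. This is a single-threaded sketch in Python, not the checkout-svc implementation; the class name, thresholds, and reset window are hypothetical, and a production service would more likely use an established resilience library with thread-safe state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, retries after a pause.

    Not thread-safe; illustrative only.
    """

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast so no worker thread waits on the dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Reset window elapsed: allow a trial call (half-open state).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens it.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A request handler would wrap the Stripe call in `breaker.call(...)` and map `CircuitOpenError` to an immediate 503, so a worker thread is held for microseconds rather than up to 30s.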
Customer impact
Approximately 78% of checkout attempts are failing with HTTP 504 errors, so most users cannot complete purchases. An estimated 3000 checkouts failed in 5 minutes, with roughly $180k GMV lost so far. No ETA for resolution; recovery depends on Stripe or on the mitigations above.
Postmortem draft
Postmortem: Checkout Service Outage
Date: YYYY-MM-DD
Severity: SEV1
Duration: TBD
Summary
Checkout success rate dropped from 99.5% to 22% due to a Stripe Connect API latency spike. The slow calls exhausted the shared worker thread pool, and checkout-svc returned HTTP 504 to ~78% of checkout attempts.
Timeline
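- 18:35 UTC: Stripe status page reports elevated latency for Connect endpoints in us-east-1
- 18:38 UTC: Pager alert for checkout-svc
- 18:40 UTC: checkout-svc returning HTTP 504 to ~78% of checkout attempts
- 18:41 UTC: Confirmed Stripe as the trigger (status page + APM)
- 18:43 UTC: On-call debating mitigation: wait it out vs disable the Stripe path entirely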
Impact
- ~3000 failed checkouts in 5 minutes
- Estimated $180k GMV lost (so far)
- p99 latency 28s (against a 30s gateway timeout)
Root Cause
Stripe Connect API latency spike (p99 27s) combined with lack of circuit breaker and bulkhead isolation.
What Went Well
- Quick detection via pager
- Accurate root cause identification via APM and status page
What Went Poorly
- No circuit breaker on Stripe client
- No bulkhead for downstream calls
- Stripe call timeout (30s) matched the inbound gateway timeout (30s)
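One way to keep a downstream timeout strictly below the inbound one is to derive it from the time left on the request. The sketch below is illustrative: the function name and the 1s safety margin are hypothetical, and the 5s cap is the target proposed in the action items rather than a tested value.

```python
class DeadlineExceededError(Exception):
    """Raised when no time is left to call the dependency."""


def downstream_timeout_s(inbound_timeout_s: float, elapsed_s: float,
                         cap_s: float = 5.0, margin_s: float = 1.0) -> float:
    """Return the timeout to use for a downstream call.

    The result never exceeds cap_s and always leaves margin_s of the inbound
    budget so the caller can still return a clean error response.
    """
    remaining = inbound_timeout_s - elapsed_s - margin_s
    if remaining <= 0:
        raise DeadlineExceededError("no budget left for the downstream call")
    return min(cap_s, remaining)


# With a 30s inbound timeout, a fresh request gets the 5s cap.
print(downstream_timeout_s(30.0, 0.0))
# A request that has already spent 27s gets only the 2s that remain.
print(downstream_timeout_s(30.0, 27.0))
```

With this shape, a Stripe call can never hold a worker thread for the full inbound request lifetime, which is the failure mode seen in this incident.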
Action Items
- [ ] Add circuit breaker to Stripe client (P0)
- [ ] Implement bulkhead pattern for downstream dependencies (P1)
- [ ] Reduce Stripe timeout to 5s with retry (P1)
- [ ] Add monitoring for thread pool utilization (P2)
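The bulkhead action item could be sketched as a cap on concurrent calls to a single dependency, so that a slow Stripe cannot occupy every worker thread. This is a minimal Python illustration under assumed names (`Bulkhead`, `BulkheadFullError`); the right concurrency limit for checkout-svc is not known from the incident data and would need to be tuned.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when the dependency already has the maximum number of calls in flight."""


class Bulkhead:
    """Caps concurrent calls to one dependency using a non-blocking semaphore."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Do not wait for a slot: rejecting immediately keeps the caller's
        # thread free instead of queueing behind the slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many calls in flight")
        try:
            return fn()
        finally:
            self._slots.release()


# Hypothetical limit: let Stripe calls use at most 50 of the 200 worker threads.
stripe_bulkhead = Bulkhead(max_concurrent=50)
print(stripe_bulkhead.call(lambda: "payment intent created"))
```

Combined with a shorter call timeout, this bounds the worst case to the configured number of blocked threads rather than the whole pool.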
Follow-ups
- P0: Add circuit breaker to Stripe client with fallback (service owner)
- P1: Implement bulkhead isolation for downstream calls (service owner)
- P1: Reduce Stripe timeout to 5s with retry logic (service owner)
- P2: Add thread pool utilization alert (on-call SRE)
- P2: Review all downstream dependency configurations for circuit breakers (platform team)
Similar past incidents
lexical match (pg_trgm)
- 75%: [Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 53%: [Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 48%: [Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 33%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 30%: [Eval][v2][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out