[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:55:32 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
Checkout success rate collapsed to 22% due to Stripe Connect API latency spike (p99 27s) exhausting the shared worker thread pool. All 200 threads blocked waiting on Stripe responses, causing 78% of requests to hit the 30s gateway timeout. Estimated $180k GMV lost in 5 minutes.
Severity reasoning: User-facing outage with revenue impact (estimated $180k GMV lost). Full degradation of checkout for 78% of users, meeting SEV1 criteria.
deepseek-chat·prompt v1·output: en·9737ms·1370↑ / 1504↓ tok·$0.00202
Root cause hypotheses
- highStripe Connect API latency spike in us-east-1 causing downstream timeouts
Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC
- highNo circuit breaker on Stripe client allowing all threads to block
Evidence: Config review shows no circuit breaker or timeout shorter than inbound 30s; thread pool 100% utilization with all 200 threads blocked
- highNo bulkhead isolation causing Stripe latency to exhaust shared thread pool
Evidence: Thread pool shared across all downstream calls; Stripe calls consume all threads, blocking other dependencies (auth-svc, fraud-svc) which are healthy
Investigation checklist
- Confirm Stripe is the bottleneck by checking APM traces for checkout-svc
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:3000/metrics | grep 'http_client_requests_seconds_sum{downstream="stripe"}'Expected: High latency values (>20s) for Stripe downstream calls
- Check thread pool metrics to confirm exhaustion
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:3000/metrics | grep 'thread_pool_active_threads'Expected: Value equal to max threads (200) indicating full utilization
- Verify Stripe status page for ongoing incident
curl -s https://status.stripe.com/api/v1/incidents | jq '.incidents[] | select(.status != "resolved")'Expected: Active incident for Connect endpoints in us-east-1
- Check if other downstream dependencies are affected
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:3000/metrics | grep 'http_client_requests_seconds_sum{downstream=~"auth-svc|fraud-svc"}'Expected: Normal latency (<100ms) for auth-svc and fraud-svc
Mitigation plan
Disable Stripe payment path by toggling feature flag 'stripe_payments_enabled' to false, falling back to alternative payment provider or showing maintenance page
Risk: Customers cannot pay via Stripe; alternative provider may have capacity issues. No data loss.
Rollback: Set feature flag back to true once Stripe incident resolved
If feature flag not available, scale up checkout-svc replicas to increase thread pool capacity temporarily
Risk: Increased load on downstream services; may not help if Stripe latency persists. No destructive operations.
Rollback: Scale down replicas after Stripe recovery
Add circuit breaker and bulkhead for Stripe client in code (long-term fix, not immediate)
Risk: Requires code change and deployment; not for immediate mitigation.
Rollback: Revert code change if issues
Customer impact
Approximately 78% of checkout attempts are failing with HTTP 504 errors. Estimated 3000 failed checkouts in 5 minutes, resulting in ~$180k lost GMV. Customers cannot complete purchases using Stripe. No ETA for resolution; Stripe is investigating.
Postmortem draft
Postmortem: Checkout-svc Outage (2025-04-10)
Summary
Brief description of incident.
Timeline
- 18:35 UTC: Stripe reports elevated latency for Connect endpoints
- 18:40 UTC: checkout-svc success rate drops to 22%
- 18:41 UTC: On-call paged
- 18:43 UTC: Stripe confirmed as culprit
- [Mitigation time]: Feature flag disabled Stripe path
- [Recovery time]: Success rate returned to normal
Impact
- Failed checkouts: ~3000
- Lost GMV: ~$180k
- Duration: [X] minutes
Root Cause
Stripe Connect API latency spike combined with lack of circuit breaker and bulkhead isolation caused thread pool exhaustion.
What Went Well
- Quick detection via metrics and APM
- Stripe status page provided external confirmation
What Went Poorly
- No circuit breaker on Stripe client
- No bulkhead isolation for downstream calls
- No feature flag to disable Stripe quickly (if not available)
Action Items
- [ ] Add circuit breaker to Stripe client (P0)
- [ ] Implement bulkhead isolation for downstream calls (P1)
- [ ] Create feature flag to disable payment providers (P1)
- [ ] Review timeout configurations (P2)
Follow-ups
- P0Add circuit breaker to Stripe HTTP client with timeout and failure threshold— service owner
- P1Implement bulkhead isolation for downstream calls (separate thread pool for each dependency)— service owner
- P1Create feature flag to disable Stripe payment path without deployment— platform team
- P2Review and reduce Stripe call timeout from 30s to 5s with retry— service owner
- P2Add alert on thread pool utilization >80%— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 93%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 73%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 53%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 49%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 31%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts