[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:55:46 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and 504 errors. Estimated $180k GMV lost in 5 minutes.
Severity reasoning: User-facing outage with revenue impact (estimated $180k GMV lost), affecting ~78% of checkout attempts. Full outage of a critical payment path.
deepseek-chat·prompt v1·output: en·9142ms·1370↑ / 1439↓ tok·$0.00195
Root cause hypotheses
- highStripe Connect API latency spike in us-east-1 causing downstream timeout
Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC.
- highNo circuit breaker on Stripe client causing cascading thread pool exhaustion
Evidence: Thread pool utilization at 100% with all 200 worker threads blocked; Stripe call timeout matches inbound timeout (30s), no bulkhead isolation.
- mediumInbound request surge amplifying Stripe latency impact
Evidence: Inbound queue depth 4200 (limit 5000), indicating high traffic volume; combined with slow Stripe responses, threads are held longer.
Investigation checklist
- Confirm Stripe latency via APM
curl -s 'http://apm-dashboard:8080/api/traces?service=checkout-svc&span=stripe-payment&from=18:35&to=18:45' | jq '.p99_latency_ms'Expected: Value near 27000ms, confirming Stripe is the bottleneck.
- Check thread pool state
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/actuator/threaddump | grep 'pool-1-thread' | wc -lExpected: 200 threads in BLOCKED or WAITING state.
- Verify Stripe status page
curl -s 'https://status.stripe.com/api/v1/incidents?since=18:30' | jq '.incidents[] | select(.name | contains("Connect"))'Expected: Incident with status 'investigating' for Connect endpoints in us-east-1.
- Check if circuit breaker exists
kubectl exec deploy/checkout-svc -- grep -r 'circuitBreaker' /app/config/Expected: No output, confirming no circuit breaker configured.
Mitigation plan
Enable feature flag to disable Stripe payment path and fall back to a degraded mode (e.g., show 'payment unavailable' or use alternative provider if available).
Risk: Customers cannot complete checkout via Stripe; potential revenue loss but prevents further timeouts and thread exhaustion.
Rollback: Disable the feature flag to re-enable Stripe path once Stripe recovers.
If fallback not possible, increase thread pool size temporarily to 400 and reduce Stripe timeout to 10s to fail fast.
Risk: Increased thread count may cause memory pressure; shorter timeout may cause partial failures but reduces thread blocking.
Rollback: Revert thread pool to 200 and timeout to 30s after incident.
Customer impact
Approximately 78% of checkout attempts are failing with HTTP 504 errors. Users see a timeout or error page after 30 seconds. Estimated 3000 failed checkouts in 5 minutes, with $180k GMV lost. No ETA for resolution; dependent on Stripe recovery.
Postmortem draft
Postmortem: Checkout-svc Outage (2025-03-27)
Summary
Brief description of the incident.
Timeline
- 18:35 UTC: Stripe reports elevated latency for Connect endpoints.
- 18:40 UTC: checkout-svc success rate drops to 22%, p99 latency 28s.
- 18:41 UTC: On-call paged; Stripe identified as culprit.
- 18:43 UTC: Mitigation actions taken.
- [Time] UTC: Incident resolved.
Impact
- Failed checkouts: ~3000
- Lost GMV: $180k
- Duration: [X] minutes
Root Cause
Stripe Connect API latency spike combined with lack of circuit breaker and bulkhead caused thread pool exhaustion.
What Went Well
- Quick identification of Stripe as root cause via APM and status page.
- [Other positives]
What Went Poorly
- No circuit breaker on Stripe client.
- No bulkhead isolation for downstream calls.
- [Other negatives]
Action Items
- [ ] Add circuit breaker to Stripe client (P0)
- [ ] Implement bulkhead pattern for downstream dependencies (P1)
- [ ] Reduce Stripe timeout to 10s with retry (P1)
- [ ] Add alert on thread pool utilization >80% (P2)
Follow-ups
- P0Add circuit breaker to Stripe client with fallback logic— service owner
- P1Implement bulkhead pattern for downstream calls (separate thread pool for Stripe)— service owner
- P1Reduce Stripe timeout to 10s and add retry with exponential backoff— service owner
- P2Add alert on thread pool utilization >80% and queue depth >3000— on-call SRE
- P2Review and update incident response runbook for Stripe dependency failures— platform team
Similar past incidents
lexical match (pg_trgm)
- 70%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 62%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 60%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 57%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 32%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts