[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svc
created: 5/25/2026, 10:22:02 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.
Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)
Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal
Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead: Stripe calls share the main worker thread pool
Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)
On-call:
- 18:38 - pager
- 18:41 - confirmed Stripe is the culprit (status page + APM)
- 18:43 - debating: wait it out vs disable Stripe path entirely
Summary
The checkout service is failing ~78% of checkout attempts because of a Stripe Connect API latency spike (p99 27s vs 800ms baseline). All 200 worker threads are blocked waiting on Stripe responses, which exhausts the thread pool and produces HTTP 504 timeouts. Stripe has acknowledged the issue on its status page. Estimated $180k GMV lost in 5 minutes.
Severity reasoning: User-facing outage: success rate dropped from 99.5% to 22% (>1% error rate for >5 min). Revenue path broken: estimated $180k GMV lost. Regional unavailability: checkout service effectively down for most users.
deepseek-chat·prompt v2·output: en·14767ms
Root cause hypotheses
- high: Stripe Connect API latency spike causing thread pool exhaustion
Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; thread pool utilization 100%; Stripe status page confirms elevated latency in us-east-1 since 18:35 UTC.
- high: No circuit breaker on Stripe client causing cascading failures
Evidence: Config review shows no circuit breaker or bulkhead for Stripe calls; all threads block waiting on Stripe, leading to queue buildup (4200 requests).
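- medium: Stripe call timeout equal to the inbound timeout, with possible retry amplification
Evidence: The inbound gateway timeout and the Stripe call timeout are both 30s, so a slow Stripe call holds its worker thread until the gateway has already given up on the request. No client-side retry policy is visible, but if retries exist, they could amplify load.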
- low: Resource exhaustion on checkout-svc pods (CPU/memory)
Evidence: Thread pool exhaustion is the primary symptom, but CPU/memory metrics not provided; could be secondary factor.
- low: Network issue between checkout-svc and Stripe
Evidence: Stripe status page indicates their own issue; network path unlikely given other services normal.
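The thread pool exhaustion in the top hypotheses can be checked with a rough Little's law estimate (throughput ceiling = concurrent workers / time per request). The sketch below uses the figures from the incident context; treating p99 as the per-call service time is an assumption made here because the mean latency was not reported, so the results are pessimistic bounds rather than measurements. The function name is illustrative.

```python
# Rough capacity estimate via Little's law: throughput = concurrency / time per request.
# p99 is used as a pessimistic stand-in for per-call service time (assumption).

WORKER_THREADS = 200  # shared worker pool size from the incident context


def max_throughput_rps(threads: int, seconds_per_call: float) -> float:
    """Upper bound on requests/second when every request holds one thread."""
    return threads / seconds_per_call


baseline = max_throughput_rps(WORKER_THREADS, 0.8)   # Stripe p99 before the spike
degraded = max_throughput_rps(WORKER_THREADS, 27.0)  # Stripe p99 during the spike

print(f"baseline ceiling: {baseline:.0f} req/s")
print(f"degraded ceiling: {degraded:.1f} req/s")
print(f"capacity lost: {100 * (1 - degraded / baseline):.0f}%")
```

Under these assumptions the pool's ceiling falls from about 250 req/s to about 7.4 req/s, which is consistent with the queue filling toward its 5000-request limit.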
Investigation checklist
- Check Stripe status page for updates
curl -s https://status.stripe.com/api/v2/status.json | jq '.status.description'
Expected: Should show 'Investigating elevated latency for Connect endpoints in us-east-1' or resolved status.
- Verify thread pool exhaustion in checkout-svc logs
kubectl logs -n prod -l app=checkout-svc --since=20m --tail=-1 | grep -i 'thread pool' | tail -20
Expected: Lines indicating all threads busy, queue full, or rejected tasks.
- Check APM for Stripe dependency latency breakdown
curl -s 'http://apm-dashboard:8080/api/v1/services/checkout-svc/dependencies/stripe?time=18:30-18:50' | jq '.p99_latency_ms'
Expected: p99 latency around 27000ms (27s) vs baseline 800ms.
- Inspect Stripe client configuration for circuit breaker and timeout
kubectl exec -n prod deploy/checkout-svc -- cat /app/config/stripe.yaml | grep -E 'timeout|circuit|bulkhead|retry'
Expected: No circuit breaker or bulkhead settings; timeout likely 30s.
- Check inbound queue depth and request rate
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'queue_depth|request_rate'
Expected: queue_depth near 4200, request_rate stable or increasing.
- Verify other downstream services are healthy
kubectl logs -n prod -l app=auth-svc --since=10m --tail=-1 | grep -c 'error'; kubectl logs -n prod -l app=fraud-svc --since=10m --tail=-1 | grep -c 'error'
Expected: Low error counts (0-5) indicating auth and fraud are normal.
- Check if Stripe API key or rate limit issues exist
kubectl logs -n prod -l app=checkout-svc --since=20m --tail=-1 | grep -iE 'rate limit|429|too many requests' | head -10
Expected: No rate limit errors; Stripe issue is latency, not throttling.
Mitigation plan
Enable circuit breaker on Stripe client to fail fast and shed load
Risk: May cause partial checkout failures for users if circuit opens; but better than total outage. Safer than disabling Stripe entirely.
Rollback: Disable the circuit breaker (remove its config, or set the threshold to 0 if the library in use treats 0 as disabled), then restart pods.
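To illustrate what "fail fast" means for the circuit breaker step, here is a minimal single-threaded sketch. The class names, the default threshold, and the reset timeout are hypothetical placeholders, not checkout-svc's actual configuration; a production version would need locking, a limit on concurrent half-open probes, and would more likely come from an existing resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout_s = reset_timeout_s      # how long to stay open before probing
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: reject immediately so no worker thread waits on the dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Reset timeout elapsed: half-open, let this call through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the breaker, or re-open it after a failed probe.
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In this sketch the Stripe client call would be wrapped as `breaker.call(create_payment_intent, ...)`, where `create_payment_intent` stands in for whatever function checkout-svc uses; callers catch `CircuitOpenError` and return an immediate error instead of holding a thread for up to 30s.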
Increase thread pool size temporarily to handle queued requests
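Risk: New threads will also block on Stripe while its latency stays near 27s, so this only buys time. It may also increase resource pressure (CPU/memory) on pods and could cause OOM if the pool is too large. Monitor closely.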
Rollback: Revert thread pool size to original value and restart pods.
Implement bulkhead isolation for Stripe calls to prevent thread pool exhaustion
Risk: Requires code change and deployment; not immediate. For now, use circuit breaker as faster mitigation.
Rollback: Remove bulkhead config and restart.
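A bulkhead for the Stripe path could look roughly like the following: a dedicated, fixed-size pool guarded by a semaphore, so that at most `max_concurrent` calls are in flight and excess calls are rejected immediately rather than occupying the main worker pool. The class names and the rejection behaviour are assumptions for illustration, not the service's existing code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when the dependency's dedicated capacity is fully in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        # One semaphore slot per worker: no hidden queue builds up behind slow calls.
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(
            max_workers=max_concurrent, thread_name_prefix="stripe"
        )

    def submit(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject at once instead of waiting for a free slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("stripe bulkhead saturated")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the call finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future
```

A caller would typically do `bulkhead.submit(call_stripe).result(timeout=...)`, with `call_stripe` a placeholder name. Because saturation surfaces as `BulkheadFullError` right away, requests that do not need Stripe, and Stripe requests beyond the cap, no longer wait behind blocked threads.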
Reduce Stripe call timeout to 5s to fail fast and avoid queue buildup
Risk: May fail payments that would have completed in 5-30s if Stripe is only moderately slow; 5s is still well above the 800ms p99 baseline. Safer than 30s.
Rollback: Revert timeout to 30s and restart pods.
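One way to keep the Stripe timeout below the inbound deadline is to derive it from the time the request has left. The sketch below assumes a 5s cap (from the plan above) and a hypothetical 0.5s reserve for building the error response; the function name and the reserve value are illustrative.

```python
INBOUND_TIMEOUT_S = 30.0     # gateway timeout from the incident context
STRIPE_TIMEOUT_CAP_S = 5.0   # proposed cap from the mitigation plan
RESPONSE_RESERVE_S = 0.5     # assumed time kept back to send an error response


def downstream_timeout(request_started_at: float, now: float) -> float:
    """Return the timeout to use for the Stripe call, or 0.0 if the budget is spent."""
    remaining = INBOUND_TIMEOUT_S - (now - request_started_at) - RESPONSE_RESERVE_S
    return max(0.0, min(STRIPE_TIMEOUT_CAP_S, remaining))


print(downstream_timeout(0.0, 1.0))   # fresh request: capped at 5s
print(downstream_timeout(0.0, 27.0))  # request that queued for 27s: 2.5s left
print(downstream_timeout(0.0, 30.0))  # deadline already passed: skip the call
```

Both arguments would come from the same monotonic clock, for example `time.monotonic()`. A result of 0.0 means the caller should fail the request without calling Stripe at all, since the gateway will have timed out before any response could be returned.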
Customer impact
Approximately 78% of checkout attempts are failing with HTTP 504 errors. Users see a timeout or error page after waiting up to 30 seconds. Estimated 3000 failed checkouts in 5 minutes, resulting in $180k lost GMV. No data loss; payments are not processed. ETA depends on Stripe recovery.
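The impact figures above imply a loss rate that may help with the wait-versus-disable decision. The arithmetic below uses only the reported numbers and assumes the failure rate and order value were roughly constant over the 5-minute window, which the source does not confirm.

```python
failed_checkouts = 3000      # reported failed checkouts
window_minutes = 5           # reported window
lost_gmv_usd = 180_000       # reported estimated lost GMV
failure_rate = 0.78          # share of checkout attempts failing

avg_order_value = lost_gmv_usd / failed_checkouts        # implied GMV per failed checkout
gmv_loss_per_minute = lost_gmv_usd / window_minutes      # implied burn rate
total_attempts = failed_checkouts / failure_rate         # implied attempts in the window
attempts_per_second = total_attempts / (window_minutes * 60)

print(f"implied average order value: ${avg_order_value:.0f}")
print(f"implied GMV loss: ${gmv_loss_per_minute:,.0f} per minute")
print(f"implied checkout attempts: {attempts_per_second:.1f} per second")
```

Under these assumptions the outage costs about $36k of GMV per minute at roughly 13 checkout attempts per second.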
Postmortem draft
Summary
[FILL IN: 2-3 sentence summary of incident]
Timeline (UTC)
- 18:35 - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1"
- 18:38 - Pager triggered for checkout-svc high error rate
- 18:41 - Confirmed Stripe is the culprit via APM and status page
- 18:43 - Debate: wait vs disable Stripe path
- [FILL IN: mitigation actions and resolution time]
Impact
- Checkout success rate dropped from 99.5% to 22%
- p99 latency 28s (30s timeout)
- ~3000 failed checkouts, $180k lost GMV
- Duration: [FILL IN: start to end time]
Root Cause
Stripe Connect API latency spike (p99 27s vs 800ms baseline) caused thread pool exhaustion in checkout-svc due to lack of circuit breaker and bulkhead isolation.
Detection
- Pager triggered by error rate alert (threshold >1% for 5 min)
- Stripe status page provided external confirmation
Response
- [FILL IN: actions taken, e.g., enabled circuit breaker, increased thread pool]
- [FILL IN: communication with Stripe support]
What Went Well
- Quick identification of Stripe as root cause via APM and status page
- Team communication clear
What Went Poorly
- No circuit breaker or bulkhead on Stripe client
- Timeout too long (30s) causing queue buildup
- No automated failover to alternative payment provider
Action Items
- [FILL IN: implement circuit breaker and bulkhead for Stripe]
- [FILL IN: reduce Stripe timeout to 5s]
- [FILL IN: add alert on thread pool utilization >80%]
- [FILL IN: consider fallback payment provider]
Follow-ups
- P0: Implement circuit breaker on Stripe client with configurable threshold and half-open recovery (owner: service owner)
- P0: Add bulkhead isolation for Stripe calls to separate thread pool (owner: service owner)
- P1: Reduce Stripe call timeout from 30s to 5s (owner: service owner)
- P1: Add alert on thread pool utilization >80% and queue depth >1000 (owner: on-call SRE)
- P2: Evaluate fallback payment provider for critical path (owner: platform team)
- P1: Update incident runbook for Stripe dependency failures with circuit breaker steps (owner: on-call SRE)
- P2: Review Stripe API retry policy and ensure exponential backoff with jitter (owner: service owner)
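For the retry-policy follow-up, capped exponential backoff with "full jitter" can be sketched as below. The base delay, cap, and function name are illustrative assumptions, not values taken from the Stripe client or from checkout-svc.

```python
import random


def backoff_delays(max_retries, base_s=0.2, cap_s=5.0, rng=random.random):
    """Yield one sleep duration per retry: capped exponential backoff with full jitter."""
    for attempt in range(max_retries):
        # The ceiling doubles on each attempt until it reaches the cap.
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling) so clients do not retry in lockstep.
        yield rng() * ceiling


for i, delay in enumerate(backoff_delays(5), start=1):
    print(f"retry {i}: sleep {delay:.2f}s")
```

Randomizing each delay spreads retries from many clients over time, so a recovering dependency is not hit by synchronized bursts. The total time spent retrying should still fit inside the request's overall timeout.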
Similar past incidents
lexical match (pg_trgm)
- 75%: [Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 46%: [Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 44%: [Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 35%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 33%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts