[Eval][v3][en] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:58:49 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
Checkout success rate dropped from 99.5% to 22% due to upstream Stripe Connect API latency spike (p99 27s), causing thread pool exhaustion and 30s timeouts. Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC. No circuit breaker or bulkhead on Stripe calls, leading to full thread pool saturation and cascading failures.
Severity reasoning: User-facing outage: error rate > 1% for >5 min (78% failure rate for 5+ minutes), revenue path broken (failed checkouts ~3000, lost GMV $180k). Matches SEV1 criteria.
deepseek-chat·prompt v3·output: en·12615ms·2002↑ / 1892↓ tok·$0.00262
Root cause hypotheses
- highStripe Connect API latency spike causing thread pool exhaustion
Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; thread pool utilization 100%; success rate 22%.
- highNo circuit breaker on Stripe client allows unbounded retries
Evidence: Config review: no circuit breaker or bulkhead; Stripe calls share main worker thread pool.
- mediumInbound request queue overflow causing additional timeouts
Evidence: Queue depth 4200 (limit 5000), nearing capacity; could amplify failures.
- highStripe API degradation due to regional issue in us-east-1
Evidence: Stripe status page: 'Investigating elevated latency for Connect endpoints in us-east-1' posted 18:35 UTC.
- lowApplication-level bug causing excessive Stripe calls per checkout
Evidence: No evidence in context; normal call pattern assumed.
Investigation checklist
- Confirm Stripe latency via APM and logs
kubectl logs -n prod -l app=checkout-svc --since=20m | grep -iE 'stripe|payment_intent|timeout' | tail -50Expected: Logs showing Stripe calls taking >20s or timing out
- Check thread pool and queue metrics
curl -s http://checkout-svc:8080/metrics | grep -E 'thread_pool|queue_depth'Expected: thread_pool_utilization 1.0, queue_depth ~4200
- Verify Stripe status page
curl -s https://status.stripe.com/api/v1/incidents | jq '.incidents[] | select(.status != "resolved")'Expected: Active incident for Connect endpoints in us-east-1
- Check downstream dependencies (auth-svc, fraud-svc) latency
kubectl logs -n prod -l app=checkout-svc --since=20m | grep -iE 'auth|fraud' | tail -20Expected: Normal latency (<1s)
- Review circuit breaker configuration
kubectl exec -n prod deploy/checkout-svc -- cat /app/config.yaml | grep -A5 'stripe'Expected: No circuit breaker or bulkhead configured
Mitigation plan
Enable circuit breaker for Stripe calls with fallback to fail-fast (e.g., return 503 immediately after 5s timeout)
Risk: May cause partial checkout failures but prevents thread pool exhaustion; safe to rollback
Rollback: Disable circuit breaker via feature flag or config reload
Increase thread pool size temporarily (e.g., from 200 to 400) to absorb latency spike
Risk: May increase memory pressure; could mask underlying issue
Rollback: Revert thread pool size to 200
Implement bulkhead for Stripe calls (dedicated thread pool of 50 threads)
Risk: May limit throughput to Stripe; safe to rollback
Rollback: Remove bulkhead config
If Stripe degradation persists >15 min, disable Stripe payment path and fall back to alternative provider or manual processing
Risk: Lost revenue from Stripe payments; requires manual intervention
Rollback: Re-enable Stripe path after Stripe resolves
Customer impact
Approximately 78% of checkout attempts are failing with a 504 error. Users see a timeout or error page after waiting up to 30 seconds. Estimated 3,000 failed checkouts in the last 5 minutes, resulting in approximately $180k lost revenue. No data loss; payments are not processed.
Postmortem draft
Summary
Checkout service experienced a severe degradation from 18:35 to [FILL IN] UTC due to upstream Stripe Connect API latency spike. Lack of circuit breaker and bulkhead led to thread pool exhaustion and cascading failures.
Timeline (UTC)
- 18:35 - Stripe status page reports elevated latency for Connect endpoints in us-east-1
- 18:38 - Pager alert for checkout-svc high error rate
- 18:40 - Success rate drops to 22%, p99 latency 28s
- 18:41 - Confirmed Stripe as culprit via APM and status page
- 18:43 - Team debating mitigation options
- [FILL IN] - Mitigation applied (circuit breaker enabled)
- [FILL IN] - Service recovered
Impact
- Error rate: 78% of checkout requests failed
- Failed checkouts: ~3000 in 5 minutes
- Lost GMV: ~$180k
- Affected users: all users attempting checkout
Root Cause
Upstream Stripe Connect API latency spike (p99 27s) combined with lack of circuit breaker and bulkhead in checkout-svc. All worker threads blocked on Stripe calls, causing thread pool exhaustion and 30s timeouts.
Detection
Alert from pager at 18:38 based on error rate threshold. Stripe status page confirmed external issue.
Response
- 18:38: Pager acknowledged
- 18:41: Root cause identified
- [FILL IN]: Mitigation deployed
What Went Well
- Quick identification of upstream dependency
- Stripe status page provided confirmation
What Went Poorly
- No circuit breaker or bulkhead for critical dependency
- No automated fallback to alternative payment provider
- Thread pool exhaustion caused full outage
Action Items
- [ ] Add circuit breaker to Stripe client (P0)
- [ ] Implement bulkhead for Stripe calls (P0)
- [ ] Set up synthetic monitoring for Stripe latency (P1)
- [ ] Document runbook for Stripe degradation (P1)
- [ ] Evaluate alternative payment provider for failover (P2)
Follow-ups
- P0Add circuit breaker to Stripe client with fallback— checkout-svc team
- P0Implement bulkhead for Stripe calls (dedicated thread pool)— checkout-svc team
- P1Set up synthetic monitoring for Stripe latency and alerting— platform team
- P1Document runbook for Stripe degradation with step-by-step mitigation— on-call SRE
- P2Evaluate alternative payment provider for failover— payments-platform
- P2Review thread pool sizing and queue limits for checkout-svc— checkout-svc team
Similar past incidents
lexical match (pg_trgm)
- 54%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 54%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 48%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 48%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 31%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.