[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:58:15 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
checkout-svc 因下游 Stripe Connect API 延迟飙升导致线程池耗尽,78% 的结账请求超时失败,p99 延迟 28s,成功率降至 22%。
Severity reasoning: 用户面结账功能中断,错误率 >1%(实际 78%),收入路径断裂(估计损失 $180k),符合 SEV1 标准。
deepseek-chat·prompt v2·output: zh·8017ms·2162↑ / 1222↓ tok·$0.00193
Root cause hypotheses
- highStripe Connect API 在 us-east-1 区域出现延迟问题,导致 checkout-svc 所有调用超时。
Evidence: APM 显示 Stripe /v1/payment_intents p99 从 800ms 升至 27s;Stripe 状态页确认延迟问题。
- highcheckout-svc 未配置断路器或 bulkhead,导致 Stripe 延迟耗尽所有工作线程。
Evidence: 线程池利用率 100%,队列深度 4200,且无断路器或 bulkhead 配置。
- lowcheckout-svc 自身代码缺陷导致连接泄漏或资源未释放。
Evidence: 无直接证据,但线程池完全阻塞可能由资源泄漏加剧。
Investigation checklist
- 确认 Stripe 延迟是否持续
curl -w '%{time_total}' -o /dev/null -s https://api.stripe.com/v1/payment_intents -H 'Authorization: Bearer <token>'Expected: 响应时间 <1s 表示恢复,>10s 表示问题持续
- 检查线程池和队列状态
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threaddump | grep -E 'pool-.*thread' | head -20Expected: 大部分线程状态为 BLOCKED 或 WAITING
- 检查 Stripe 客户端连接池
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/metrics/http.client.requests | jq '.measurements[] | select(.statistic == "COUNT")'Expected: 大量未完成请求
- 检查是否有其他下游依赖异常
kubectl logs -n prod -l app=checkout-svc --since=30m | grep -E 'auth-svc|fraud-svc' | tail -20Expected: 无超时或错误日志
Mitigation plan
启用断路器:在 Stripe 客户端配置断路器,失败率达到 50% 时快速失败,避免线程池耗尽。
Risk: 断路器打开后,部分请求会立即失败,但可保护系统整体。
Rollback: 回滚断路器配置,恢复为直连 Stripe。
增加线程池大小:将 worker 线程从 200 临时增加到 400,以处理积压请求。
Risk: 可能增加资源竞争,但可缓解当前阻塞。
Rollback: 将线程池大小恢复为 200。
降级 Stripe 支付:临时禁用 Stripe 支付路径,改为仅支持其他支付方式(如 PayPal)。
Risk: 部分用户无法使用 Stripe 支付,但可恢复结账功能。
Rollback: 重新启用 Stripe 支付路径。
Customer impact
约 78% 的结账请求失败,影响约 3000 笔交易,估计损失 $180k。用户看到 HTTP 504 超时错误。
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
[FILL IN]
Impact
[FILL IN]
Root Cause
[FILL IN]
Detection
[FILL IN]
Response
[FILL IN]
What Went Well
[FILL IN]
What Went Poorly
[FILL IN]
Action Items
[FILL IN]
Follow-ups
- P0为所有下游依赖添加断路器(Resilience4j)— service owner
- P0为 Stripe 调用添加 bulkhead 隔离— service owner
- P1设置 Stripe 状态页监控告警— platform team
- P1增加线程池队列监控告警— on-call SRE
- P2评估 Stripe 超时时间是否应小于网关超时— service owner
Similar past incidents
lexical match (pg_trgm)
- 74%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 70%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 54%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 47%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 24%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts