[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:56:28 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
checkout-svc 因 Stripe Connect API 在 us-east-1 区域出现高延迟,导致线程池耗尽,78% 的结账请求超时失败,成功率从 99.5% 骤降至 22%,p99 延迟达到 28 秒,已造成约 $180k GMV 损失。
Severity reasoning: 用户可见的结账失败,直接影响收入($180k 损失),且影响范围广(78% 请求),属于 SEV1。
deepseek-chat·prompt v1·output: zh·8523ms·1781↑ / 1418↓ tok·$0.00204
Root cause hypotheses
- highStripe Connect API 在 us-east-1 区域出现延迟峰值,导致 checkout-svc 的 HTTP 调用阻塞线程池。
Evidence: APM 显示 Stripe /v1/payment_intents p99 从 800ms 升至 27s;Stripe 状态页确认 us-east-1 区域 Connect 端点延迟升高。
- highcheckout-svc 未对 Stripe 调用设置熔断器或隔离机制,导致单个下游故障耗尽所有工作线程。
Evidence: 配置中 Stripe 超时 30s,无 circuit breaker,无 bulkhead,线程池 200 个线程全部阻塞。
- mediumStripe 调用超时设置过长(30s),与网关超时一致,导致请求堆积在队列中。
Evidence: 入站队列深度 4200(限制 5000),p99 延迟 28s 接近超时。
Investigation checklist
- 确认 Stripe 状态页最新更新
curl -s https://status.stripe.com | grep -i 'connect'Expected: 显示 'Investigating' 或 'Resolved' 状态
- 检查 checkout-svc 线程池和队列指标
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'thread_pool|queue_depth'Expected: thread_pool_utilization 100%,queue_depth 接近 5000
- 查看 APM 中 Stripe 调用的延迟分布
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.details.stripe'Expected: 显示 p99 延迟 > 20s
- 检查是否有其他下游依赖异常
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.details'Expected: auth-svc 和 fraud-svc 状态正常
Mitigation plan
立即启用 Stripe 客户端的熔断器,设置失败阈值(例如 50% 请求失败时熔断 30 秒),并降低超时到 5 秒。
Risk: 熔断期间所有 Stripe 支付失败,但可避免线程池耗尽,保留部分结账能力。
Rollback: 回滚配置:恢复超时到 30s,关闭熔断器。
如果 Stripe 故障持续,临时禁用 Stripe 支付路径,仅保留其他支付方式(如 PayPal)。
Risk: 完全失去 Stripe 支付能力,但可恢复结账成功率。
Rollback: 重新启用 Stripe 路径,并确认 Stripe 状态已恢复。
Customer impact
约 78% 的结账尝试失败,用户看到 HTTP 504 错误,无法完成支付。过去 5 分钟影响约 3000 笔交易,预计损失 $180k。
Postmortem draft
事后复盘
摘要
checkout-svc 因 Stripe Connect API 延迟导致线程池耗尽,结账成功率从 99.5% 降至 22%。
时间线
- 18:35 UTC: Stripe 状态页报告 us-east-1 Connect 延迟
- 18:40 UTC: checkout-svc 开始返回 504
- 18:41 UTC: 确认 Stripe 为根因
- 18:43 UTC: 讨论缓解方案
影响
- 失败交易 ~3000 笔
- 损失 GMV ~$180k
根因
Stripe Connect API 延迟导致线程池耗尽,缺乏熔断和隔离机制。
做得好的
- 快速识别 Stripe 故障
- APM 数据清晰
做得不好的
- 无熔断器
- 无线程池隔离
行动项
- 为 Stripe 客户端添加熔断器
- 实现 bulkhead 隔离
- 降低 Stripe 超时时间
Follow-ups
- P0为 Stripe 客户端添加熔断器(Hystrix/Resilience4j)— service owner
- P0实现线程池隔离(bulkhead)将 Stripe 调用与主线程池分离— service owner
- P1降低 Stripe 调用超时到 5 秒,并设置合理的重试策略— service owner
- P1添加 Stripe 状态页监控告警— platform team
- P2审查所有下游依赖的熔断和隔离配置— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 83%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 74%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 53%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 46%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 23%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts