[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svc · created: 5/25/2026, 10:22:18 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.
Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)
Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal
Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead: Stripe calls share the main worker thread pool
Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)
On-call:
- 18:38 - pager
- 18:41 - confirmed Stripe is the culprit (status page + APM)
- 18:43 - debating: wait it out vs disable Stripe path entirely
Summary
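The checkout success rate of checkout-svc dropped from 99.5% to 22%, p99 latency reached 28s (close to the 30s timeout), and the thread pool was exhausted. The root cause is a latency spike in the downstream Stripe Connect API in us-east-1, which left all worker threads blocked.
Severity reasoning: The user-visible checkout failure rate is 78%, above the 1% threshold and sustained for more than 5 minutes; the revenue path (Stripe payments) is effectively down, with an estimated GMV loss of $180k so far; this meets the SEV1 definition.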
deepseek-chat·prompt v2·output: zh·9672ms
Root cause hypotheses
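- high: A latency spike in the Stripe Connect API in us-east-1 causes checkout-svc's HTTP calls to Stripe to time out, filling the thread pool.
Evidence: APM shows p99 latency for Stripe /v1/payment_intents rising from 800ms to 27s; the Stripe status page confirms a latency problem in us-east-1.
- high: The Stripe client in checkout-svc has no circuit breaker, so every request waits for the full timeout and worker threads are never released.
Evidence: No circuit breaker in the config; thread pool utilization is 100%; queue depth is 4200.
- low: The checkout-svc thread pool is too small to absorb a traffic burst, although it is sufficient under normal conditions.
Evidence: The pool has 200 threads and low utilization under normal load; the current blocking is caused by Stripe latency, not by an undersized pool.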
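A back-of-the-envelope estimate shows why the latency spike alone is enough to exhaust the pool. The sketch below applies Little's law (concurrency = throughput × time held) to the pool size and the Stripe p99 figures from the incident context. Using p99 as the per-request hold time is an assumption made for illustration; the real mean hold time is not given in the report, so treat the numbers as rough bounds.

```python
def max_throughput(workers: int, hold_time_s: float) -> float:
    """Upper bound on requests/second when every request holds one
    worker thread for hold_time_s seconds (Little's law: L = lambda * W)."""
    return workers / hold_time_s


WORKERS = 200  # worker threads in the shared pool (from the incident context)

# Assumption: Stripe p99 is used as a stand-in for per-request hold time.
normal = max_throughput(WORKERS, 0.8)     # p99 before the incident: 800ms
degraded = max_throughput(WORKERS, 27.0)  # p99 during the incident: 27s

print(f"normal capacity:   ~{normal:.0f} req/s")
print(f"degraded capacity: ~{degraded:.1f} req/s")
print(f"capacity drop:     ~{normal / degraded:.0f}x")
```

Under these assumptions the same 200 threads can sustain only a small fraction of their normal request rate, which is consistent with the low-confidence rating of the "pool too small" hypothesis: no realistic pool size absorbs a slowdown of that magnitude.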
Investigation checklist
- Check whether the Stripe status page has been updated
curl -s https://status.stripe.com/ | grep -i 'us-east-1'
Expected: shows an 'Investigating' or 'Resolved' status
- Check the current thread pool and queue state of checkout-svc
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threaddump | grep -E 'pool-.*thread' | head -20
Expected: all threads shown in BLOCKED or WAITING state
- Check the Stripe call timeout config
kubectl exec -n prod deploy/checkout-svc -- cat /app/config/application.yml | grep -A5 'stripe'
Expected: shows timeout: 30000ms and no circuit breaker config
- Check that the other downstream services are healthy
kubectl logs -n prod -l app=auth-svc --since=10m | grep -c 'error'
Expected: error count close to 0
Mitigation plan
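Enable a circuit breaker on the Stripe client so calls fail fast instead of waiting for the timeout.
Risk: Once the breaker opens, all Stripe requests fail immediately, which may make checkout fully unavailable, but it protects the thread pool.
Rollback: Revert the config and remove the circuit breaker settings.
Isolate Stripe calls in a dedicated thread pool (bulkhead) so they cannot exhaust the main worker pool.
Risk: Requires a service restart or a dynamic thread pool change, which may briefly affect requests.
Rollback: Restore the shared thread pool config.
Temporarily disable the Stripe payment path and return a friendly error message.
Risk: All Stripe payments fail, but other payment methods (if any) remain available.
Rollback: Re-enable the Stripe path.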
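To make the fail-fast behaviour of the first mitigation concrete, here is a minimal circuit breaker sketch. It is not the checkout-svc implementation and not the Resilience4j API; the class name, the consecutive-failure threshold, and the cool-down value are all hypothetical, and the sketch is not thread-safe. It only illustrates the closed, open, and half-open states.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Opens after N consecutive failures; probes again after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock          # injectable so the cool-down can be tested
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                # Open: fail fast so the worker thread is freed immediately.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed (half-open): let this call through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                # Trip the breaker, or re-open it after a failed probe.
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap the outbound payment call, for example `breaker.call(create_payment_intent, order)` with a hypothetical `create_payment_intent`, and map `CircuitOpenError` to the friendly error response. The stated risk applies directly: while the breaker is open, every Stripe-backed checkout fails immediately.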
Customer impact
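About 78% of checkout attempts are failing, which amounts to roughly 3000 failed checkouts in 5 minutes and an estimated $180k in lost GMV so far. Users see HTTP 504 timeout errors.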
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
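- 18:35 - Stripe status page reports elevated latency for Connect endpoints in us-east-1
- 18:38 - On-call paged
- 18:40 - checkout-svc starts returning 504s; success rate drops to 22%
- 18:41 - Stripe confirmed as the root cause (status page + APM)
- 18:43 - Mitigation options under discussion (wait it out vs disable the Stripe path)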
Impact
- Checkout success rate dropped from 99.5% to 22%
- p99 latency 28s
- About 3000 failed orders, $180k in lost GMV
Root Cause
A latency spike in the Stripe Connect API in us-east-1 caused checkout-svc's HTTP calls to time out and exhausted the thread pool.
Detection
- Monitoring alerts fired (success rate, latency, thread pool)
- Confirmed via the Stripe status page
Response
[FILL IN]
What Went Well
- Root cause identified quickly
- Status page information was timely
What Went Poorly
- No circuit breaker protection
- Thread pool not isolated
Action Items
- [ ] Add a circuit breaker to the Stripe client
- [ ] Implement bulkhead isolation
- [ ] Add Stripe latency alerting
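The bulkhead action item can be sketched as a small dedicated pool with a hard cap on in-flight Stripe calls. This is an illustration under assumptions, not the service's code: the `Bulkhead` class, its cap, and the thread name prefix are hypothetical. The idea is that once the cap is reached, further Stripe calls are rejected immediately rather than queued, so slow Stripe responses can tie up at most a bounded number of threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when the dedicated pool has no free slot."""


class Bulkhead:
    """Runs calls on a dedicated pool and caps how many are in flight."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent,
                                        thread_name_prefix="stripe")

    def submit(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in the dedicated pool")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the call finishes, whether it succeeded or not.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

A request handler would call `bulkhead.submit(...)` and then wait on the returned future with its own timeout, treating `BulkheadFullError` as an immediate failure. With a cap well below the 200 main workers, the remaining threads stay available for requests that do not depend on Stripe.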
Follow-ups
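- P0: Add a circuit breaker to the Stripe client (e.g. Resilience4j) (service owner)
- P0: Implement thread pool isolation (bulkhead) (service owner)
- P1: Add a Stripe API latency alert (p99 > 5s) (on-call SRE)
- P1: Review circuit breaker configuration for all downstream dependencies (platform team)
- P2: Update the runbook to include Stripe failure handling steps (on-call SRE)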
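The P1 latency alert (p99 > 5s) reduces to a percentile check over a window of recent Stripe call latencies. The sketch below uses the nearest-rank definition of p99; the function names and the idea of passing a plain list of samples are assumptions for illustration. In practice this condition would more likely live in the monitoring system's rule language than in application code.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) computed with integers to avoid float rounding.
    rank = (99 * len(ordered) + 99) // 100  # 1-based rank
    return ordered[rank - 1]


def should_alert(samples_ms, threshold_ms=5000):
    """True when the window's p99 exceeds the threshold (5s by default)."""
    return bool(samples_ms) and p99(samples_ms) > threshold_ms
```

With a 5s threshold, a window where only a single call out of 100 is slow does not fire, while a window where 10% of calls take 27s does. Window length and evaluation interval are not specified in the follow-up item and would need to be chosen by the owner.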
Similar past incidents
lexical match (pg_trgm)
- 66%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 48%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 44%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 23%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 22%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
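The percentages above come from a lexical match labelled pg_trgm. As a rough illustration of how that kind of score is derived, the sketch below approximates trigram similarity: each word is lowercased, padded with two leading spaces and one trailing space, split into three-character windows, and the score is the number of shared trigrams divided by the number of distinct trigrams across both strings. Treating only ASCII letters and digits as word characters is a simplifying assumption, and the exact fields the tool compares are not stated, so this will not necessarily reproduce the listed percentages.

```python
import re


def trigrams(text: str) -> set:
    """Trigram set of a string: lowercase, split on non-alphanumerics,
    pad each word with two leading spaces and one trailing space."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Because the score is purely lexical, incidents with near-identical titles rank highest regardless of language tag or prompt version, which matches the ordering in the list.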