[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:38:11 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
checkout-svc 因依赖的 Stripe Connect API 在 us-east-1 区域出现高延迟,导致线程池耗尽,78% 的结账请求超时失败。已确认 Stripe 状态页面报告问题,当前正在评估缓解措施。
Severity reasoning: 用户可见的结账失败率达到 78%,直接影响收入(估计损失 $180k),且无降级路径,符合 SEV1 标准。
deepseek-chat·prompt v1·output: zh·9258ms·1781↑ / 1406↓ tok·$0.00203
Root cause hypotheses
- highStripe Connect API 在 us-east-1 区域出现延迟峰值,导致 checkout-svc 的 HTTP 调用超时。
Evidence: APM 显示 Stripe /v1/payment_intents 的 p99 延迟从 800ms 升至 27s,接近 30s 超时;Stripe 状态页面确认该问题。
- highcheckout-svc 的线程池被 Stripe 调用阻塞,无隔离机制导致所有请求排队。
Evidence: 线程池利用率 100%,入站队列深度 4200,所有 200 个线程均被 Stripe 调用阻塞。
- mediumStripe 客户端缺少熔断器,导致故障时仍持续发送请求,加剧线程池耗尽。
Evidence: 配置中未设置 circuit breaker,Stripe 调用超时设为 30s,与入站超时一致,无快速失败机制。
Investigation checklist
- 确认 Stripe 状态页面最新更新
curl -s https://status.stripe.com/api/v1/incidents | jq '.incidents[] | select(.status != "resolved")'Expected: 显示 us-east-1 Connect 延迟问题仍在调查中,无预计解决时间。
- 检查 checkout-svc 当前线程池和队列深度
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/actuator/threaddump | grep -E 'pool-.*thread' | head -20Expected: 显示所有线程状态为 BLOCKED 或 WAITING,且队列深度持续增长。
- 验证其他下游依赖是否正常
kubectl exec deploy/checkout-svc -- curl -s -o /dev/null -w '%{http_code}' http://auth-svc:8080/healthExpected: 返回 200,确认 auth-svc 正常。
- 检查 Stripe 客户端配置,确认无熔断器
kubectl exec deploy/checkout-svc -- cat /app/config/stripe.yaml | grep -E 'timeout|circuit|bulkhead'Expected: 仅显示 timeout: 30s,无 circuitBreaker 或 bulkhead 配置。
Mitigation plan
立即启用 Stripe 调用的熔断器,设置超时 5s 和失败阈值 50%,快速失败以保护线程池。
Risk: 熔断后所有 Stripe 支付将失败,但可避免线程池耗尽,恢复其他结账路径(如非 Stripe 支付)。
Rollback: 回滚配置,删除熔断器设置,恢复原超时 30s。
如果 Stripe 问题持续,临时禁用 Stripe 支付路径,将流量切换到备用支付提供商(如 PayPal)。
Risk: 备用支付提供商可能未充分测试,或容量不足;需确认功能开关可用。
Rollback: 重新启用 Stripe 路径,恢复原流量分配。
Customer impact
约 78% 的结账请求失败,用户看到 HTTP 504 超时错误。过去 5 分钟影响约 3000 笔交易,估计损失 GMV $180k。已确认 Stripe 是根本原因,无 ETA。
Postmortem draft
事后复盘
摘要
[填写摘要]
时间线
- 18:40 UTC: checkout-svc 开始返回 504
- 18:41: 确认 Stripe 状态页面问题
- 18:43: 讨论缓解措施
影响
- 结账成功率从 99.5% 降至 22%
- 损失 GMV $180k
根因
Stripe Connect API 延迟飙升导致线程池耗尽。
做得好的
- 快速识别 Stripe 为根因
- 状态页面确认及时
做得不好的
- 无熔断器或 bulkhead 隔离
- 无降级路径
行动项
- 为 Stripe 客户端添加熔断器
- 实现 bulkhead 隔离
- 添加备用支付提供商
Follow-ups
- P0为 Stripe 客户端添加熔断器(Resilience4j)并配置超时 5s— service owner
- P0实现线程池隔离(bulkhead)将 Stripe 调用与主线程池分离— service owner
- P1添加备用支付提供商(如 PayPal)并测试故障切换流程— platform team
- P1监控 Stripe 状态页面并自动触发告警— on-call SRE
- P2更新 runbook 包含 Stripe 故障处理步骤— service owner
Similar past incidents
lexical match (pg_trgm)
- 81%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 67%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 50%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 44%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 23%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts