[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:56:41 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
checkout-svc 因 Stripe Connect API 在 us-east-1 区域出现高延迟,导致线程池耗尽,78% 的结账请求超时失败,p99 延迟飙升至 28s,成功率从 99.5% 骤降至 22%。
Severity reasoning: 用户可见的结账失败率高达 78%,直接影响收入(估计已损失 $180k),且无降级路径,符合 SEV1 标准。
deepseek-chat·prompt v1·output: zh·8457ms·1781↑ / 1266↓ tok·$0.00187
Root cause hypotheses
- highStripe Connect API 在 us-east-1 区域出现延迟峰值,导致 checkout-svc 的 HTTP 调用超时,线程被阻塞等待响应。
Evidence: APM 显示 Stripe /v1/payment_intents 的 p99 延迟从 800ms 升至 27s,Stripe 状态页确认 us-east-1 区域问题。
- highcheckout-svc 未对 Stripe 调用设置熔断器或隔离机制,导致所有工作线程被慢调用耗尽。
Evidence: 线程池利用率 100%,入站队列深度 4200,且配置中无 circuit breaker 或 bulkhead。
- mediumStripe 客户端连接池配置不足,导致连接等待加剧延迟。
Evidence: 无直接证据,但常见于高并发场景,需检查连接池配置。
Investigation checklist
- 确认 Stripe 调用延迟是否仍高
curl -w '%{time_total}' -o /dev/null -s https://api.stripe.com/v1/payment_intents -H 'Authorization: Bearer <token>'Expected: 若 time_total 仍 > 5s,说明 Stripe 问题未恢复
- 检查线程池状态
curl -s http://checkout-svc:8080/actuator/threaddump | grep 'pool-' | head -20Expected: 大部分线程状态为 BLOCKED 或 WAITING
- 检查入站队列深度
curl -s http://checkout-svc:8080/actuator/metrics/checkout.queue.depth | jq '.measurements[0].value'Expected: 值接近 5000(队列上限)
- 检查 Stripe 客户端连接池配置
kubectl exec deploy/checkout-svc -- cat /app/config/application.yml | grep -A5 'stripe'Expected: 确认 maxConnections 和 timeout 设置
Mitigation plan
启用 Stripe 支付路径的熔断器,快速失败而非阻塞线程
Risk: 部分用户支付失败,但可降级为其他支付方式(如 PayPal)
Rollback: 移除熔断器配置,恢复原始调用
将 Stripe 调用迁移至独立线程池(bulkhead),隔离影响
Risk: 需重启服务,可能短暂中断
Rollback: 回滚部署,恢复共享线程池
Customer impact
约 78% 的结账请求失败,用户看到 504 超时错误。过去 5 分钟约 3000 笔交易失败,估计损失 $180k。当前无 ETA,取决于 Stripe 恢复时间。
Postmortem draft
事后复盘:checkout-svc 结账失败
摘要
时间线
- 18:40 UTC: 告警触发
- 18:41: 确认 Stripe 问题
- 18:43: 讨论缓解方案
影响
- 成功率 22%,p99 28s
- 损失 $180k GMV
根因
- Stripe Connect API 延迟飙升
- 缺乏熔断器和隔离机制
做得好的
- 快速识别外部依赖问题
做得不好的
- 未配置熔断器
- 共享线程池无隔离
行动项
- [ ] 为 Stripe 添加熔断器
- [ ] 实现 bulkhead 隔离
- [ ] 增加 Stripe 延迟告警
Follow-ups
- P0为 Stripe 客户端添加熔断器(如 resilience4j)— service owner
- P0实现 bulkhead 隔离,Stripe 调用使用独立线程池— service owner
- P1增加 Stripe API 延迟告警(阈值 5s)— on-call SRE
- P1审查所有外部依赖的熔断和隔离配置— platform team
- P2更新 runbook 记录 Stripe 降级步骤— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 82%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 77%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 52%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 46%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 24%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts