[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:56:14 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
checkout-svc 因依赖的 Stripe Connect API 延迟飙升(p99 从 800ms 升至 27s)导致线程池耗尽,78% 的结账请求超时失败,影响约 3000 笔交易,损失约 $180k GMV。
Severity reasoning: 用户可见的结账失败率达 78%,直接导致收入损失($180k/5min),且影响范围广(所有使用 Stripe 支付的用户),属于 SEV1。
deepseek-chat·prompt v1·output: zh·10106ms·1781↑ / 1616↓ tok·$0.00226
Root cause hypotheses
- highStripe Connect API 在 us-east-1 区域出现延迟问题,导致 checkout-svc 的 HTTP 调用超时。
Evidence: Stripe 状态页面于 18:35 UTC 发布 'Investigating elevated latency for Connect endpoints in us-east-1';APM 显示 Stripe /v1/payment_intents 的 p99 延迟从 800ms 升至 27s。
- highcheckout-svc 未配置断路器或舱壁隔离,Stripe 调用阻塞了所有工作线程。
Evidence: 线程池利用率 100%(200 个线程全部阻塞),入站队列深度 4200(接近限制 5000),且 Stripe 调用超时设为 30s 与入站超时一致,无回退机制。
- mediumStripe 客户端连接池耗尽或 DNS 解析缓慢,加剧了延迟。
Evidence: 无直接证据,但线程阻塞时连接池可能被占满;可检查连接池指标。
Investigation checklist
- 确认 Stripe 状态页面最新更新
curl -s https://status.stripe.com/api/v2/incidents/current.json | jq '.incident.status'Expected: 返回 'investigating' 或 'identified',确认 Stripe 仍在处理中
- 检查 checkout-svc 线程转储,确认阻塞线程
kubectl exec deploy/checkout-svc -- jstack $(pgrep -f checkout-svc) | grep -A 20 'BLOCKED' | head -100Expected: 大量线程处于 BLOCKED 状态,等待 Stripe HTTP 调用完成
- 查看 Stripe 客户端连接池指标
curl -s http://checkout-svc:8080/metrics | grep -E 'stripe_connections|http_client_active'Expected: active_connections 接近最大值,或连接池耗尽
- 检查入站队列深度和超时配置
kubectl exec deploy/checkout-svc -- curl -s localhost:8080/actuator/health | jq '.details.queueDepth'Expected: 队列深度接近 5000 限制,确认背压
- 验证其他下游服务是否正常
kubectl exec deploy/checkout-svc -- curl -s -o /dev/null -w '%{http_code}' http://auth-svc:8080/healthExpected: 返回 200,确认 auth-svc 正常
Mitigation plan
立即启用 Stripe 客户端的断路器,设置失败阈值(例如 5 秒内 50% 请求失败则断开),并配置快速失败回退(返回友好错误或重试队列)。
Risk: 断路器误触发可能导致正常流量也被拒绝;但当前 Stripe 已降级,风险可控。
Rollback: 回滚配置:将断路器阈值调高或禁用断路器,恢复原始配置。
临时将 Stripe 调用超时从 30s 降低到 5s,减少线程阻塞时间。
Risk: 部分正常慢请求可能被提前超时,但当前 Stripe 延迟极高,5s 超时可快速释放线程。
Rollback: 恢复超时配置为 30s。
如果 Stripe 问题持续,考虑将支付降级为仅支持非 Stripe 支付方式(如 PayPal),或在前端隐藏 Stripe 选项。
Risk: 功能降级影响用户体验,但可保住部分收入。
Rollback: 重新启用 Stripe 支付选项。
Customer impact
约 78% 的结账尝试失败,影响约 3000 笔交易,预计损失 $180k GMV。用户看到 HTTP 504 错误,无法完成购买。当前无 ETA,等待 Stripe 修复。
Postmortem draft
事后复盘
摘要
[2-3 句概述]
时间线
- 18:35 UTC: Stripe 状态页面报告延迟问题
- 18:40 UTC: checkout-svc 开始返回 504,告警触发
- 18:41 UTC: 确认 Stripe 为根因
- [添加缓解时间]
影响
- 失败结账数: ~3000
- 损失 GMV: ~$180k
- 持续时间: [待补充]
根因
Stripe Connect API 延迟飙升导致线程池耗尽。
做得好的
- 快速识别 Stripe 问题
- 状态页面确认
做得不好的
- 无断路器或舱壁
- 超时配置未与下游匹配
行动项
- [ ] 为 Stripe 客户端添加断路器
- [ ] 实现舱壁隔离
- [ ] 调整超时策略
- [ ] 添加降级支付路径
Follow-ups
- P0为 Stripe 客户端添加断路器(如 Hystrix 或 Resilience4j)— checkout-svc 负责人
- P0实现线程池舱壁隔离,将 Stripe 调用分离到独立线程池— checkout-svc 负责人
- P1调整 Stripe 调用超时策略,根据 SLA 设置更短超时并加入重试— checkout-svc 负责人
- P1添加支付降级路径,当 Stripe 不可用时自动切换备用支付— 平台团队
- P2监控 Stripe 状态页面并集成告警— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 79%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 62%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 55%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 49%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 24%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts