[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:57:44 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
checkout-svc 因依赖的 Stripe Connect API 延迟飙升(p99 从 800ms 升至 27s)导致线程池耗尽,成功率从 99.5% 骤降至 22%,p99 延迟达 28s,约 3000 笔交易失败,预估损失 GMV $180k。当前 Stripe 状态页面已确认问题,但无 ETA。
Severity reasoning: 用户面故障:成功率降至 22%(<99%),且收入路径中断(支付失败),符合 SEV1 定义。
deepseek-chat·prompt v2·output: zh·11301ms·2162↑ / 1686↓ tok·$0.00244
Root cause hypotheses
- highStripe Connect API 在 us-east-1 区域出现延迟飙升,导致 checkout-svc 所有调用超时(30s),线程池被占满。
Evidence: APM 显示 Stripe /v1/payment_intents p99 从 800ms 升至 27s;Stripe 状态页面确认问题。
- highcheckout-svc 无断路器或 bulkhead 隔离,Stripe 故障直接耗尽共享线程池。
Evidence: 配置中无断路器,Stripe 调用与主线程池共享,线程池利用率 100%。
- mediumStripe 调用超时设置(30s)与网关超时一致,导致请求在内部排队,加剧线程阻塞。
Evidence: 入站队列深度 4200(上限 5000),p99 28s 接近超时。
- lowcheckout-svc 自身代码缺陷导致重试风暴,放大 Stripe 故障影响。
Evidence: 当前无重试日志证据,但成功率 22% 表明大量失败可能触发重试。
- lowKubernetes 节点资源争用导致 checkout-svc 性能下降。
Evidence: 其他下游服务(auth-svc, fraud-svc)正常,排除节点问题。
Investigation checklist
- 确认 Stripe 故障范围
curl -s https://status.stripe.com/api/v2/incidents/current | jq '.incident'Expected: 返回当前 incident 信息,确认 us-east-1 Connect 延迟问题
- 检查线程池和队列深度
kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/actuator/threadpool | jq '.threadPool.utilization, .queueDepth'Expected: utilization=100%, queueDepth 接近 5000
- 查看 Stripe 调用耗时分布
kubectl logs -n prod -l app=checkout-svc --since=10m | grep 'stripe' | awk '{print $NF}' | sort -n | tail -10Expected: 耗时接近 30s 的条目
- 检查是否有重试逻辑
kubectl logs -n prod -l app=checkout-svc --since=10m | grep -i 'retry' | head -20Expected: 无重试日志或少量重试
- 验证其他下游服务正常
kubectl logs -n prod -l app=checkout-svc --since=10m | grep -E 'auth-svc|fraud-svc' | awk '{print $NF}' | sort -n | tail -5Expected: 延迟正常(<100ms)
Mitigation plan
启用断路器:在 Stripe 客户端配置断路器,失败率达到 50% 时快速失败,避免线程池耗尽。
Risk: 断路器打开后,所有 Stripe 调用立即失败,可能导致支付功能完全不可用,但可保护系统。
Rollback: 回滚断路器配置,恢复为直连 Stripe。
实施 bulkhead:为 Stripe 调用分配独立线程池(如 20 线程),隔离故障。
Risk: 若 Stripe 恢复正常但独立线程池过小,可能限制吞吐量。
Rollback: 恢复为共享线程池配置。
降低 Stripe 超时时间至 5s,配合重试和快速失败。
Risk: 正常延迟可能偶尔超过 5s,导致误判失败,需调整重试策略。
Rollback: 恢复超时时间为 30s。
Customer impact
约 78% 的结账尝试失败,影响约 3000 笔交易,预估损失 $180k。用户看到 HTTP 504 错误或支付超时。当前无 ETA,依赖 Stripe 修复。
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 18:35 - Stripe 状态页面报告 us-east-1 Connect 延迟问题
- 18:38 - 收到告警,checkout-svc 成功率下降
- 18:41 - 确认 Stripe 为根因
- 18:43 - 讨论缓解方案
Impact
- 成功率从 99.5% 降至 22%
- p99 延迟 28s
- 约 3000 笔交易失败,损失 $180k
Root Cause
Stripe Connect API 延迟飙升导致 checkout-svc 线程池耗尽。
Detection
告警系统触发,人工确认 Stripe 状态页面。
Response
[FILL IN]
What Went Well
- 快速识别根因
- 状态页面信息及时
What Went Poorly
- 无断路器或 bulkhead
- 超时设置过长
Action Items
- [ ] 为 Stripe 客户端添加断路器
- [ ] 实施 bulkhead 隔离
- [ ] 调整超时策略
Follow-ups
- P0为 Stripe 客户端添加断路器(Resilience4j)— checkout-svc 团队
- P0实施 bulkhead 隔离 Stripe 调用线程池— checkout-svc 团队
- P1调整 Stripe 超时时间至 5s,并配置重试— checkout-svc 团队
- P1增加 Stripe 状态页面自动告警集成— 平台团队
- P2审查所有下游依赖的故障隔离策略— SRE 团队
Similar past incidents
lexical match (pg_trgm)
- 69%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 62%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 50%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 47%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 23%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts