[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
service: checkout-svccreated: 6/2/2026, 11:38:43 PM
Raw incident context
Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts. Symptoms: - p99 latency: 28s (hitting our 30s gateway timeout) - Success rate: 22% (down from 99.5%) - Thread pool utilization: 100% (all 200 worker threads blocked) - Inbound queue depth: 4200 requests (queue limit 5000) Downstream calls (from APM): - Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC) - All other downstream deps (auth-svc, fraud-svc) normal Our config: - Stripe call timeout: 30s (matches inbound) - No circuit breaker on Stripe client - No bulkhead — Stripe calls share the main worker thread pool Customer impact: - Failed checkouts: ~3000 in 5min - Estimated lost GMV: $180k (so far) On-call: - 18:38 — pager - 18:41 — confirmed Stripe is the culprit (status page + APM) - 18:43 — debating: wait it out vs disable Stripe path entirely
Summary
checkout-svc 因下游 Stripe Connect API 延迟飙升导致线程池耗尽,78% 的结账请求超时失败,p99 延迟 28s,成功率降至 22%。
Severity reasoning: 用户面结账功能中断,错误率 >1%(实际 78%)超过 5 分钟,且收入链路(GMV)受损,符合 SEV1 标准。
deepseek-chat·prompt v2·output: zh·10260ms·2162↑ / 1689↓ tok·$0.00244
Root cause hypotheses
- highStripe Connect API 在 us-east-1 区域出现延迟问题,导致 checkout-svc 调用超时。
Evidence: Stripe 状态页面于 18:35 UTC 发布 'Investigating elevated latency for Connect endpoints in us-east-1';APM 显示 /v1/payment_intents 的 p99 从 800ms 升至 27s。
- highcheckout-svc 未配置断路器或 bulkhead,Stripe 调用阻塞了所有 200 个工作线程,导致请求排队。
Evidence: 线程池利用率 100%,入站队列深度 4200(接近上限 5000),且无断路器保护。
- mediumStripe 超时设置(30s)与网关超时一致,导致无缓冲余地,请求堆积。
Evidence: 配置中 Stripe 调用超时 30s,与 inbound 网关超时相同,无时间余量。
- lowcheckout-svc 自身存在资源泄漏或 GC 问题,加剧了线程阻塞。
Evidence: 当前无直接证据,但线程池 100% 且队列增长,需检查 JVM 堆和 GC 日志。
Investigation checklist
- 确认 Stripe 状态页面最新更新
curl -s https://status.stripe.com/api/v2/incidents/current.json | jq '.incident.status'Expected: 返回 'monitoring' 或 'resolved' 表示 Stripe 已修复;若为 'investigating' 则问题持续。
- 检查 checkout-svc 线程转储,确认线程阻塞在 Stripe 调用上
kubectl exec -n prod deployment/checkout-svc -- jstack $(pgrep -f checkout-svc) | grep -A 20 'stripe' | head -100Expected: 大量线程处于 BLOCKED 或 WAITING 状态,堆栈指向 Stripe HTTP 调用。
- 检查 Stripe 客户端连接池状态
kubectl exec -n prod deployment/checkout-svc -- curl -s localhost:8080/actuator/metrics/http.client.requests | jq '.measurements[] | select(.statistic=="MAX")'Expected: 连接池耗尽,MAX 接近配置上限。
- 检查入站队列深度和丢弃率
kubectl exec -n prod deployment/checkout-svc -- curl -s localhost:8080/actuator/metrics/checkout.queue.depth | jq '.measurements'Expected: 队列深度接近 5000,可能有请求被丢弃。
- 检查其他下游依赖是否正常
kubectl logs -n prod -l app=checkout-svc --since=30m | grep -E 'auth-svc|fraud-svc' | tail -20Expected: auth-svc 和 fraud-svc 的调用延迟正常(<100ms),无错误。
Mitigation plan
启用 Stripe 客户端的断路器,快速失败而非阻塞线程。
Risk: 断路器打开后,所有 Stripe 调用立即失败,可能导致结账功能完全不可用,但可保护系统。
Rollback: 回滚断路器配置,恢复为直接调用。
将 Stripe 调用迁移到独立线程池(bulkhead),隔离故障。
Risk: 需要重启服务或动态调整线程池,可能短暂影响可用性。
Rollback: 恢复为共享线程池配置。
降低 Stripe 调用超时时间(如 5s),避免线程长时间阻塞。
Risk: 超时缩短可能导致更多请求失败,但可释放线程。
Rollback: 恢复超时时间为 30s。
Customer impact
约 78% 的结账请求失败,影响约 3000 笔交易,预计损失 GMV 18 万美元。用户看到 504 超时错误,无法完成购买。
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 18:35 - Stripe 状态页面报告 Connect API 延迟问题
- 18:38 - Pager 触发
- 18:40 - 确认 checkout-svc 大量 504 错误
- 18:41 - 确认 Stripe 为根因
- 18:43 - 讨论缓解方案
Impact
- 结账成功率从 99.5% 降至 22%
- p99 延迟 28s
- 约 3000 笔交易失败,损失 GMV $180k
Root Cause
Stripe Connect API 在 us-east-1 区域出现延迟,checkout-svc 未配置断路器或 bulkhead,导致线程池耗尽,请求超时失败。
Detection
Pager 告警触发,结合 APM 和 Stripe 状态页面确认。
Response
[FILL IN]
What Went Well
- 快速识别 Stripe 为根因
- 状态页面信息及时
What Went Poorly
- 缺乏断路器保护
- 无 bulkhead 隔离
- 超时设置无缓冲
Action Items
- [ ] 为 Stripe 客户端添加断路器
- [ ] 实现 bulkhead 隔离
- [ ] 调整超时策略
Follow-ups
- P0为 Stripe 客户端添加断路器(如 Hystrix 或 Resilience4j)— service owner
- P0实现 bulkhead 模式,隔离 Stripe 调用线程池— service owner
- P1调整 Stripe 调用超时时间,留出缓冲— service owner
- P1增加对下游依赖延迟的告警— platform team
- P2演练断路器故障切换流程— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 74%
[Eval][v1][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 70%
[Eval][v2][zh] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 54%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 47%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 24%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts