payment-svc DB connection storm
service: payment-svc
created: 5/25/2026, 9:07:28 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s. Error rate climbed from 0.1% to 12% (mostly 500s). Logs show: "connection refused" from payment-svc -> postgres-primary. Postgres CPU normal (35%), but active_connections = 500 / 500. Recent change: payment-svc v2.41 deployed 13:50 UTC, added a new batch job. On-call notes: customers reporting failed checkouts; CS queue spiking.
Summary
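At 14:02 UTC, payment-svc p99 latency jumped from 120ms to 4.8s and the error rate rose from 0.1% to 12% (mostly 500s), causing customer checkouts to fail. The most likely root cause is that the new batch job introduced in the v2.41 deployment exhausted all available PostgreSQL connections (500/500), so new requests were refused.
Severity reasoning: Customer-visible checkout failures (the payment path is broken) with an error rate above 1% for more than 5 minutes meet the SEV1 criteria.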
deepseek-chat·prompt v2·output: zh·10427ms
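The severity reasoning above boils down to a small rule. The following is a minimal sketch of that rule, not an official severity policy: the `IncidentSignal` type, the `classify_severity` name, and the "needs-triage" fallback are hypothetical, and only the thresholds (error rate above 1% for more than 5 minutes on a customer-visible path) come from the summary.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignal:
    user_visible: bool              # customers can see the failure (e.g. failed checkouts)
    error_rate: float               # fraction of failing requests, 0.12 == 12%
    minutes_above_threshold: float  # how long the error rate has stayed above the threshold


def classify_severity(sig, error_threshold=0.01, min_minutes=5.0):
    """Return "SEV1" when the rule from the summary is met, else "needs-triage"."""
    sustained = sig.minutes_above_threshold > min_minutes
    if sig.user_visible and sig.error_rate > error_threshold and sustained:
        return "SEV1"
    # Hypothetical fallback: anything else is left for a human to grade.
    return "needs-triage"
```

For this incident the inputs would be `user_visible=True` and `error_rate=0.12`; the duration has to come from monitoring, since the report does not state it.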
Root cause hypotheses
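- high: The batch job added in payment-svc v2.41 is holding all database connections, so regular requests cannot obtain one.
Evidence: Logs show "connection refused" from payment-svc to postgres-primary, and active_connections = 500/500. The 13:50 UTC deployment lines up with the onset at 14:02 UTC.
- medium: A database connection leak; connections are not being released correctly.
Evidence: Connections are at the 500 limit while CPU is normal (35%), which suggests connections may be held open without doing work.
- low: postgres-primary failure or a network partition.
Evidence: CPU is normal and there are no network error logs, so this is unlikely.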
Investigation checklist
- Check the connection state between payment-svc and postgres-primary
kubectl exec -n prod deploy/payment-svc -- pgrep -a batch-job || echo 'no batch job found'; kubectl logs -n prod -l app=payment-svc --since=15m | grep -i 'connection refused' | head -20
Expected: The batch job process is running, and the logs contain a large number of "connection refused" lines.
- Check current active PostgreSQL connections and wait events
psql -h postgres-primary -U postgres -c "SELECT pid, application_name, client_addr, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;"
Expected: Many queries from payment-svc are in the active state, possibly all issued by the batch job.
- Check whether the batch job's queries are long-running
psql -h postgres-primary -U postgres -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%' ORDER BY duration DESC LIMIT 10;"
Expected: The batch job's queries have long run times and may be unoptimized.
- Review what changed in the payment-svc deployment
kubectl rollout history deployment/payment-svc -n prod | tail -5; kubectl diff -f /tmp/v2.41-manifest.yaml; [ $? -gt 1 ] && echo 'kubectl diff failed'
Expected: v2.41 added the batch job configuration. Note that kubectl diff exits with 1 when differences are found and with a value above 1 only on error.
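To make the checklist results easier to read, the pg_stat_activity rows can be grouped by client and state. This is a minimal sketch that assumes the rows were already fetched as `(application_name, state)` tuples, for example from `SELECT application_name, state FROM pg_stat_activity`. The function name and the application names in the usage note are hypothetical, and the grouping only helps if the batch job sets its own `application_name`, which the report does not confirm.

```python
from collections import Counter


def connections_by_client(rows):
    """Count connections per (application_name, state), largest group first.

    rows: iterable of (application_name, state) tuples; either value may be None.
    """
    counts = Counter()
    for app, state in rows:
        # Backends without an application name or state are bucketed explicitly.
        counts[(app or "<unset>", state or "<none>")] += 1
    return counts.most_common()
```

If a single client, such as a hypothetical `payment-batch`, accounts for most of the 500 connections, that supports the high-confidence hypothesis. A large share of `idle in transaction` connections would instead point toward the leak hypothesis.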
Mitigation plan
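Immediately roll payment-svc back to v2.40, which removes the batch job.
Risk: Connections may drop briefly during the rollback, but the impact is smaller than that of the ongoing outage.
Rollback: Redeploy v2.41 (only after the connection exhaustion problem is fixed).
Temporarily raise PostgreSQL max_connections to 1000 and restart the database to apply the setting.
Risk: More connections can increase memory pressure, though service can be restored quickly. The restart causes a brief period of unavailability.
Rollback: Restore max_connections to 500 and restart.
Limit the number of concurrent connections the batch job may use in payment-svc, for example with a connection pool cap.
Risk: Requires a code change, so it cannot take effect immediately.
Rollback: Revert the code change.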
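The third mitigation, capping the batch job's concurrent connections, can be illustrated with a semaphore around connection acquisition. This is a minimal sketch under stated assumptions, not payment-svc's actual code: `BoundedConnections`, `FakeConn`, and `run_batch` are hypothetical names, `FakeConn` stands in for a real database connection, and the cap of 3 in the usage note is arbitrary. Many database drivers and pool libraries offer an equivalent maximum-size setting that would be preferable in practice.

```python
import threading
import time
from contextlib import contextmanager


class BoundedConnections:
    """Caps how many connections one job may hold at the same time."""

    def __init__(self, connect, max_connections):
        self._connect = connect  # zero-argument callable returning an object with close()
        self._slots = threading.BoundedSemaphore(max_connections)

    @contextmanager
    def connection(self, timeout=None):
        # Wait for a free slot instead of opening yet another connection.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free connection slot for the batch job")
        try:
            conn = self._connect()
            try:
                yield conn
            finally:
                conn.close()  # always close, even if the caller's work fails
        finally:
            self._slots.release()


class FakeConn:
    """Stand-in for a real database connection; records the peak number open."""
    lock = threading.Lock()
    open_now = 0
    peak = 0

    def __init__(self):
        with FakeConn.lock:
            FakeConn.open_now += 1
            FakeConn.peak = max(FakeConn.peak, FakeConn.open_now)

    def close(self):
        with FakeConn.lock:
            FakeConn.open_now -= 1


def run_batch(pool, tasks=20):
    """Run `tasks` worker threads that each borrow one connection briefly."""
    def work():
        with pool.connection():
            time.sleep(0.01)  # simulated query

    threads = [threading.Thread(target=work) for _ in range(tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `run_batch(BoundedConnections(FakeConn, max_connections=3))` runs 20 workers while never holding more than 3 connections at once, so the batch job slows down instead of starving request traffic.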
Customer impact
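Customers are unable to complete checkout because payments fail. Any customer attempting checkout can be affected (about 12% of requests are returning errors), and the CS queue is backing up. Recovery is expected within 5 minutes of the rollback.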
Postmortem draft
Summary
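The batch job introduced in the payment-svc v2.41 deployment exhausted the available database connections, causing p99 latency to spike, the error rate to rise, and customer checkouts to fail.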
Timeline (UTC)
- 13:50 - payment-svc v2.41 deployed
- 14:02 - Monitoring alerts fire; p99 latency 4.8s, error rate 12%
- [FILL IN] - Investigation begins
- [FILL IN] - Rolled back to v2.40
- [FILL IN] - Service recovered
Impact
- Customer checkouts failed for approximately [FILL IN] minutes
- Error rate 12%, p99 latency 4.8s
Root Cause
The batch job added in payment-svc v2.41 placed no limit on its database connections and occupied all 500 of them, so regular requests were refused.
Detection
Monitoring alerts on latency and error rate fired.
Response
- Roll back the deployment
- Raise the database connection limit
- Limit batch job concurrency
What Went Well
- Monitoring alerted promptly
- The rollback was quick
What Went Poorly
- The change was not load-tested against the connection limit
- The batch job had no resource limits
Action Items
- [FILL IN] Add connection pool limits
- [FILL IN] Strengthen the change review process
Follow-ups
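- P0: Add a connection pool limit to payment-svc so that a single job cannot exhaust connections. (owner: service owner)
- P1: Add monitoring and alerting on database connection count, with a warning as usage approaches the limit. (owner: platform team)
- P1: Review the change process used for v2.41 to ensure batch jobs are performance-tested. (owner: service owner)
- P2: Add a connection pool load-test step to the deployment pipeline. (owner: platform team)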
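The P1 alerting follow-up can be expressed as a simple usage-ratio check. This is a minimal sketch: the function name and the 80% and 95% thresholds are hypothetical placeholders, and the real values would need to be chosen by the platform team.

```python
def connection_alert(active, max_connections, warn_ratio=0.8, crit_ratio=0.95):
    """Classify connection usage as "ok", "warning", or "critical"."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage = active / max_connections
    if usage >= crit_ratio:
        return "critical"
    if usage >= warn_ratio:
        return "warning"
    return "ok"
```

With the values from this incident, `connection_alert(500, 500)` returns "critical". A warning at a lower ratio is what would give on-call time to act before requests start being refused.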
Similar past incidents
lexical match (pg_trgm)
- 48%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 48%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 26%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 25%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 25%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts