[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:49:05 PM
Raw incident context
Time: 14:02 UTC.
payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC: payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03: CS reports failed checkout volume rising
- 14:04: Pager: PaymentSvcErrorRateHigh
- 14:06: "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
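At 14:02 UTC, payment-svc p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12% (HTTP 500), with customers reporting failed checkouts. The root cause is the scheduled job added in the v2.41 deploy (a full-table-scan query executed every 30 seconds), which exhausted the PostgreSQL connection limit (max_connections=500) so that new connections were refused.
Severity reasoning: User-visible checkout failures (error rate >1% for more than 5 minutes) on a revenue path meet the SEV1 criteria.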
deepseek-chat·prompt v2·output: zh·9247ms·2150↑ / 1402↓ tok·$0.00212
Root cause hypotheses
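- high: The scheduled job added in payment-svc v2.41 (running SELECT * FROM ledger_entries WHERE status='pending' every 30 seconds) performs a full table scan, holds database connections, and exhausts the connection pool.
Evidence: The 13:50 UTC deploy time lines up with the onset of the problem; the slow query log shows this query running every 30 seconds at about 2.4 seconds per execution; active_connections reached the limit of 500.
- medium: Database connection leak: the application does not close connections properly, so the connection count keeps growing.
Evidence: active_connections reached the limit, but there is no direct evidence that connections are not being released; CPU is normal (35%), which rules out a compute bottleneck.
- low: A surge in external traffic raised the concurrency of normal queries.
Evidence: No traffic surge was reported; the rise in error rate correlates strongly with the deploy time rather than with a change in traffic pattern.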
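The high-confidence hypothesis depends on how often the new job actually runs. The changelog quotes the cron expression */30 * * * *, while the slow query log reports a 30-second interval. Standard five-field cron has minute granularity, so it may be worth confirming which schedule is really in effect. A minimal sketch of how the minute field expands, assuming standard five-field cron semantics; the helper names are hypothetical and only "*", "*/n", and comma lists are handled:

```python
def expand_minute_field(field: str) -> list[int]:
    """Expand a cron minute field written as '*', '*/n', or 'a,b,c'."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(0, 60, step))
    return sorted(int(part) for part in field.split(","))


def describe(expr: str) -> str:
    """Summarise a five-field cron expression (minute field only)."""
    minute, hour, day_of_month, month, day_of_week = expr.split()
    minutes = expand_minute_field(minute)
    if (hour, day_of_month, month, day_of_week) == ("*", "*", "*", "*"):
        return f"runs {len(minutes)}x per hour at minutes {minutes}"
    return "schedule has hour/day restrictions; not handled by this sketch"


# Expression quoted in the v2.41 changelog.
print(describe("*/30 * * * *"))  # runs 2x per hour at minutes [0, 30]
```

If the job is scheduled by something other than cron (for example an in-process timer), the observed 30-second interval would come from that scheduler instead, which is a detail the changelog alone does not settle.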
Investigation checklist
- Confirm whether the scheduled job exhausted the connection pool
kubectl logs -n prod -l app=payment-svc --since=15m | grep -E "too many clients|connection refused" | tail -20
Expected: A large number of connection-refused errors that line up with the scheduled job's run times
- Inspect the slow query details
SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' AND query LIKE '%ledger_entries%' ORDER BY query_start;
Expected: Several instances of this query in the active state, with query_start values about 30 seconds apart
- Verify the deploy change
kubectl rollout history deployment/payment-svc -n prod | grep v2.41
Expected: Shows v2.41 deployed at 13:50
- Check the database connection count
SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';
Expected: Close to 500
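The slow-query check assumes the query scans the whole table because status has no index. The effect of such an index on the query plan can be illustrated locally. This is a minimal sketch using Python's built-in sqlite3 as a stand-in, not PostgreSQL: the table layout, row counts, and index name are assumptions, and the real check would be EXPLAIN against postgres-primary.

```python
import sqlite3

# In-memory stand-in for ledger_entries; the columns are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger_entries (id INTEGER PRIMARY KEY, status TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO ledger_entries (status, amount) VALUES (?, ?)",
    [("pending" if i % 100 == 0 else "settled", i) for i in range(10_000)],
)

QUERY = "SELECT * FROM ledger_entries WHERE status = 'pending'"


def plan(connection: sqlite3.Connection, sql: str) -> str:
    """Return the human-readable detail column of SQLite's query plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = plan(conn, QUERY)  # expected to be a full scan of the table
conn.execute("CREATE INDEX idx_ledger_entries_status ON ledger_entries (status)")
plan_after = plan(conn, QUERY)   # expected to search via the new index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording differs between SQLite versions and from PostgreSQL's EXPLAIN output, so only the scan-versus-index distinction carries over.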
Mitigation plan
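Immediately roll back payment-svc to v2.40 (the last stable version)
Risk: There may be a brief interruption (about 1 minute) during the rollback, but service can be restored.
Rollback: Run kubectl rollout undo deployment/payment-svc -n prod --to-revision=<previous-revision>
Temporarily raise the database max_connections to 1000 (if still needed after the rollback)
Risk: More connections consume more memory, and changing max_connections only takes effect after a PostgreSQL restart, which itself interrupts service. Normal CPU (35%) says little about memory headroom, so check memory before applying.
Rollback: ALTER SYSTEM SET max_connections = 500; then restart PostgreSQL (SELECT pg_reload_conf() alone does not apply this parameter).
Terminate the blocking slow queries to free connections
Risk: Terminating a query rolls back its transaction, but there is no risk of data loss.
Rollback: No direct rollback, but the query can be re-run.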
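For the query-termination step, one way to limit the blast radius is to pick target backends explicitly instead of killing everything that is active. A minimal sketch, assuming rows shaped like the pg_stat_activity columns used in the checklist (pid, query_start, state, query); the helper names, the 2-second threshold, and the sample rows are hypothetical. pg_terminate_backend is the standard PostgreSQL function for ending a backend.

```python
from datetime import datetime, timedelta, timezone


def backends_to_terminate(rows, now, min_runtime=timedelta(seconds=2),
                          table="ledger_entries"):
    """Pick pids of active queries on `table` running for at least `min_runtime`."""
    pids = []
    for row in rows:
        running_for = now - row["query_start"]
        if (row["state"] == "active" and table in row["query"]
                and running_for >= min_runtime):
            pids.append(row["pid"])
    return pids


def terminate_statements(pids):
    """Render one pg_terminate_backend call per pid, for review before running."""
    return [f"SELECT pg_terminate_backend({int(pid)});" for pid in pids]


# Made-up sample rows, for illustration only.
now = datetime(2026, 1, 1, 14, 6, 0, tzinfo=timezone.utc)
sample = [
    {"pid": 101, "state": "active", "query_start": now - timedelta(seconds=5),
     "query": "SELECT * FROM ledger_entries WHERE status='pending'"},
    {"pid": 102, "state": "active", "query_start": now - timedelta(seconds=10),
     "query": "SELECT 1"},
    {"pid": 103, "state": "idle", "query_start": now - timedelta(seconds=60),
     "query": "SELECT * FROM ledger_entries WHERE status='pending'"},
]
selected = backends_to_terminate(sample, now)
print(terminate_statements(selected))  # ['SELECT pg_terminate_backend(101);']
```

Printing the statements instead of executing them keeps a human in the loop, which seems prudent on a payments database.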
Customer impact
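Users encountered failures (HTTP 500) at checkout, affecting an estimated 12% of checkout requests. The deploy has been rolled back and service is recovering.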
Postmortem draft
Summary
Timeline (UTC)
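- 13:50: payment-svc v2.41 deployed, adding the scheduled job
- 14:02: p99 latency rose from 120ms to 4.8s; error rate rose to 12%
- 14:03: CS reported failed checkouts
- 14:04: Pager fired
- 14:06: Database initially misjudged as healthy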
Impact
Checkout failures affecting about 12% of requests, lasting about FILL IN minutes
Root Cause
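The scheduled job added in v2.41 (a full table scan of ledger_entries every 30 seconds) exhausted the database connection pool, so new connections were refused.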
Detection
A monitoring alert (error rate >5%) fired, and CS reported user-facing problems.
Response
Rolled back the deploy, terminated the slow queries, and raised the connection limit.
What Went Well
The link to the deploy change was identified quickly.
What Went Poorly
Code review did not catch the full table scan; there was no alerting on connection pool usage.
Action Items
Follow-ups
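- P0: Add an index on ledger_entries.status (Database team)
- P0: Review the v2.41 scheduled job code; optimize the query or add rate limiting (Payment service team)
- P1: Add an alert on database connection pool utilization (>80%) (Observability team)
- P1: Add a slow-query scan check to the deploy pipeline (Platform engineering team)
- P2: Update the postmortem document and schedule a review meeting (On-call SRE)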
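The connection pool alert in the follow-ups comes down to a simple ratio check. A minimal sketch of that condition, assuming the alert compares active connections against max_connections with the 80% threshold named above; the function names are hypothetical and a real alert would live in the monitoring system's own rule language.

```python
def pool_utilization(active_connections: int, max_connections: int) -> float:
    """Fraction of the connection limit currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active_connections / max_connections


def should_alert(active_connections: int, max_connections: int,
                 threshold: float = 0.80) -> bool:
    """True when utilization is strictly above the threshold."""
    return pool_utilization(active_connections, max_connections) > threshold


print(should_alert(500, 500))  # True: the values seen during this incident
print(should_alert(400, 500))  # False: exactly 80% is not above the threshold
```

With this condition in place, the alert would have been firing well before the limit of 500 was reached, since it trips from 401 connections upward.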
Similar past incidents
lexical match (pg_trgm)
- 97%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 82%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%
payment-svc DB connection storm
p99 latency 4.8s, 12% 500s, checkouts failing
- 46%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts