[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svc · created: 6/2/2026, 11:47:54 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
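At 14:02 UTC, payment-svc p99 latency spiked from 120ms to 4.8s, the error rate rose to 12%, and users reported failed checkouts. The likely root cause is the scheduled job introduced by the v2.41 deploy (scheduled every 30 minutes, running a full table scan on ledger_entries), which exhausted the PostgreSQL connection limit (max_connections=500) so that new connections were rejected.
Severity reasoning: user-visible checkout failures (revenue impact), a 12% error rate, broad scope of impact, and no quick rollback path (recovery requires a code rollback or manually terminating queries).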
deepseek-chat·prompt v1·output: zh·8967ms·1769↑ / 1445↓ tok·$0.00207
Root cause hypotheses
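- high: The scheduled job added in v2.41 (cron: */30 * * * *) runs SELECT * FROM ledger_entries WHERE status='pending' as a full table scan. With no index on status, each execution takes about 2.4s, occupying a connection and slowing other queries.
Evidence: the slow query log shows this query running every 30s as a full scan over ~12M rows; the deploy time lines up with the onset of the problem. Note that the cron expression */30 * * * * fires every 30 minutes, not every 30 seconds, so the observed 30s cadence still needs an explanation.
- high: Connection exhaustion causes new connections to be rejected, and application-level retries intensify the contention for connections.
Evidence: logs show 'too many clients already' and 'connection refused'; active_connections reached the 500 limit.
- low: A database CPU or IO bottleneck is slowing queries, but with CPU at only 35% this is unlikely.
Evidence: CPU at 35% is normal; no IO wait metrics were reported.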
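A rough capacity check may help weigh the first two hypotheses against each other. The sketch below applies Little's law (average concurrency = arrival rate × time in system) to the figures from the incident context. The function name and the assumption that each run holds exactly one connection for the duration of the query are illustrative, not taken from payment-svc.

```python
def avg_connections_held(query_seconds: float, interval_seconds: float,
                         runners: int = 1) -> float:
    """Average connections occupied by a periodic query (Little's law).

    Assumes each run holds exactly one connection for `query_seconds`
    and that `runners` independent processes each start one run per
    `interval_seconds`. Both are simplifying assumptions.
    """
    if query_seconds < 0 or interval_seconds <= 0 or runners < 0:
        raise ValueError("durations must be positive and runners non-negative")
    return runners * query_seconds / interval_seconds


# Figures from the incident context: ~2.4 s per run, one run every 30 s.
per_runner = avg_connections_held(2.4, 30.0)

MAX_CONNECTIONS = 500
# How many concurrent runners it would take to occupy every connection
# at this rate, if nothing else held connections.
runners_to_saturate = MAX_CONNECTIONS / per_runner

print(f"average connections per runner: {per_runner:.2f}")
print(f"runners needed to hold {MAX_CONNECTIONS} connections: {runners_to_saturate:.0f}")
```

Under these assumptions a single runner holds well under one connection on average, so the scan by itself would not be expected to fill 500 connections. Retries, connections that are opened but not released, or many instances running the job at once are possibilities worth checking.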
Investigation checklist
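- Confirm whether connections are exhausted
kubectl exec -n prod deploy/payment-svc -- pg_isready -h postgres-primary -U app
Expected: pg_isready only reports whether the server is reachable, so it may still print 'accepting connections' when max_connections is reached. Exhaustion is confirmed when an actual client connection fails with 'too many clients already'.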
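- Inspect current connections and waiting queries
psql -h postgres-primary -U postgres -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;" -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active' AND wait_event IS NOT NULL;"
Expected: total connections close to 500, and a high number of active sessions that are waiting (e.g. 87). Idle sessions also report a wait_event, so the second query filters on state = 'active'.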
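- Check query statistics to confirm the full-table-scan query
psql -h postgres-primary -U postgres -c "SELECT query, calls, total_time, rows FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"
Expected: SELECT * FROM ledger_entries WHERE status='pending' (shown in normalized form) has the highest total time. On PostgreSQL 13 and later the column is named total_exec_time instead of total_time.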
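- Verify the deploy change
kubectl rollout history -n prod deploy/payment-svc
Expected: the most recent revision corresponds to v2.41 (shipped at 13:50 UTC per the deploy history) and includes the settlement batch change. --revision takes a revision number, not 'latest'; pass the number of the newest revision to see its details.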
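When verifying the deploy change, it may also help to confirm what schedule the changelog's cron expression actually encodes, since the slow query log reports a 30s cadence. The sketch below expands only the minute field of a standard five-field cron expression; it is a deliberately minimal parser written for this comparison (supporting `*`, `*/N`, and a single number), not a full cron implementation.

```python
from typing import List


def expand_minute_field(field: str) -> List[int]:
    """Expand a cron minute field of the form '*', '*/N', or 'M'."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    minute = int(field)
    if not 0 <= minute < 60:
        raise ValueError("minute out of range")
    return [minute]


def runs_per_hour(expr: str) -> int:
    """Runs per hour for a five-field cron expression.

    Only meaningful when the hour, day, month, and weekday fields
    are all '*', as in the changelog entry.
    """
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five cron fields")
    return len(expand_minute_field(fields[0]))


CHANGELOG_CRON = "*/30 * * * *"           # from the v2.41 changelog
scheduled = runs_per_hour(CHANGELOG_CRON)  # fires at minutes 0 and 30
observed = 3600 // 30                      # slow query log: one run every 30 s

print(f"scheduled runs per hour: {scheduled}")
print(f"observed runs per hour:  {observed}")
```

A large gap between the scheduled and observed rates would suggest the query is being triggered by something other than the cron schedule alone, for example retries or multiple instances; that is an inference to verify against the v2.41 code.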
Mitigation plan
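- Terminate the blocking queries immediately to free connections: kill the slow queries on PostgreSQL.
Risk: terminating a query rolls back its transaction, but carries no risk of data corruption.
Rollback: if the problem persists after termination, restart the payment-svc pods to reset the connection pool.
- Roll back payment-svc to v2.40 (the previous stable version).
Risk: there may be brief unavailability during the rollback (about 1 minute), but service can be restored.
Rollback: if the rollback fails, redeploy v2.41 and manually disable the cron job.
- Add an index on ledger_entries.status to prevent recurrence.
Risk: a plain CREATE INDEX blocks writes to the table while the index builds; run it off-peak or use CREATE INDEX CONCURRENTLY.
Rollback: if the index causes performance problems, remove it with DROP INDEX.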
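The first mitigation step could be scripted so that only the suspect query is terminated rather than every long-running session. The sketch below filters rows shaped like `pg_stat_activity` output (columns `pid`, `state`, `query`, `query_start`) and builds `pg_terminate_backend` statements for review. The helper names, the 2-second threshold, and the sample rows are all made up for illustration; nothing here connects to a database.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def pids_to_terminate(rows: List[Dict], now: datetime,
                      min_runtime: timedelta = timedelta(seconds=2),
                      needle: str = "ledger_entries") -> List[int]:
    """Pids of active backends running the suspect query for >= min_runtime."""
    return [
        row["pid"]
        for row in rows
        if row["state"] == "active"
        and needle in row["query"]
        and now - row["query_start"] >= min_runtime
    ]


def terminate_statements(pids: List[int]) -> List[str]:
    """SQL statements to review and run manually, one per backend."""
    return [f"SELECT pg_terminate_backend({int(pid)});" for pid in pids]


# Illustrative sample rows, NOT real incident data.
now = datetime(2000, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # placeholder clock
scan = "SELECT * FROM ledger_entries WHERE status='pending'"
rows = [
    {"pid": 101, "state": "active", "query": scan,
     "query_start": now - timedelta(seconds=3)},
    {"pid": 102, "state": "active", "query": "UPDATE payments SET paid = true",
     "query_start": now - timedelta(seconds=10)},
    {"pid": 103, "state": "idle", "query": scan,
     "query_start": now - timedelta(seconds=40)},
    {"pid": 104, "state": "active", "query": scan,
     "query_start": now - timedelta(milliseconds=500)},
]

for statement in terminate_statements(pids_to_terminate(rows, now)):
    print(statement)
```

Printing the statements instead of executing them keeps a human in the loop during the incident; whether that trade-off is right depends on how quickly connections need to be freed.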
Customer impact
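Users hit failures at checkout, with a 12% error rate and p99 latency of 4.8s. The impact lasts until connections are freed or the rollback completes. Recovery is expected within 15 minutes.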
Postmortem draft
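Postmortem
Summary
The v2.41 deploy of payment-svc introduced a scheduled job that exhausted database connections, causing high latency and errors.
Timeline
- 13:50 UTC: v2.41 deployed
- 14:02 UTC: latency spiked and the error rate rose
- 14:04 UTC: alert fired
- 14:06 UTC: database initially misjudged as healthy
- [TBD] time of query termination / rollback
- [TBD] time of recovery
Impact
- Affected users: all users of the payment service
- Error rate: 12%
- Latency: p99 4.8s
- Duration: approximately X minutes
Root cause
The scheduled job added in v2.41 (every 30 minutes) runs an unindexed full table scan, which led to connection exhaustion.
What went well
- The alert fired promptly
- Logs provided clear leads
What went poorly
- Code review did not catch the full-table-scan risk
- The database was initially misjudged as healthy
Action items
- [ ] Add an index on ledger_entries.status
- [ ] Add a check for full table scans to code review
- [ ] Add an alert on connection usage
- [ ] Run performance tests before deploying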
Follow-ups
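- P0: Add an index on ledger_entries.status (owner: database administrator)
- P1: Review the v2.41 code to ensure the scheduled job cannot cause this problem again (owner: service owner)
- P1: Add a connection usage alert (>80%) (owner: platform team)
- P2: Update the deploy process to require performance test results (owner: SRE on-call)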
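The connection usage alert in the follow-ups can be expressed as a simple threshold rule. The sketch below is one possible reading of ">80%", using a strict comparison against max_connections; the function names are hypothetical, and a real alert would normally be defined in the monitoring system rather than in application code.

```python
def connection_usage(connections: int, max_connections: int) -> float:
    """Fraction of max_connections currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return connections / max_connections


def should_alert(connections: int, max_connections: int,
                 threshold: float = 0.80) -> bool:
    """True when usage is strictly above the threshold (the '>80%' rule)."""
    return connection_usage(connections, max_connections) > threshold


# At the time of the incident the metric read 500 / 500.
print(should_alert(500, 500))  # usage 1.0, above the threshold
print(should_alert(400, 500))  # usage exactly 0.8, not strictly above
```

With this rule the alert would have been firing before connections reached the limit, provided usage climbed gradually enough for the monitoring interval to observe it.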
Similar past incidents
lexical match (pg_trgm)
- 84%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 82%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 50%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 49%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 47%: [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts