[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svccreated: 6/2/2026, 11:47:25 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500). Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary. Postgres metrics: - CPU: 35% (normal) - active_connections: 500 / 500 (max_connections) - waiting_queries: 87 - slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution) Deploy history: - 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)" - No infra/DB changes in past 7 days. On-call notes (Slack): - 14:03 — CS reports failed checkout volume rising - 14:04 — Pager: PaymentSvcErrorRateHigh - 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
支付服务 payment-svc 在 14:02 UTC 出现 p99 延迟从 120ms 飙升到 4.8s,错误率升至 12%,客户报告结账失败。根本原因是新部署的 v2.41 引入了一个每 30 秒运行的全表扫描查询,耗尽了 PostgreSQL 连接池(500/500),导致新连接被拒绝。
Severity reasoning: 用户可见的结账失败(12% 错误率),直接影响收入,属于 SEV1。
deepseek-chat·prompt v1·output: zh·9655ms·1769↑ / 1565↓ tok·$0.00220
Root cause hypotheses
- high新部署的 payment-svc v2.41 引入了定时任务,每 30 秒执行全表扫描查询 SELECT * FROM ledger_entries WHERE status='pending',该查询无索引,耗时约 2.4 秒,占用了大量数据库连接,导致连接池耗尽。
Evidence: 部署时间 13:50 UTC 与延迟飙升时间 14:02 UTC 吻合;慢查询日志显示该查询每 30 秒执行一次;active_connections 达到 500 上限。
- low数据库连接泄漏:新代码中可能存在未正确关闭的连接,导致连接数持续增长。
Evidence: 连接数达到上限,但慢查询日志中只有一条新查询,泄漏可能性较低。
- low外部流量突增导致正常查询变慢,但连接池未扩容。
Evidence: 没有流量突增的报告,且 CPU 正常,不支持此假设。
Investigation checklist
- 确认连接池是否已满
kubectl exec -n prod deployment/payment-svc -- pg_isready -h postgres-primary -U appExpected: 返回 'connection refused' 或超时,表明连接池已满。
- 查看慢查询日志,确认全表扫描查询
kubectl exec -n prod deployment/payment-svc -- psql -h postgres-primary -U app -c "SELECT query, calls, total_time, rows FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"Expected: 看到 SELECT * FROM ledger_entries WHERE status='pending' 总耗时最高。
- 检查该查询的执行计划,确认缺少索引
kubectl exec -n prod deployment/payment-svc -- psql -h postgres-primary -U app -c "EXPLAIN ANALYZE SELECT * FROM ledger_entries WHERE status='pending';"Expected: 显示 Seq Scan on ledger_entries,预计行数约 12M,实际耗时约 2.4 秒。
- 查看当前活跃连接数和等待查询数
kubectl exec -n prod deployment/payment-svc -- psql -h postgres-primary -U app -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; SELECT count(*) FROM pg_stat_activity WHERE wait_event IS NOT NULL;"Expected: 活跃连接接近 500,等待查询数 87。
- 检查 deployment 变更历史,确认新版本引入的定时任务
kubectl rollout history deployment/payment-svc -n prodExpected: 显示 revision 2.41 在 13:50 UTC 部署。
Mitigation plan
立即回滚 payment-svc 到上一个稳定版本 v2.40,以移除导致全表扫描的定时任务。
Risk: 回滚期间可能有短暂的服务中断(约 1-2 分钟),但比当前故障影响小。
Rollback: 如果回滚后问题未解决,重新部署 v2.41 并禁用定时任务(通过环境变量或配置)。
在回滚前,临时增加 PostgreSQL max_connections 到 1000 以缓解连接压力,但需注意资源限制。
Risk: 增加连接数可能导致数据库内存压力增大,但 CPU 目前正常,风险可控。
Rollback: 将 max_connections 改回 500 并重启数据库。
在 ledger_entries.status 列上创建索引,以加速查询。
Risk: 创建索引期间可能对写操作有短暂影响,但该表为只读查询,风险低。
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
Customer impact
约 12% 的结账请求失败,用户看到 HTTP 500 错误,无法完成支付。预计在回滚后 5 分钟内恢复。
Postmortem draft
事后复盘:payment-svc 连接池耗尽故障
摘要
[2-3 句概述]
时间线
- 13:50 UTC: 部署 payment-svc v2.41
- 14:02 UTC: p99 延迟飙升,错误率上升
- 14:04 UTC: 触发告警
- [回滚时间]: 回滚至 v2.40
影响
- 受影响用户:所有使用支付服务的用户
- 错误率:12%
- 持续时间:约 [X] 分钟
根本原因
新版本引入的定时任务执行全表扫描,耗尽数据库连接池。
做得好的
- 监控告警及时
- 回滚操作迅速
做得不好的
- 代码审查未发现全表扫描
- 缺乏数据库连接池监控告警
行动项
- [ ] 为 ledger_entries.status 添加索引
- [ ] 添加连接池使用率告警
- [ ] 代码审查要求新查询必须有索引
- [ ] 部署前进行性能测试
Follow-ups
- P0为 ledger_entries.status 列添加索引— 数据库管理员
- P1添加数据库连接池使用率告警(>80%)— 平台团队
- P1代码审查流程增加对全表扫描的检查— 服务所有者
- P2部署前性能测试要求包含慢查询检测— QA 团队
Similar past incidents
lexical match (pg_trgm)
- 78%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 72%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts