[Eval][v3][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svc · created: 6/2/2026, 11:36:18 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC: payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03: CS reports failed checkout volume rising
- 14:04: Pager: PaymentSvcErrorRateHigh
- 14:06: "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
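At 14:02 UTC, payment-svc p99 latency jumped from 120ms to 4.8s and the error rate rose from 0.1% to 12% (HTTP 500); customers reported failed checkouts. The root cause is the scheduled job introduced by the v2.41 deploy, which runs a full-table-scan query every 30 seconds and exhausts the database connection pool.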
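Severity reasoning: User-visible checkout failures (error rate >1% for more than 5 minutes) meet the SEV1 criteria. The 12% error rate is far above the 1% threshold and affects the revenue path.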
deepseek-chat·prompt v3·output: zh·9907ms·2401↑ / 1500↓ tok·$0.00230
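The severity rule quoted above (user-visible failures with an error rate above 1% for more than 5 minutes) can be written down as a small check. This is a minimal sketch, not the team's actual paging logic: the `ErrorWindow` type, the function name, and the "below SEV1" label are assumptions made for illustration, and the incident duration is not stated in the report, so the 10-minute value in the usage note is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorWindow:
    """A period of elevated errors (hypothetical type for illustration)."""
    error_rate: float        # fraction of failed requests, e.g. 0.12 for 12%
    duration_minutes: float  # how long the elevated rate has lasted
    user_visible: bool       # True when end users see the failures


def classify_severity(window, rate_threshold=0.01, min_minutes=5.0):
    """Return "SEV1" when the quoted rule is met, otherwise "below SEV1"."""
    meets_rule = (
        window.user_visible
        and window.error_rate > rate_threshold
        and window.duration_minutes > min_minutes
    )
    return "SEV1" if meets_rule else "below SEV1"
```

For example, `classify_severity(ErrorWindow(0.12, 10, True))` evaluates the 12% error rate from this incident against an assumed 10-minute duration.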
Root cause hypotheses
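- high: The newly deployed scheduled job runs a full table scan every 30 seconds, exhausting the database connection pool
Evidence: Deploy history shows v2.41 shipped at 13:50 UTC with a new cron job (*/30 * * * *). The slow query log shows SELECT * FROM ledger_entries WHERE status='pending' running every 30 seconds at ~2.4 seconds per execution. Connections reached max_connections=500, and application logs report "too many clients already". Note that a standard five-field cron expression */30 * * * * fires every 30 minutes, not every 30 seconds, so the job's actual trigger interval should be confirmed against the 30-second cadence in the slow query log.
- medium: A database connection leak is exhausting the pool
Evidence: Connections hit the 500 limit and Postgres metrics show 87 waiting queries, so connections may not be released promptly. However, there is no direct evidence of a leak, and the new query is the main change.
- low: Database node failure or network problem
Evidence: CPU is normal (35%) and there are no network error logs. The "too many clients already" error is returned by Postgres when max_connections is reached, which points to connection exhaustion rather than a node or network failure.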
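A rough occupancy estimate may help weigh the first two hypotheses. The sketch below applies Little's law (average concurrency = arrival rate × time in system) to the figures from the slow query log: one query every 30 seconds, ~2.4 seconds each. The function names are made up for illustration. The number of payment-svc instances that run the job, and whether each run releases its connection once the query returns, are not stated in the incident context, so `instances` is an assumption.

```python
def average_busy_connections(query_seconds, interval_seconds, instances=1):
    """Average number of connections occupied by the query (Little's law).

    Assumes each run uses one connection only for the duration of the query.
    """
    arrival_rate = instances / interval_seconds  # runs per second
    return arrival_rate * query_seconds


def instances_needed_to_saturate(max_connections, query_seconds, interval_seconds):
    """How many job instances it would take to fill the pool on average."""
    per_instance = average_busy_connections(query_seconds, interval_seconds)
    return max_connections / per_instance


# Figures from the incident context: 2.4 s per run, one run every 30 s.
per_job = average_busy_connections(2.4, 30)             # about 0.08
to_fill = instances_needed_to_saturate(500, 2.4, 30)    # about 6250
```

Under these assumptions a single scheduler occupies well under one connection on average, so it would be worth checking how many replicas run the job and whether connections are held after each run before closing out the leak hypothesis.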
Investigation checklist
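- Confirm whether the database connection count has reached the limit
kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
Expected: a value close to 500, confirming connection exhaustion (idle sessions also count toward max_connections, so they are not filtered out)
- Review slow queries and confirm the full table scan
kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "SELECT query, calls, total_time, rows FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"
Expected: SELECT * FROM ledger_entries WHERE status='pending' appears as the most expensive query (requires the pg_stat_statements extension; on PostgreSQL 13 and later the column is total_exec_time rather than total_time)
- Check the payment-svc deployed version and changes
kubectl rollout history deployment/payment-svc -n prod | tail -5
Expected: the latest revision corresponds to v2.41 (rollout history lists revisions and change causes only, so confirm the 13:50 UTC deploy time from the deploy log)
- Check the log output of the new scheduled job
kubectl logs -n prod -l app=payment-svc --since=30m | grep -i 'settlement\|batch\|cron' | head -20
Expected: job execution log lines whose timestamps line up with the slow query log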
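The last checklist item asks whether job runs line up with the slow query timestamps. One way to make that comparison concrete is to compute the median gap between observed runs and compare it with the schedule. A standard five-field cron expression `*/30 * * * *` fires at minutes 0 and 30 of each hour (every 1800 seconds), while the slow query log reports a run every 30 seconds. The sketch below is illustrative only: the function name and the sample timestamps are hypothetical, and real values would come from the job logs or the slow query log.

```python
from datetime import datetime
from statistics import median

# Interval implied by a five-field cron expression "*/30 * * * *".
CRON_EVERY_30_MIN_SECONDS = 30 * 60


def median_interval_seconds(timestamps):
    """Median gap in seconds between consecutive HH:MM:SS timestamps.

    Assumes all timestamps fall on the same day.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps to compute an interval")
    parsed = sorted(datetime.strptime(t, "%H:%M:%S") for t in timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]
    return median(gaps)


# Hypothetical sample, NOT taken from the incident logs.
sample = ["14:00:00", "14:00:30", "14:01:00", "14:01:30"]
observed = median_interval_seconds(sample)
matches_cron = observed == CRON_EVERY_30_MIN_SECONDS
```

If the observed interval is 30 seconds while the configured schedule implies 1800 seconds, the job is probably being triggered by something other than the cron expression in the changelog, which is worth resolving before re-enabling it.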
Mitigation plan
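Immediately disable the newly deployed scheduled job (roll back the configuration or remove the cron entry)
Risk: Temporarily pauses settlement batch processing without affecting core payment functionality. Low risk.
Rollback: Re-enable the scheduled job (if it is managed through configuration, restoring the configuration is enough)
Raise the database max_connections to 1000 as a temporary mitigation
Risk: Each additional connection increases database memory pressure, and idle CPU does not offset that. Changing max_connections also requires a PostgreSQL restart, which briefly interrupts the payment path, so apply this only if disabling the job is not enough.
Rollback: Restore max_connections to 500 (this also requires a restart)
Add an index on ledger_entries.status to optimize the query
Risk: A plain CREATE INDEX blocks writes to the table while it builds over ~12M rows. CREATE INDEX CONCURRENTLY avoids blocking writes at the cost of a slower build.
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;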
Customer impact
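Users are unable to complete checkout and receive HTTP 500 errors (12% error rate at peak). Scope: any user attempting checkout may be affected. Estimated time to recovery: to be determined.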
Postmortem draft
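Summary
payment-svc experienced high latency and a high error rate starting at 14:02 UTC, lasting about [FILL IN] minutes. The root cause was the scheduled job introduced by the v2.41 deploy, which exhausted the database connection pool.
Timeline (UTC)
- 13:50: v2.41 deployed, adding the scheduled job
- 14:02: Latency spikes, error rate rises
- 14:03: CS reports failed checkouts
- 14:04: Pager fires
- 14:06: Database initially misjudged as healthy
- [FILL IN]: Scheduled job disabled, service recovers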
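Impact
- Error rate 12%, p99 latency 4.8s
- Users unable to check out
- Estimated number of affected users: [FILL IN]
Root cause
The new scheduled job runs a full table scan every 30 seconds and consumes a large number of connections, exhausting the connection pool.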
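Detection
- Monitoring alerts fired promptly
- No alert was configured on slow queries
Response
- Roll back the scheduled job
- Raise the connection limit
- Add the index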
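What went well
- The connection pool problem was identified quickly
- Deploy history was traceable
What went poorly
- The scheduled job was not performance tested
- Slow query alerting was missing
Action items
- [ ] Add an index on ledger_entries.status
- [ ] Set up slow query alerting
- [ ] Add performance checks to the code review process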
Follow-ups
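- P0: Add an index on the ledger_entries.status column · database administrators
- P1: Set up slow query alerting (threshold >1s) · platform team
- P1: Add performance checks for scheduled jobs and database queries to the code review process · payment platform team
- P2: Add connection pool monitoring alerts (utilization >80%) · SRE on-call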
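The two alert thresholds in the follow-ups (slow queries over 1s, pool utilization over 80%) can be sanity-checked against the numbers from this incident. This is a minimal sketch of the threshold logic only, not a configuration for any particular monitoring system; the function names and alert labels are hypothetical.

```python
def pool_utilization(active, max_connections):
    """Fraction of the connection limit currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def evaluate_alerts(active, max_connections, slowest_query_seconds,
                    pool_threshold=0.80, slow_query_threshold_s=1.0):
    """Return the (hypothetical) alert names that would fire for these readings."""
    alerts = []
    if pool_utilization(active, max_connections) > pool_threshold:
        alerts.append("connection_pool_utilization_high")
    if slowest_query_seconds > slow_query_threshold_s:
        alerts.append("slow_query")
    return alerts


# Readings from the incident context: 500 / 500 connections, ~2.4 s query.
incident_alerts = evaluate_alerts(500, 500, 2.4)
```

With the incident readings both conditions are met, which suggests either alert would have pointed at the database before the 14:06 "CPU is fine" assessment, assuming the alerts are evaluated at least as often as the existing error-rate pager.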
Similar past incidents
lexical match (pg_trgm)
- 78%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 76%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 44%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 44%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts