[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:35:10 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC: payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03: CS reports failed checkout volume rising
- 14:04: Pager: PaymentSvcErrorRateHigh
- 14:06: "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
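At 14:02 UTC, payment-svc p99 latency jumped from 120ms to 4.8s and the error rate rose to 12%; customers reported failed checkouts. The root cause is a scheduled job introduced by the v2.41 deploy that runs a full table scan of ledger_entries (no index on status) every 30 seconds, exhausting the database connection pool (500/500) so that normal queries were refused.
Severity reasoning: User-visible checkout failures (12% error rate) directly affect revenue. The problem was introduced by a code change and is expected to be recoverable by rollback, which meets the SEV1 definition.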
deepseek-chat·prompt v1·output: zh·10448ms·1769↑ / 1527↓ tok·$0.00216
Root cause hypotheses
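- high: The scheduled job added in v2.41 runs a full-table-scan query every 30 seconds and exhausts the database connection pool.
Evidence: The slow query log shows SELECT * FROM ledger_entries WHERE status='pending' running every 30 seconds at about 2.4s per execution, with no index on status; the deploy time (13:50 UTC) lines up with the incident onset (14:02 UTC). Open question: the changelog's cron expression */30 * * * * means every 30 minutes in standard five-field cron, which does not match the 30-second cadence in the slow query log, so the actual schedule should be confirmed.
- medium: The database max_connections setting is too low to handle normal traffic and the scheduled job concurrently.
Evidence: active_connections reached 500/500 while CPU was only 35%, which points to connection count rather than compute as the bottleneck.
- low: The scheduled job code leaks connections and does not release them back to the pool.
Evidence: Connections stay pinned at the limit, but there is no direct evidence; the job's connection pool usage needs a code review.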
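To weigh the hypotheses above, it can help to estimate how many connections the batch query would hold on its own. The sketch below applies Little's law (mean in-flight work = arrival rate × mean duration) to the figures from the incident context. It assumes each execution holds exactly one connection for the full 2.4s and then releases it; the function names and the `runners` parameter (how many processes fire the job independently) are hypothetical and not taken from payment-svc.

```python
def mean_connections_held(interval_s: float, duration_s: float, runners: int = 1) -> float:
    """Little's law: mean in-flight queries = arrival rate * mean duration.

    Assumes one connection per execution, released when the query ends.
    """
    if interval_s <= 0 or duration_s <= 0 or runners < 0:
        raise ValueError("interval and duration must be positive, runners non-negative")
    arrival_rate = runners / interval_s  # executions per second
    return arrival_rate * duration_s


def runners_to_saturate(max_connections: int, interval_s: float, duration_s: float) -> float:
    """How many independent runners would be needed to hold max_connections on average."""
    return max_connections / mean_connections_held(interval_s, duration_s)


if __name__ == "__main__":
    # Figures from the incident context: one run per 30s, ~2.4s per run, 500 max.
    print(round(mean_connections_held(30, 2.4), 3))
    print(round(runners_to_saturate(500, 30, 2.4)))
```

Under these assumptions a single runner holds about 0.08 connections on average, and roughly 6250 runners would be needed to fill 500 slots. That suggests the scan alone does not account for the exhaustion and that some amplifier is involved (many concurrent runners, connections not being released as in the low hypothesis, or ordinary requests holding connections longer while the scan runs). It does not rule out the top hypothesis; it narrows what the investigation needs to confirm.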
Investigation checklist
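- Confirm whether the database connection pool is exhausted by the scheduled job
kubectl exec -n prod deploy/payment-svc -- pg_isready -h postgres-primary -U app # check database reachability
Expected: 'accepting connections', 'rejecting connections', or 'no response'. pg_isready tests reachability only and may still report 'accepting connections' when max_connections is reached, so use the pg_stat_activity count in the last step to confirm exhaustion.
- Review slow queries and confirm the full-table-scan query
kubectl exec -n prod deploy/postgres-primary -- psql -c "SELECT query, calls, total_time, rows FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"
Expected: SELECT * FROM ledger_entries WHERE status='pending' has the highest total time. This requires the pg_stat_statements extension; on PostgreSQL 13 and later the column is named total_exec_time. The rows column counts rows returned across all calls, not rows scanned, so it reflects the number of pending entries rather than the table size.
- Check whether the scheduled job is running
kubectl logs -n prod deploy/payment-svc --since=10m | grep -i 'settlement\|batch\|cron'
Expected: a line such as 'Running settlement batch' or similar, appearing every 30 seconds
- Confirm the deployed version and change
kubectl rollout history -n prod deploy/payment-svc | tail -5
Expected: the latest revision corresponds to v2.41. rollout history lists revisions and change causes but not timestamps, so confirm the deploy time (about 13:50 UTC) from the deploy pipeline or the ReplicaSet creation time.
- Check whether the database connection count has reached the limit
kubectl exec -n prod deploy/postgres-primary -- psql -c "SELECT count(*) FROM pg_stat_activity;"
Expected: about 500 (at or near max_connections). On PostgreSQL 10 and later the count also includes background processes, and when regular slots are exhausted the query itself may need a superuser connection.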
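The log check above expects the job to appear every 30 seconds, while the changelog quotes the cron expression */30 * * * *. In standard five-field cron the first field is minutes, not seconds. The sketch below expands a cron minute field to show which minutes of the hour it matches. It is a simplified illustration that handles only `*`, `*/N`, and comma lists; `expand_minute_field` is a hypothetical helper, and whether payment-svc uses a scheduler with a seconds field is not known from the incident context.

```python
from typing import List


def expand_minute_field(field: str) -> List[int]:
    """Expand a cron minute field of the form '*', '*/N', or 'A,B,C'.

    Simplified on purpose: ranges such as '10-20' or '10-20/5' are not handled.
    """
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    minutes = sorted(int(part) for part in field.split(","))
    if any(m < 0 or m > 59 for m in minutes):
        raise ValueError("minute out of range 0-59")
    return minutes


if __name__ == "__main__":
    # First field of the changelog's expression "*/30 * * * *"
    print(expand_minute_field("*/30"))  # [0, 30]
```

Read as standard cron, the expression fires at minute 0 and minute 30 of every hour, so a 30-second cadence in the logs would point to a different schedule definition or to something else re-triggering the job.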
Mitigation plan
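Immediately roll back payment-svc to v2.40 to remove the scheduled job.
Risk: Possible brief unavailability during the rollback (about 1 minute), but safer than the current state.
Rollback: Redeploy v2.41 (kubectl rollout undo -n prod deploy/payment-svc --to-revision=<previous>)
If rollback is not feasible, temporarily raise the database max_connections and add an index on ledger_entries.status.
Risk: More connections can exhaust memory or trigger an OOM, and changing max_connections requires a PostgreSQL restart. A plain CREATE INDEX blocks writes to the table while the index builds; CREATE INDEX CONCURRENTLY avoids that at the cost of a slower build.
Rollback: Restore max_connections to its original value (500); drop the index (DROP INDEX IF EXISTS idx_ledger_entries_status;)
Manually terminate the blocking scheduled-job queries.
Risk: Terminating a query rolls back its transaction, but there is no risk of data corruption.
Rollback: None. Termination cannot be undone, but the job can be re-run.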
Customer impact
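About 12% of checkout requests fail; users see a 500 error and cannot complete payment. The impact continues until the issue is resolved. Recovery is expected within 5 minutes of the rollback.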
Postmortem draft
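Postmortem: payment service high latency and error rate
Summary
[2-3 sentence summary]
Timeline
- 13:50 UTC: v2.41 deployed, introducing the scheduled job
- 14:02 UTC: Latency spikes, error rate rises
- 14:04 UTC: Alert fires
Impact
- Affected users: all users of the payment service
- Error rate: 12%
- Latency: p99 4.8s
Root cause
The scheduled job added in v2.41 runs a full table scan of ledger_entries (no index on status) every 30 seconds, exhausting the database connection pool so that normal queries were refused.
What went well
- The deploy change was identified quickly
- The slow query log provided clear evidence
What went poorly
- The database was initially misjudged as healthy (based on CPU alone)
- Code review did not catch the full-table-scan risk
Action items
- [ ] Add an index on ledger_entries.status
- [ ] Add connection pool limits and timeouts to the scheduled job
- [ ] Run performance tests before deploying
- [ ] Add monitoring and alerting for the database connection pool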
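One way to read the action item on connection limits and timeouts for the scheduled job is a small budget that caps how many connections the batch may hold at once and fails fast instead of queueing. The following is a minimal sketch of that idea using only the standard library. `BoundedConnectionBudget` and its parameters are hypothetical names, and the sketch guards a code section rather than a real database driver's pool.

```python
import threading
from contextlib import contextmanager


class BoundedConnectionBudget:
    """Caps concurrent connection use by the batch job (illustrative only)."""

    def __init__(self, max_in_use: int, acquire_timeout_s: float):
        if max_in_use <= 0:
            raise ValueError("max_in_use must be positive")
        self._slots = threading.BoundedSemaphore(max_in_use)
        self._timeout = acquire_timeout_s

    @contextmanager
    def slot(self):
        # Fail fast when the budget is used up instead of waiting indefinitely.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("batch connection budget exhausted")
        try:
            yield
        finally:
            # Always hand the slot back, even if the query raised.
            self._slots.release()


if __name__ == "__main__":
    budget = BoundedConnectionBudget(max_in_use=2, acquire_timeout_s=0.5)
    with budget.slot():
        pass  # open a connection and run the batch query here
```

The `finally` block is the part that protects against the leak described in the low-likelihood hypothesis. A server-side limit such as PostgreSQL's statement_timeout setting would complement this by bounding how long any single batch query can run.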
Follow-ups
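- P0: Add an index on the ledger_entries.status column to prevent full table scans (owner: database administrators)
- P0: Review the v2.41 scheduled job code and add connection pool limits and a retry mechanism (owner: service owner)
- P1: Add an alert on database connection pool utilization (>80%) (owner: platform team)
- P1: Add a slow-query detection gate to the deploy pipeline (owner: platform team)
- P2: Update the postmortem document and schedule a review meeting (owner: on-call SRE)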
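The utilization alert in the follow-ups can be expressed as a simple threshold check. The sketch below shows the arithmetic with the figures from this incident; the function names are hypothetical, and a real alert would normally live in the monitoring system's rule language rather than in application code.

```python
def pool_utilization(active: int, max_connections: int) -> float:
    """Fraction of the connection limit currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def should_alert(active: int, max_connections: int, threshold: float = 0.80) -> bool:
    """True when utilization is strictly above the threshold (>80% by default)."""
    return pool_utilization(active, max_connections) > threshold


if __name__ == "__main__":
    print(should_alert(500, 500))  # True: the state observed during the incident
    print(should_alert(400, 500))  # False: exactly 80% is not above the threshold
```

With a limit of 500, the alert would fire from 401 active connections upward, which leaves some headroom before clients start seeing "too many clients already".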
Similar past incidents
lexical match (pg_trgm)
- 85%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 71%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 47%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 43%: [Scenario] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
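The match scores above are labelled as lexical matches from pg_trgm, which compares strings by the three-character sequences they share. The sketch below approximates that measure in plain Python: lowercase the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space, and divide the number of shared trigrams by the number of distinct trigrams in either string. It is an ASCII-only approximation for illustration, not the extension's implementation, and it is not expected to reproduce the percentages listed here, since the text fields actually compared by the tool are not known.

```python
import re
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Trigram set in the style of pg_trgm (ASCII-only approximation)."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces before, one after
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    print(round(similarity("word", "two words"), 4))
```

Because the measure is purely lexical, incidents whose text differs mainly in language or wording can score low even when they describe the same scenario.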