[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:48:52 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
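The raw context describes the batch schedule in three ways: the changelog calls it "nightly", it quotes the cron expression */30 * * * *, and the slow query log shows the query every 30s. In standard five-field cron the first field is minutes, so the quoted expression would fire every 30 minutes. The sketch below expands only the minute field to make that concrete. The helper name cron_minute_field is hypothetical and it handles only the "*", "*/N" and "N" forms. Whether the job actually uses standard cron semantics is an assumption worth confirming with the service owner.

```python
def cron_minute_field(field: str) -> list[int]:
    """Expand a cron minute field ('*', '*/N' or 'N') into the minutes it matches."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(0, 60, step))
    return [int(field)]


# Expression quoted in the v2.41 changelog.
schedule = "*/30 * * * *"
minute_field = schedule.split()[0]

fire_minutes = cron_minute_field(minute_field)
# The hour field is '*', so the job would fire in every hour of the day.
runs_per_day = len(fire_minutes) * 24

print(f"fires at minutes {fire_minutes} of every hour, {runs_per_day} runs per day")
```

If the observed 30s interval is accurate, the effective schedule differs from the quoted expression, which is a useful question for the v2.41 change review.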
Summary
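At 14:02 UTC, payment-svc p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12%, with customers reporting failed checkouts. The root cause is the scheduled job introduced by the v2.41 deploy (a full-table-scan query observed every 30 seconds), which exhausted the available PostgreSQL connections (500 / 500 max_connections) so that new connections were rejected.
Severity reasoning: User-visible checkout failures (error rate > 1% for more than 5 minutes) on a revenue path meet the SEV1 criteria.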
Root cause hypotheses
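- high: The scheduled job added in payment-svc v2.41 (SELECT * FROM ledger_entries WHERE status='pending') runs every 30 seconds. With no index on status, each run is a full table scan (~12M rows, ~2.4s) that ties up a connection, and connections accumulate until max_connections is exhausted.
Evidence: Application logs show 'too many clients already'; the slow query log shows this query running every 30 seconds at ~2.4s per execution; the deploy time lines up with the start of the incident.
- medium: Database connection leak: application code does not close connections properly, so the connection count keeps growing.
Evidence: active_connections is at the 500 limit while CPU is only 35%, which suggests connections are being held rather than doing compute-heavy work.
- low: Network problems interrupting or delaying connections from payment-svc to the database.
Evidence: Application logs show 'connection refused', but the database itself looks normal (CPU 35%) and there are no network alerts.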
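A rough occupancy estimate can help weigh the high and medium hypotheses against each other. The sketch below applies Little's law (average items in a system = arrival rate × time in system) to the figures from the slow query log. It assumes each run opens one connection and releases it when the query finishes; the function name and the runners parameter are illustrative, not taken from the service.

```python
def average_connections_held(query_seconds: float,
                             interval_seconds: float,
                             runners: int = 1) -> float:
    """Little's law: average connections busy = runs per second * seconds per run."""
    return runners * query_seconds / interval_seconds


# Figures from the raw incident context: ~2.4s per execution, one run every 30s.
per_runner = average_connections_held(query_seconds=2.4, interval_seconds=30.0)

# Number of independent runners needed to keep 500 connections busy,
# assuming (hypothetically) that every run releases its connection on completion.
runners_needed = 500 / per_runner

print(f"{per_runner:.2f} connections held per runner on average")
print(f"about {runners_needed:.0f} runners would be needed to hold 500 connections")
```

Under that assumption a single well-behaved runner holds far less than one connection on average, so reaching 500 would require the job to run on many instances at once, to overlap with itself, or to leave connections open after each run. Checking which of these applies is one way to separate the first two hypotheses.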
Investigation checklist
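- Check the current database connection count and waiting queries
SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend';
SELECT count(*) FROM pg_stat_activity WHERE state = 'active' AND wait_event_type IS NOT NULL;
Expected: Connection count close to 500 and waiting query count > 0. Idle connections also count toward max_connections, so the first query counts all client backends (backend_type requires PostgreSQL 10 or later).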
- Look for the full-table-scan query among the currently active queries
SELECT pid, query_start, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND query NOT LIKE '%pg_stat%' ORDER BY duration DESC LIMIT 10;
Expected: SELECT * FROM ledger_entries WHERE status='pending' appears repeatedly among the longest-running queries.
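- Check the payment-svc deployed version and recent changes
kubectl rollout history deployment/payment-svc -n prod | tail -5
Expected: The latest revision corresponds to v2.41. Cross-check the ~13:50 UTC deploy time against the deploy history, since rollout history lists revisions and change causes only.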
- Check the payment-svc logs for connection errors
kubectl logs -n prod -l app=payment-svc --since=15m | grep -E 'too many clients|connection refused' | head -20
Expected: A large number of connection-refused errors.
- Check the database max_connections setting
SHOW max_connections;
Expected: Shows 500.
Mitigation plan
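1. Immediately disable the offending scheduled job, via configuration management or by removing the cron entry, so that no new runs of the query start.
Risk: The settlement batch is paused while the job is disabled, but it can be triggered manually. No risk of data loss.
Rollback: Re-enable the scheduled job (restore the configuration or re-add the cron entry).
2. Temporarily raise the database max_connections to 1000 and restart the connection pool to release some connections.
Risk: More connections can increase memory pressure on the database, although CPU is currently low. A change to max_connections only takes effect after a PostgreSQL restart, which drops all existing connections, so this step itself causes a brief interruption.
Rollback: Set max_connections back to 500 and restart the database.
3. If connections are still not released, forcibly terminate the blocking queries (in PostgreSQL, typically with pg_terminate_backend on the offending pids).
Risk: Terminating queries may roll back in-flight transactions, but it does not corrupt data.
Rollback: None, but the terminated queries can be re-run.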
Customer impact
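Some users cannot complete checkout and receive 500 errors. An estimated 12% of users are affected; this figure is inferred from the error rate, which measures failed requests rather than distinct users. Recovery is expected within 5 minutes of the fix.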
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
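- 13:50 - payment-svc v2.41 deployed, introducing the scheduled job
- 14:02 - p99 latency spikes, error rate rises
- 14:03 - CS reports failed checkouts
- 14:04 - Pager alert fires
- 14:06 - Database initially misjudged as healthy
- [FILL IN] - Root cause identified and scheduled job disabled
- [FILL IN] - Service recovered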
Impact
- p99 latency 4.8s, error rate 12%, lasting about [FILL IN] minutes
- [FILL IN] users affected
Root Cause
The scheduled job added in payment-svc v2.41 runs a full table scan and exhausted the database connections.
Detection
Pager alert and CS reports.
Response
- Disable the scheduled job
- Raise max_connections
- Terminate blocking queries
What Went Well
- Alerting was timely
- Logs were clear
What Went Poorly
- The database was initially misjudged as healthy
- A missing index led to the problem
Action Items
- [FILL IN]
Follow-ups
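- P0: Add an index on ledger_entries.status to prevent full table scans (owner: database team)
- P0: Review the payment-svc v2.41 changes and make sure the scheduled job has appropriate concurrency control and connection management (owner: payment-svc service owner)
- P1: Add a monitoring alert on database connection pool usage (connection usage > 80%) (owner: observability team)
- P1: Update the deploy process to require performance testing of all new queries in the staging environment (owner: platform engineering team)
- P2: Write a runbook for database connection exhaustion (owner: SRE on-call team)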
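The P1 connection-usage alert can be expressed as a simple threshold check. The sketch below is illustrative only: the function names are hypothetical, the 80% threshold comes from the follow-up item above, and a real alert would live in the monitoring system and would normally require the condition to hold for some duration before paging.

```python
def pool_usage(active_connections: int, max_connections: int) -> float:
    """Fraction of max_connections currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active_connections / max_connections


def should_alert(active_connections: int,
                 max_connections: int,
                 threshold: float = 0.80) -> bool:
    """True when usage is strictly above the threshold (80% per the follow-up item)."""
    return pool_usage(active_connections, max_connections) > threshold


# Values from the incident: 500 of 500 connections in use.
print(should_alert(500, 500))
```

With this rule the alert would already have been firing by the time usage passed 400 of 500 connections, before clients started seeing 'too many clients already'.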
Similar past incidents
lexical match (pg_trgm)
- 92%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 83%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 48%
payment-svc DB connection storm
p99 latency 4.8s, 12% 500s, checkouts failing
- 46%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
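The percentages above are labelled as pg_trgm lexical matches. As background, pg_trgm scores two strings by the share of three-character sequences they have in common. The sketch below is a simplified Python approximation of that idea, not the extension itself: it lowercases, splits on non-alphanumeric ASCII characters, pads each word with two leading spaces and one trailing space, and divides shared trigrams by total distinct trigrams. Which fields this tool actually compares is not stated in the report, so the sketch is not expected to reproduce the listed scores.

```python
import re


def trigrams(text: str) -> set[str]:
    """Trigram set per word, padded with two leading spaces and one trailing space."""
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


score = similarity(
    "payment-svc DB connection storm",
    "Payment service connection pool exhaustion after batch job deploy",
)
print(f"approximate trigram similarity: {score:.2f}")
```

Because the score is purely lexical, entries that describe the same incident in different words can rank low, and near-duplicate titles rank high regardless of root cause.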