[Eval][v3][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:50:18 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
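At 14:02 UTC, payment-svc p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12%, with customers reporting failed checkouts. The most likely root cause is the scheduled job introduced by the v2.41 deploy: it runs a full table scan that the slow query log shows every 30 seconds, and the database ran out of connections (500/500).
Severity reasoning: user-visible checkout failures, an error rate above 1% (12% observed) sustained for more than 5 minutes, and an interrupted revenue path together meet the SEV1 criteria.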
deepseek-chat·prompt v3·output: zh·10068ms·2401↑ / 1552↓ tok·$0.00236
Root cause hypotheses
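- high: The scheduled job added in payment-svc v2.41 runs a full table scan every 30 seconds and exhausts the database's connections.
Evidence: deploy history shows v2.41 shipped at 13:50 UTC with a new cron job ('*/30 * * * *'); the slow query log shows SELECT * FROM ledger_entries WHERE status='pending' running every 30 seconds at ~2.4s per execution with no index on status; active_connections is at 500/500. Note: the cron expression '*/30 * * * *' fires every 30 minutes, not every 30 seconds, so the 30-second interval in the slow query log still has to be reconciled with the job's schedule.
- medium: A connection leak, with the application failing to release connections properly.
Evidence: connections are at the limit, but the application logs contain no leak-related errors; the slow query keeps connections occupied for longer, yet there is no direct evidence of a leak.
- low: A database node failure or network partition is causing connections to pile up.
Evidence: database CPU is normal (35%), there are no network errors in the logs, and there is no sign of node failure.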
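As a quick sanity check on the "high" hypothesis, the arithmetic below estimates how many connections the batch query holds on average, using Little's law (average in-flight work = arrival rate × duration). The 2.4s duration, 30s interval, and 500-connection limit come from the incident context; the `runners` parameter (how many pods or workers execute the job concurrently) is an assumption, since the incident data does not state it. This is a rough sketch, not a model of the actual service.

```python
def avg_connections_held(query_seconds: float, interval_seconds: float, runners: int = 1) -> float:
    """Average connections busy with the batch query (Little's law: L = rate * duration)."""
    arrival_rate = runners / interval_seconds  # query starts per second across all runners
    return arrival_rate * query_seconds


def runners_needed_to_saturate(max_connections: int, query_seconds: float, interval_seconds: float) -> float:
    """Concurrent runners needed for the batch query alone to occupy every connection."""
    return max_connections / avg_connections_held(query_seconds, interval_seconds)


# Figures from the incident context: 2.4s per execution, one run every 30s, 500 max connections.
single_runner = avg_connections_held(2.4, 30.0)
print(f"{single_runner:.2f}")                                # 0.08
print(f"{runners_needed_to_saturate(500, 2.4, 30.0):.0f}")   # 6250
```

Under these figures a single runner keeps well under one connection busy on average, so it may be worth confirming how many instances run the job, whether each run opens more than one connection, and whether connections are returned to the pool after the query completes.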
Investigation checklist
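- Check the database's current connection count and waiting queries
SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'; SELECT count(*) FROM pg_stat_activity WHERE wait_event_type IS NOT NULL;
Expected: the connection count is close to 500 and the number of waiting queries is high (currently 87).
- Confirm the source and frequency of the slow query
SELECT query, calls, total_time, rows FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
Expected: SELECT * FROM ledger_entries WHERE status='pending' appears with a high call count and high total time. On PostgreSQL 13 and later the column is named total_exec_time instead of total_time.
- Check the deployed payment-svc version and the cron configuration
kubectl describe deployment payment-svc -n prod | grep -A5 'Image:' ; kubectl get cronjobs -n prod -o yaml | grep -A10 'schedule:'
Expected: the current image is v2.41 and a cron job scheduled every 30 minutes exists.
- Review the application logs for database connection errors
kubectl logs -n prod -l app=payment-svc --since=30m | grep -iE 'too many clients|connection refused' | tail -20
Expected: a large number of 'FATAL: sorry, too many clients already' errors.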
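When checking the cron configuration, it helps to confirm what the schedule from the changelog actually means. The sketch below expands a single cron field into the values it matches; `expand_cron_field` is a hypothetical helper written for this illustration (it handles only `*`, `*/step`, ranges, and comma lists), not part of any cron library.

```python
def expand_cron_field(field: str, low: int, high: int) -> list[int]:
    """Expand one cron field ('*', '*/step', 'a-b', 'a-b/step', 'n', comma lists) into its values."""
    values: set[int] = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(base)
        values.update(range(start, end + 1, step))
    return sorted(values)


# Minute field of the schedule quoted in the v2.41 changelog: '*/30 * * * *'
minutes = expand_cron_field("*/30", 0, 59)
print(minutes)            # [0, 30]
print(len(minutes) * 24)  # 48 runs per day, since the hour field is '*'
```

The schedule fires at minute 0 and minute 30 of every hour, so a query observed every 30 seconds is presumably driven by something other than this cron expression alone, for example an in-process timer or a retry loop.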
Mitigation plan
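Immediately disable the scheduled job that triggers the full table scan (roll back the deploy or delete the cron job).
Risk: temporarily disabling the job may affect the nightly settlement batch, but the batch can be triggered manually.
Rollback: re-enable the cron job: kubectl create cronjob ... or redeploy v2.41.
Temporarily raise the database's max_connections to restore service (for example, to 800). max_connections only takes effect after a PostgreSQL restart.
Risk: more connections consume more memory, and the required restart briefly disconnects every database client; CPU is normal, so the added load itself is a manageable risk.
Rollback: restore max_connections to 500: ALTER SYSTEM SET max_connections = 500; then restart PostgreSQL (SELECT pg_reload_conf(); does not apply this parameter).
Create an index on ledger_entries.status to speed up the slow query.
Risk: a plain CREATE INDEX blocks writes while it builds; building the index online with CONCURRENTLY avoids this.
Rollback: drop the index: DROP INDEX CONCURRENTLY IF EXISTS idx_ledger_entries_status;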
Customer impact
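Users cannot complete checkout and receive HTTP 500 errors. Scope: users attempting checkout, with roughly 12% of requests failing. Estimated time to recovery: to be determined.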
Postmortem draft
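Postmortem
Summary
At 14:02 UTC, payment-svc exhausted the database's connections, driving p99 latency to 4.8s and the error rate to 12%. The root cause is the full table scan run by the scheduled job added in v2.41.
Timeline (UTC)
- 13:50 - v2.41 deployed, introducing the scheduled job
- 14:02 - Latency and error rate spike
- 14:03 - CS reports failed checkouts
- 14:04 - Alert fires
- [FILL IN] - Scheduled job disabled
- [FILL IN] - Service recovered
Impact
- Error rate of 12%, lasting about [FILL IN] minutes
- Affected users: [FILL IN]
Root cause
- Direct cause: full table scan exhausting database connections
- Trigger: v2.41 deploy
Detection
- Alert: PaymentSvcErrorRateHigh fired
- Improvement: slow query monitoring should have caught this earlier
Response
- Initial misdiagnosis: the database was judged healthy because CPU was normal
- Follow-up: disable the scheduled job, raise the connection limit, create the index
What went well
- The slow query was located quickly
What went poorly
- SQL performance was not reviewed before the deploy
- No monitoring of connection pool usage
Action items
- [FILL IN]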
Follow-ups
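- P0: Add an index on ledger_entries.status and monitor query performance (owner: database administrators)
- P0: Review the scheduled job code in payment-svc v2.41 and add a connection pool limit and timeouts (owner: payment service team)
- P1: Add a SQL performance review step (for example, EXPLAIN) to the deploy pipeline (owner: platform engineering team)
- P1: Add an alert on database connection pool utilization (>80%) (owner: observability team)
- P2: Update the runbook with steps for handling database connection exhaustion (owner: on-call SRE)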
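One way to read the "connection pool limit and timeouts" follow-up is a per-job budget that caps how many connections the batch job may hold and fails fast when the budget is used up. The sketch below shows that idea with the standard library only; `ConnectionBudget`, the limit of 2, and the 0.05s wait are hypothetical values chosen for illustration, and a real service would more likely rely on its pool library's own size limit together with PostgreSQL's statement_timeout.

```python
import threading
from contextlib import contextmanager


class ConnectionBudget:
    """Caps how many connections one job may hold; waits at most `timeout` seconds for a slot."""

    def __init__(self, max_in_use: int, timeout: float) -> None:
        self._slots = threading.BoundedSemaphore(max_in_use)
        self._timeout = timeout

    @contextmanager
    def slot(self):
        # Fail fast instead of queueing indefinitely behind a saturated database.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("batch job connection budget exhausted")
        try:
            yield
        finally:
            self._slots.release()


budget = ConnectionBudget(max_in_use=2, timeout=0.05)

with budget.slot():
    with budget.slot():
        try:
            with budget.slot():  # a third concurrent slot exceeds the budget
                third_acquired = True
        except TimeoutError:
            third_acquired = False

print(third_acquired)  # False
```

With a budget like this, a slow or stuck batch run can hold at most its own share of connections, leaving the rest available to checkout traffic.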
Similar past incidents
lexical match (pg_trgm)
- 76%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 75%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 44%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts