[Eval][v3][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:50:33 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
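At 14:02 UTC, payment-svc p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12%, with customers reporting failed checkouts. The root cause is the scheduled job introduced by the v2.41 deploy, which runs a full table scan every 30 seconds and exhausts the database connection pool.
Severity reasoning: User-visible checkout failures, an error rate above 1% for more than 5 minutes (12%), and impact on the revenue path together meet the SEV1 criteria.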
deepseek-chat · prompt v3 · output: zh · 10109ms · 2401↑ / 1446↓ tok · $0.00224
Root cause hypotheses
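- high: The newly deployed scheduled job runs a full table scan every 30 seconds, exhausting the database connection pool
Evidence: The slow query log shows SELECT * FROM ledger_entries WHERE status='pending' running every 30 seconds at about 2.4s per execution, with no index on status; active_connections has reached max_connections (500); application logs report 'too many clients already'. The 30-second cadence comes from the slow query log; the changelog's cron expression */30 * * * * on its own denotes every 30 minutes, so the actual trigger interval should be confirmed.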
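- medium: Database connection leak; connections are not being released properly
Evidence: The connection count stays at the limit, but there is no direct evidence of a leak; CPU is normal, which rules out a compute bottleneck.
- low: Database node failure or network problem
Evidence: CPU at 35% is normal, there are no network error logs, and the connection refusals originate at the application layer rather than the network layer.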
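As a rough sanity check on the high-confidence hypothesis, the average number of connections a periodic query keeps busy can be estimated as duration divided by interval. The sketch below assumes each execution holds exactly one connection for the full 2.4s and that the number of processes running the job is known; both are assumptions rather than facts from the incident data, and avg_busy_connections is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: average number of connections kept busy by a periodic query.
# Assumes one connection per execution, released as soon as the query finishes.
avg_busy_connections() {
  duration_s="$1"   # seconds per execution (2.4 in the slow query log)
  interval_s="$2"   # seconds between executions (30 in the slow query log)
  instances="$3"    # number of processes running the job (assumed value)
  awk -v d="$duration_s" -v i="$interval_s" -v n="$instances" \
    'BEGIN { printf "%.2f\n", n * d / i }'
}

# One instance, 2.4s every 30s.
avg_busy_connections 2.4 30 1
```

Under these assumptions a single instance keeps well under one connection busy on average, far short of 500. That does not rule the hypothesis out, but it suggests also checking how many payment-svc replicas run the job and whether connections stay open after the query finishes.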
Investigation checklist
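- Confirm whether the database connection pool is full
kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
Expected: Returns a value close to 500, confirming that connections are exhausted. Idle connections are included in the count because they also consume max_connections slots.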
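- Inspect slow queries to confirm the full table scan
kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "SELECT query, calls, total_exec_time, rows FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5;"
Expected: SELECT * FROM ledger_entries WHERE status='pending' shows the highest total execution time. On PostgreSQL 12 and earlier the column is named total_time.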
- Check the newly deployed scheduled job
kubectl get cronjobs -n prod | grep payment-svc
Expected: Shows a cronjob named payment-svc-settlement with schedule */30 * * * *
- Review application logs to confirm the connection refusals
kubectl logs -n prod -l app=payment-svc --since=10m | grep -E 'too many clients|connection refused' | tail -20
Expected: A large number of 'FATAL: sorry, too many clients already' log lines
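To tell a burst of busy queries apart from connections that sit idle, it can help to break the connection count down by client and state. The following is a minimal sketch that reuses the pod name and namespace from the checklist commands; connection_breakdown is a hypothetical helper, and it assumes the relevant sessions are visible to the postgres user.

```shell
#!/bin/sh
# Hypothetical helper: break down PostgreSQL connections by client and state.
# Defaults (prod, postgres-primary-0) are taken from the checklist commands.
connection_breakdown() {
  namespace="${1:-prod}"
  pod="${2:-postgres-primary-0}"
  kubectl exec -n "$namespace" "$pod" -- psql -U postgres -c \
    "SELECT application_name, state, count(*) AS conns
       FROM pg_stat_activity
      GROUP BY application_name, state
      ORDER BY conns DESC
      LIMIT 10;"
}
```

Calling connection_breakdown with no arguments uses the defaults. Many rows in state 'idle' or 'idle in transaction' would point toward the leak hypothesis, while many 'active' rows would point toward queries piling up.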
Mitigation plan
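Immediately suspend the scheduled-job cronjob to release connections
Risk: Settlement batch processing stops while the cronjob is suspended, but it can be triggered manually; no risk of data loss
Rollback: Re-enable the cronjob: kubectl patch cronjob payment-svc-settlement -n prod -p '{"spec":{"suspend":false}}'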
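The suspend step mirrors the rollback command with the flag flipped. A minimal sketch follows, assuming the cronjob really is named payment-svc-settlement as the checklist expects; set_cronjob_suspend and cronjob_is_suspended are hypothetical wrapper names.

```shell
#!/bin/sh
# Hypothetical helper: set spec.suspend on a Kubernetes CronJob.
# Arguments: cronjob name, namespace, and "true" or "false".
set_cronjob_suspend() {
  name="$1"; namespace="$2"; suspend="$3"
  kubectl patch cronjob "$name" -n "$namespace" \
    -p "{\"spec\":{\"suspend\":$suspend}}"
}

# Hypothetical helper: print the current value of spec.suspend.
cronjob_is_suspended() {
  kubectl get cronjob "$1" -n "$2" -o jsonpath='{.spec.suspend}'
}
```

For example, set_cronjob_suspend payment-svc-settlement prod true suspends the job. Suspending only prevents future runs; a Job that has already started keeps running, so connections held by an in-flight run are not released by this step alone.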
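Temporarily raise the database max_connections to 800
Risk: May increase database memory pressure (idle CPU does not indicate memory headroom). Changing max_connections also requires a PostgreSQL restart to take effect, which drops existing connections
Rollback: Set it back to 500, then restart PostgreSQL (pg_reload_conf() does not apply this setting): kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "ALTER SYSTEM SET max_connections = 500;"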
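Because max_connections is only read at server start, a changed value stays pending until PostgreSQL restarts. The sketch below shows one way to check the live value and whether a change is still waiting for a restart, reusing the pod and namespace from the commands above; show_max_connections is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: show the live max_connections value and whether a
# changed value in the configuration is still waiting for a restart.
show_max_connections() {
  namespace="${1:-prod}"
  pod="${2:-postgres-primary-0}"
  kubectl exec -n "$namespace" "$pod" -- psql -U postgres -c \
    "SELECT name, setting, pending_restart FROM pg_settings WHERE name = 'max_connections';"
}
```

After ALTER SYSTEM and a configuration reload, pending_restart is expected to read t while setting still shows the old value, which is the sign that the new limit is not yet in force.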
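Create an index on ledger_entries.status
Risk: A plain CREATE INDEX blocks writes to the table while it builds; use CREATE INDEX CONCURRENTLY and test on a copy of the database first
Rollback: Drop the index: DROP INDEX IF EXISTS idx_ledger_entries_status;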
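A non-blocking index build could look like the sketch below. It assumes the index name idx_ledger_entries_status from the rollback step and that ledger_entries lives in the database psql connects to by default; create_status_index and status_index_is_valid are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: build the index without blocking writes.
# CREATE INDEX CONCURRENTLY must run outside a transaction block, so it is
# sent as a single statement.
create_status_index() {
  namespace="${1:-prod}"
  pod="${2:-postgres-primary-0}"
  kubectl exec -n "$namespace" "$pod" -- psql -U postgres -c \
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_ledger_entries_status ON ledger_entries (status);"
}

# Hypothetical helper: check that the concurrent build finished cleanly.
status_index_is_valid() {
  namespace="${1:-prod}"
  pod="${2:-postgres-primary-0}"
  kubectl exec -n "$namespace" "$pod" -- psql -U postgres -c \
    "SELECT indexrelid::regclass AS index_name, indisvalid FROM pg_index WHERE indexrelid = 'idx_ledger_entries_status'::regclass;"
}
```

A concurrent build that fails partway leaves an invalid index behind, so indisvalid should read t before the query is expected to use it. If it reads f, drop the index and build it again.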
Customer impact
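Affected users cannot complete checkout and receive HTTP 500 errors. Scope: every customer whose checkout goes through payment-svc is exposed; at peak about 12% of requests failed. Estimated time to recovery: TBD.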
Postmortem draft
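Summary
payment-svc suffered high latency and an elevated error rate after a newly deployed scheduled job exhausted the database connection pool.
Timeline (UTC)
- 13:50 - payment-svc v2.41 deployed, introducing the settlement batch cronjob
- 14:02 - p99 latency rose from 120ms to 4.8s; error rate rose to 12%
- 14:03 - CS reported failed checkouts
- 14:04 - Pager fired
- [FILL IN] - Recovered after the cronjob was suspended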
Impact
- Checkout failures for customers using payment-svc (error rate peaked at 12%), lasting about [FILL IN] minutes
- No data loss
Root cause
- The new cronjob runs an unindexed full table scan every 30 seconds, consuming the connection pool
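The cron expression quoted in the v2.41 changelog, */30 * * * *, has its step in the minute field, and five-field cron has no seconds field. The small sketch below expands such a step to the minutes of each hour it matches; cron_minute_step is a hypothetical helper and handles only the simple */N form.

```shell
#!/bin/sh
# Hypothetical helper: list the minutes of each hour matched by a "*/N" step
# in the minute field of a five-field cron expression.
cron_minute_step() {
  step="$1"
  m=0
  out=""
  while [ "$m" -le 59 ]; do
    out="${out:+$out }$m"
    m=$((m + step))
  done
  printf '%s\n' "$out"
}

# Minutes matched by */30.
cron_minute_step 30
```

The schedule therefore fires at minutes 0 and 30 of each hour. The 30-second cadence in the slow query log would have to come from something other than the schedule itself, for example several instances or an in-process loop; which one applies is an open question, not a finding.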
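Detection
- The CS report (14:03) and the PaymentSvcErrorRateHigh page (14:04) surfaced the incident; the slow query log and connection metrics then pointed to the cause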
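Response
- Suspend the cronjob and raise the connection limit
What went well
- Monitoring alerts fired promptly
- Logs were clear
What went poorly
- No performance testing before the deploy
- The database was initially misjudged as healthy
Action items
- [FILL IN]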
Follow-ups
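- P0: Add an index on ledger_entries.status (owner: database administrators)
- P0: Review the payment-svc v2.41 changes and ensure the scheduled job has a connection pool limit (owner: payment service team)
- P1: Add an alert on database connection pool utilization (owner: observability team)
- P1: Add pre-deploy performance tests that simulate slow-query scenarios (owner: QA team)
- P2: Update the runbook with steps for handling connection pool exhaustion (owner: SRE team)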
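The connection pool utilization alert from the P1 follow-up reduces to a threshold on active connections over max_connections. The sketch below illustrates that check; the 80% threshold and both helper names are assumptions, and a real alert would live in the monitoring system rather than a shell script.

```shell
#!/bin/sh
# Hypothetical helper: integer connection utilization percentage.
conn_utilization_pct() {
  active="$1"; max="$2"
  echo $(( active * 100 / max ))
}

# Hypothetical check: exit status 0 when utilization is at or above the threshold.
# Arguments: active connections, max_connections, threshold percentage.
conn_alert_should_fire() {
  [ "$(conn_utilization_pct "$1" "$2")" -ge "$3" ]
}
```

With the incident's numbers, conn_alert_should_fire 500 500 80 && echo "ALERT" would fire, while 300 of 500 connections would stay below the assumed threshold.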
Similar past incidents
lexical match (pg_trgm)
- 77%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 76%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
payment-svc DB connection storm
p99 latency 4.8s, 12% 500s, checkouts failing