[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svccreated: 6/2/2026, 11:47:41 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500). Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary. Postgres metrics: - CPU: 35% (normal) - active_connections: 500 / 500 (max_connections) - waiting_queries: 87 - slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution) Deploy history: - 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)" - No infra/DB changes in past 7 days. On-call notes (Slack): - 14:03 — CS reports failed checkout volume rising - 14:04 — Pager: PaymentSvcErrorRateHigh - 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
支付服务 payment-svc 在 14:02 UTC 出现 p99 延迟从 120ms 飙升至 4.8s,错误率从 0.1% 升至 12%(HTTP 500),客户报告结账失败。根因是 v2.41 部署引入的定时任务(每30分钟扫描 ledger_entries 表)导致数据库连接池耗尽,查询因缺少索引执行全表扫描,阻塞其他请求。
Severity reasoning: 用户可见的结账失败(收入影响),错误率 12%,影响范围广(所有使用 payment-svc 的客户),且当前无自动恢复机制,需人工干预。
deepseek-chat·prompt v1·output: zh·11801ms·1769↑ / 1941↓ tok·$0.00261
Root cause hypotheses
- highv2.41 新增的定时任务(cron: */30 * * * *)执行全表扫描查询 SELECT * FROM ledger_entries WHERE status='pending',因 status 列无索引,扫描 1200 万行,耗时 2.4s,每30秒运行一次,导致连接长期占用。
Evidence: 慢查询日志显示该查询每30秒执行一次,耗时约2.4s;deploy 历史显示 13:50 部署 v2.41,新增该定时任务。
- high数据库连接池耗尽(max_connections=500),新连接被拒绝,导致 payment-svc 无法获取数据库连接,返回 500 错误。
Evidence: 应用日志显示 'FATAL: sorry, too many clients already' 和 'connection refused';Postgres 指标 active_connections=500/500。
- medium慢查询阻塞了其他正常查询,导致请求排队,进一步加剧延迟和错误。
Evidence: Postgres 指标 waiting_queries=87,表明大量查询在等待锁或资源。
- lowv2.41 部署本身存在内存泄漏或线程泄漏,导致连接池耗尽。
Evidence: 无直接证据,但部署后立即出现症状,需进一步排查应用指标。
Investigation checklist
- 确认数据库连接池是否耗尽
kubectl exec -n prod deployment/payment-svc -- pg_isready -h postgres-primary -U app -d paymentdbExpected: 返回 'connection refused' 或超时,确认连接池满
- 查看当前活跃连接和等待查询
psql -h postgres-primary -U postgres -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';" -c "SELECT count(*) FROM pg_stat_activity WHERE wait_event IS NOT NULL;"Expected: 活跃连接接近500,等待查询数>0
- 检查慢查询日志,确认全表扫描查询
psql -h postgres-primary -U postgres -c "SELECT query, calls, total_time, rows FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"Expected: 看到 SELECT * FROM ledger_entries WHERE status='pending' 总耗时最高
- 验证 ledger_entries 表 status 列是否有索引
psql -h postgres-primary -U postgres -c "\d ledger_entries"Expected: status 列没有索引(只有主键索引)
- 确认定时任务是否由 v2.41 引入
kubectl logs -n prod deployment/payment-svc --since=13:50 | grep -i 'settlement\|batch\|cron'Expected: 看到 'Starting settlement batch' 或类似日志,每30分钟出现
- 检查 payment-svc 连接池配置
kubectl exec -n prod deployment/payment-svc -- cat /app/config.yaml | grep -A5 'datasource\|hikari\|connection'Expected: 确认 max-pool-size 是否设置为500或更高
Mitigation plan
立即禁用 v2.41 引入的定时任务(通过配置或 feature flag),停止全表扫描查询。
Risk: 定时任务功能(结算批处理)暂时不可用,但无数据丢失风险。
Rollback: 重新启用定时任务(恢复配置或 flag)
临时增加数据库 max_connections 到 800(若资源允许),并重启 payment-svc 以释放连接。
Risk: 增加连接数可能消耗更多内存,需监控 DB 内存使用。
Rollback: 恢复 max_connections 为 500,并重启 DB 或 reload 配置
为 ledger_entries.status 列创建索引,加速查询。
Risk: 创建索引期间可能短暂锁表,影响写入。建议在低峰期或使用 CONCURRENTLY。
Rollback: 删除索引:DROP INDEX IF EXISTS idx_ledger_entries_status;
回滚 payment-svc 至 v2.40(上一个稳定版本)。
Risk: 回滚期间服务短暂不可用(滚动更新),但可快速恢复。
Rollback: 重新部署 v2.41(修复后)
Customer impact
客户在结账时遇到失败(HTTP 500),无法完成支付。影响所有使用 payment-svc 的用户,预计在实施缓解措施后 15-30 分钟内恢复。
Postmortem draft
Postmortem: payment-svc 高延迟与错误率
摘要
- 日期: YYYY-MM-DD
- 持续时间: 14:02 - 14:xx UTC
- 影响: 结账失败,错误率 12%
- 严重性: SEV1
时间线
- 13:50: 部署 payment-svc v2.41
- 14:02: 延迟飙升,错误率上升
- 14:04: 告警触发
- 14:06: 初步误判 DB 健康
- 14:xx: 发现慢查询和连接池耗尽
- 14:xx: 禁用定时任务,创建索引
- 14:xx: 服务恢复
影响
- 用户: 无法结账
- 指标: p99 4.8s, 错误率 12%
根因
- v2.41 引入定时任务,执行无索引的全表扫描,耗尽连接池
做得好的
- 快速识别慢查询
- 团队协作良好
做得不好的
- 初始误判 DB 健康
- 缺乏变更审查
行动项
- [ ] 为 ledger_entries.status 添加索引
- [ ] 增加连接池监控告警
- [ ] 代码审查流程要求性能评估
- [ ] 定时任务使用只读副本
Follow-ups
- P0为 ledger_entries.status 列创建索引(使用 CONCURRENTLY 避免锁表)— 数据库管理员
- P1增加数据库连接池使用率告警(>80%)— on-call SRE
- P1审查 v2.41 代码,确保定时任务使用只读副本或添加索引— 服务负责人
- P2更新部署流程,要求变更前进行性能影响评估— 平台团队
- P2为 payment-svc 添加慢查询检测和自动熔断机制— 服务负责人
- P2编写 postmortem 文档并在团队内分享— on-call SRE
Similar past incidents
lexical match (pg_trgm)
- 80%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 68%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 41%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts