[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:35:46 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
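The changelog's cron expression and the slow-query log describe different cadences, which is worth checking before the timeline is treated as settled. A minimal sketch, assuming standard five-field cron semantics (minute, hour, day of month, month, day of week); the helper `expand_minute_field` is hypothetical and handles only `*`, `*/N`, and a single number:

```python
def expand_minute_field(field: str) -> list[int]:
    """Expand a cron minute field into the minutes of the hour it matches.

    Handles only "*", "*/N", and a single number; real cron also supports
    ranges and lists, which this sketch deliberately leaves out.
    """
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    minute = int(field)
    if not 0 <= minute < 60:
        raise ValueError("minute out of range")
    return [minute]


# Expression quoted in the v2.41 changelog.
expression = "*/30 * * * *"
minutes = expand_minute_field(expression.split()[0])
runs_per_hour = len(minutes)
print(minutes, runs_per_hour)
```

Under that reading the job fires at minutes 0 and 30 of every hour: every 30 minutes, rather than every 30 seconds as the slow-query log shows or nightly as the changelog wording says. A first firing at 14:00 UTC would be consistent with the 14:02 onset, but the 30-second cadence in the log would then need another explanation (for example a scheduler that reads the first field as seconds); neither is confirmed by the data above.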
Summary
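At 14:02 UTC, payment-svc p99 latency rose from 120ms to 4.8s and the error rate rose from 0.1% to 12% (HTTP 500), with customers reporting failed checkouts. The root cause is the scheduled job introduced by the v2.41 deploy, which runs a full table scan of ledger_entries every 30 seconds and exhausted the available PostgreSQL connections (max_connections=500), so new connections were refused.
Severity reasoning: User-visible checkout failures (error rate above 1% for more than 5 minutes) and an interrupted revenue path meet the SEV1 definition.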
deepseek-chat·prompt v2·output: zh·9985ms·2150↑ / 1506↓ tok·$0.00224
Root cause hypotheses
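- high: The newly deployed scheduled job (*/30 * * * *) runs SELECT * FROM ledger_entries WHERE status='pending' as a full table scan (no index on status), taking about 2.4s per run, holding connections and blocking other queries.
Evidence: The slow-query log shows this query running every 30 seconds and scanning ~12M rows; the deploy time lines up with the onset (13:50 vs 14:02).
- high: PostgreSQL connections are exhausted (active_connections=500/500), so new connections are refused and payment-svc returns 500.
Evidence: Application logs show 'too many clients already' and 'connection refused'; monitoring shows the connection count at its limit.
- low: A database CPU or I/O bottleneck is slowing queries; unlikely, since CPU is only 35%.
Evidence: CPU is normal (35%), there are no I/O alerts, and the only slow queries come from the new job.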
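A quick occupancy estimate helps test the first hypothesis against the second. The sketch below applies Little's law (average concurrency = arrival rate × duration) to the figures from the slow-query log; the instance counts are hypothetical and only illustrate scale:

```python
def avg_connections_held(interval_s: float, duration_s: float, instances: int = 1) -> float:
    """Average number of connections busy with the query (Little's law)."""
    arrival_rate = instances / interval_s  # query starts per second
    return arrival_rate * duration_s


# Figures from the slow-query log: one run every 30 s, about 2.4 s each.
single = avg_connections_held(interval_s=30, duration_s=2.4)
print(f"1 instance: {single:.2f} connections on average")

# Hypothetical instance counts, to show how far this is from 500.
for replicas in (10, 100):
    held = avg_connections_held(30, 2.4, instances=replicas)
    print(f"{replicas} instances: {held:.1f} connections on average")
```

On these numbers the query by itself holds well under one connection on average, so it cannot fill 500 slots unless something amplifies it: connections not being released after each run, many workers running the job concurrently, or checkout queries slowing down and holding their own connections longer. The investigation steps should establish which of these applies.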
Investigation checklist
- Confirm whether connections are exhausted
kubectl exec -n prod deployment/postgres-primary -- psql -U postgres -c "SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend';"
Expected: a count at or near 500 (max_connections); idle sessions count toward the limit too
- Inspect the slow queries
kubectl exec -n prod deployment/postgres-primary -- psql -U postgres -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;"
Expected: long-running SELECT * FROM ledger_entries WHERE status='pending' at the top
- Check the newly deployed scheduled job
kubectl logs -n prod -l app=payment-svc --since=30m | grep -i 'cron\|settlement\|batch' | head -20
Expected: log lines from the scheduled job introduced in v2.41
- Verify the missing index
kubectl exec -n prod deployment/postgres-primary -- psql -U postgres -c "\d+ ledger_entries" | grep -i 'index'
Expected: no index on the status column
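Once `pg_stat_activity` output is available, grouping it by state and query shows what is actually holding the connections. A minimal sketch, assuming the rows have been exported as `(state, query)` pairs; the sample rows are invented for illustration and are not taken from the incident:

```python
from collections import Counter


def top_connection_holders(rows, limit=3):
    """Count connections per (state, query) and return the largest groups."""
    counts = Counter((state, query.strip()) for state, query in rows)
    return counts.most_common(limit)


# Invented sample rows; real input would come from pg_stat_activity.
sample_rows = (
    [("idle in transaction", "SELECT * FROM ledger_entries WHERE status='pending'")] * 5
    + [("active", "INSERT INTO payments ...")] * 3
    + [("idle", "COMMIT")] * 2
)

for (state, query), n in top_connection_holders(sample_rows):
    print(f"{n:>4}  {state:<20}  {query}")
```

A large `idle in transaction` group would point at connections that are held without doing work, which a filter on `state != 'idle'` alone does not separate from genuinely active queries.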
Mitigation plan
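Cancel the blocking queries and disable the scheduled job immediately: use pg_cancel_backend to cancel the slow queries, and remove the cron expression through configuration management.
Risk: Cancelling a query rolls back its transaction, with no risk of data corruption; disabling the job pauses settlement batch processing, which will need to be run manually afterwards.
Rollback: Re-enable the scheduled job (restore the cron configuration) and redeploy.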
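The cancellation step can be scripted so that only the batch query's backends are targeted. A sketch, assuming rows of `(pid, duration_seconds, query)` exported from `pg_stat_activity`; the helper name, the one-second threshold, and the sample PIDs are hypothetical:

```python
def cancel_statements(rows, needle, min_duration_s=1.0):
    """Build pg_cancel_backend calls for long-running queries containing `needle`."""
    statements = []
    for pid, duration_s, query in rows:
        if needle in query and duration_s >= min_duration_s:
            # int() keeps anything but a numeric PID out of the generated SQL.
            statements.append(f"SELECT pg_cancel_backend({int(pid)});")
    return statements


# Hypothetical sample rows for illustration.
rows = [
    (4101, 2.3, "SELECT * FROM ledger_entries WHERE status='pending'"),
    (4102, 0.01, "SELECT 1"),
    (4103, 0.2, "SELECT * FROM ledger_entries WHERE status='pending'"),
]

for stmt in cancel_statements(rows, needle="FROM ledger_entries WHERE status="):
    print(stmt)
```

`pg_cancel_backend` cancels the running query and leaves the session connected; `pg_terminate_backend` closes the session and is the fallback if cancelled sessions keep occupying their slots.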
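Temporarily raise max_connections to 1000 and restart PostgreSQL (rolling restart: standby first, then primary).
Risk: More connections can consume more memory, so memory usage needs monitoring; restarting the primary causes a brief outage (seconds).
Rollback: Set max_connections back to 500 and restart (primary first, then standby, because a standby needs max_connections at least as high as the primary).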
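Create an index on ledger_entries.status to speed up the query.
Risk: A plain CREATE INDEX takes a SHARE lock on the table, which blocks writes (but not reads) for the duration of the build; use CREATE INDEX CONCURRENTLY to avoid blocking writes.
Rollback: Drop the index: DROP INDEX CONCURRENTLY IF EXISTS idx_ledger_entries_status;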
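The index change and its rollback can be generated as a pair so the two stay in sync. A sketch that only builds the SQL text and does not connect to a database; the index name follows the `idx_ledger_entries_status` name used in the rollback step, and the helper itself is hypothetical:

```python
import re

# Conservative pattern for unquoted lower-case identifiers.
_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def index_ddl(table: str, column: str) -> tuple[str, str]:
    """Return (create, drop) statements for a single-column index."""
    for name in (table, column):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    index = f"idx_{table}_{column}"
    create = f"CREATE INDEX CONCURRENTLY {index} ON {table} ({column});"
    drop = f"DROP INDEX CONCURRENTLY IF EXISTS {index};"
    return create, drop


create_sql, drop_sql = index_ddl("ledger_entries", "status")
print(create_sql)
print(drop_sql)
```

Both statements have to run outside a transaction block, and a concurrent build that fails leaves an invalid index behind that must be dropped before retrying.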
Customer impact
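Customers cannot complete checkout and receive 500 errors. Scope: every customer whose checkout goes through payment-svc is exposed, potentially 100% of active users, with about 12% of requests currently failing. No ETA yet; recovery is in progress.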
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
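- 13:50 - payment-svc v2.41 deployed, introducing the scheduled job
- 14:02 - p99 latency spikes, error rate rises
- 14:03 - CS reports failed checkouts
- 14:04 - Pager alert fires
- 14:06 - Database initially misjudged as healthy
- [FILL IN] - Root cause confirmed and mitigated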
Impact
- Error rate of 12% for about [FILL IN] minutes
- All checkout users affected
Root Cause
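The newly deployed scheduled job runs a full table scan of ledger_entries (no index on status), exhausting the available database connections.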
Detection
Pager alert and CS reports.
Response
[FILL IN]
What Went Well
[FILL IN]
What Went Poorly
- SQL performance was not reviewed before the deploy
- The database was initially misjudged as healthy
Action Items
- [FILL IN] Add an index on ledger_entries.status
- [FILL IN] Add connection usage monitoring and alerting
- [FILL IN] Pre-deploy SQL review process
Follow-ups
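- P0: Add an index on ledger_entries.status (Database team)
- P1: Add a connection usage alert (>80%) (Observability team)
- P1: Pre-deploy SQL review process with slow-query detection (Platform Engineering team)
- P2: Add timeouts and connection limits to scheduled jobs (Payment Service team)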
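The connection usage alert in the P1 item reduces to a ratio check. A minimal sketch of the rule, assuming the 80% threshold from the follow-up and the active_connections / max_connections metrics named in the incident context; the function name is hypothetical:

```python
def pool_usage_alert(active: int, max_connections: int, threshold: float = 0.80) -> bool:
    """Return True when connection usage is strictly above the threshold."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections > threshold


print(pool_usage_alert(500, 500))  # state during the incident
print(pool_usage_alert(400, 500))  # exactly 80%, not above it
print(pool_usage_alert(401, 500))  # just past the threshold
```

With max_connections at 500, this rule fires once usage passes 400 connections, which leaves some headroom before new connections start being refused.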
Similar past incidents
lexical match (pg_trgm)
- 92%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 86%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 50%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 49%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 47%
payment-svc DB connection storm
p99 latency 4.8s, 12% 500s, checkouts failing