[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 5/25/2026, 10:19:33 PM
Raw incident context
Time: 14:02 UTC.
payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
payment-svc p99 latency spiked from 120ms to 4.8s, with a 12% error rate (mostly HTTP 500), after PostgreSQL ran out of connections (500 / 500). The leading suspect is the settlement batch added in v2.41, which the slow query log shows running a full table scan on ledger_entries every 30s. With every connection taken, legitimate checkout queries cannot reach the database.
Severity reasoning: This is a user-facing outage with revenue impact, because customers cannot complete checkouts. The 12% error rate and high latency affect a large subset of users. The impact is reversible by rolling back the deployment or killing the cron job.
deepseek-chat·prompt v1·output: en·11686ms
Root cause hypotheses
- high: New cron job in payment-svc v2.41 runs a full table scan on ledger_entries every 30s, exhausting DB connections.
Evidence: The 13:50 UTC deploy added a "nightly settlement batch" with cron */30 * * * *; the slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index, ~2.4s per execution) running every 30s; active_connections is at max_connections (500). Caveat: in standard five-field cron syntax, */30 * * * * fires every 30 minutes, which matches neither "nightly" nor the observed 30s cadence, so the job's actual schedule should be confirmed.
- medium: Connection pool leak in payment-svc v2.41 causes connections not to be released.
Evidence: active_connections is at max_connections (500) with 87 waiting queries, which suggests connections are being held, while CPU is low (35%), so the database is not CPU-bound. A pool misconfiguration could produce the same picture.
- low: External traffic spike overwhelms payment-svc, causing connection buildup.
Evidence: Nothing in the context indicates a traffic spike, and the latency and error increase followed the 13:50 deploy by about 12 minutes. Unlikely.
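The cron expression quoted in the evidence can be checked mechanically. The sketch below expands a cron minute field by hand, assuming standard five-field cron semantics. `expand_minute_field` and `runs_per_day` are hypothetical helpers written for this illustration, and they handle only `*`, `*/n`, and a single number. Some in-app schedulers accept a leading seconds field, so the v2.41 code would still need to be read to know which dialect applies.

```python
def expand_minute_field(field: str) -> list[int]:
    """Expand a cron minute field into the minutes (0-59) at which it fires.

    Supports only '*', '*/n', and a single integer. Real cron also allows
    ranges and lists, which this sketch deliberately omits.
    """
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    minute = int(field)
    if not 0 <= minute <= 59:
        raise ValueError("minute out of range")
    return [minute]


def runs_per_day(expr: str) -> int:
    """Count daily runs for a five-field expression whose other fields are '*'."""
    minute, hour, dom, month, dow = expr.split()
    if (hour, dom, month, dow) != ("*", "*", "*", "*"):
        raise ValueError("sketch only handles '*' in the non-minute fields")
    return len(expand_minute_field(minute)) * 24


print(expand_minute_field("*/30"))   # [0, 30]
print(runs_per_day("*/30 * * * *"))  # 48
```

Under these assumptions the job would run 48 times a day, at minute 0 and minute 30 of every hour, rather than nightly or every 30 seconds.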
Investigation checklist
- Check if the cron job is still running and consuming connections.
kubectl exec -n production deploy/payment-svc -- ps aux | grep settlement
Expected: If present, shows a process running the settlement batch; if absent, no cron process.
- Identify which queries are holding connections open.
psql -h postgres-primary -c "SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start;"
Expected: Multiple connections running the slow query (SELECT * FROM ledger_entries WHERE status='pending') for long durations. Note: with connections at the limit, this psql session may itself be refused unless it connects as a superuser, for whom PostgreSQL reserves a few slots (superuser_reserved_connections).
- Confirm the slow query is the one from the cron job.
kubectl logs -n production deploy/payment-svc --tail=100 | grep -i settlement
Expected: Log lines showing the cron job executing the query.
- Check if rolling back the deployment resolves the issue.
kubectl rollout undo -n production deploy/payment-svc
Expected: Deployment reverts to previous version; connections should drop and latency recover within minutes.
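The pg_stat_activity output from the second check is easier to read once grouped. The sketch below counts non-idle connections per state and query and reports the age of the oldest one. The rows are made-up sample data shaped like the columns selected in the psql command, not output from this incident, and `summarize_activity` and `oldest_query_age_seconds` are hypothetical helpers. Many `active` rows would point at the slow query itself, while many `idle in transaction` rows would fit the connection-leak hypothesis better.

```python
from collections import Counter
from datetime import datetime, timezone

SLOW_QUERY = "SELECT * FROM ledger_entries WHERE status='pending'"


def summarize_activity(rows):
    """Count non-idle connections per (state, query), most common first."""
    counts = Counter(
        (row["state"], row["query"]) for row in rows if row["state"] != "idle"
    )
    return counts.most_common()


def oldest_query_age_seconds(rows, now):
    """Age in seconds of the oldest non-idle query, or 0.0 if there is none."""
    starts = [row["query_start"] for row in rows if row["state"] != "idle"]
    return (now - min(starts)).total_seconds() if starts else 0.0


def at(hour, minute, second):
    # Illustrative timestamps only; the date is arbitrary.
    return datetime(2026, 1, 1, hour, minute, second, tzinfo=timezone.utc)


# Made-up sample rows (assumption), mirroring pid, state, query_start, query.
sample_rows = [
    {"pid": 101, "state": "active", "query_start": at(14, 5, 58), "query": SLOW_QUERY},
    {"pid": 102, "state": "idle in transaction", "query_start": at(14, 2, 10), "query": SLOW_QUERY},
    {"pid": 103, "state": "idle in transaction", "query_start": at(14, 2, 40), "query": SLOW_QUERY},
    {"pid": 104, "state": "idle", "query_start": at(14, 5, 0), "query": "SELECT 1"},
]

for (state, query), count in summarize_activity(sample_rows):
    print(count, state, query)
print(oldest_query_age_seconds(sample_rows, now=at(14, 6, 0)))  # 230.0
```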
Mitigation plan
- Immediately kill the cron job process to free connections.
Risk: The settlement batch will not complete and may need to be re-run later. No data loss is expected.
Rollback: Re-run the cron job manually after fixing the query.
- Roll back payment-svc to the previous version (v2.40).
Risk: The rollback may cause brief additional errors during deployment. Otherwise safe, as it reverts to a known good state.
Rollback: Re-deploy v2.41 after fixing the issue.
- If the rollback cannot happen immediately, add an index on ledger_entries.status to speed up the query.
Risk: A plain CREATE INDEX blocks writes to ledger_entries for the whole build, which on ~12M rows is unlikely to be brief. Use CREATE INDEX CONCURRENTLY, which does not block writes but takes longer and itself needs a free connection. The index only helps if pending rows are a small share of the table.
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
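If the index route is taken, the statements might look like the ones built below. This is a sketch, not a reviewed migration: the index name comes from the rollback step, the partial-index variant is an assumption that only pays off when pending rows are a small fraction of ledger_entries, and `build_index_statements` is a hypothetical helper that only produces SQL strings without touching a database.

```python
INDEX_NAME = "idx_ledger_entries_status"  # name taken from the rollback step


def build_index_statements(partial: bool = False) -> dict[str, str]:
    """Return the create and rollback SQL for the status index.

    partial=True restricts the index to pending rows (an assumption, see
    the lead-in); partial=False indexes every status value.
    """
    predicate = " WHERE status = 'pending'" if partial else ""
    return {
        # CONCURRENTLY avoids blocking writes during the build, but it cannot
        # run inside a transaction block, so the client must use autocommit.
        "create": (
            f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {INDEX_NAME} "
            f"ON ledger_entries (status){predicate};"
        ),
        # A failed concurrent build leaves an INVALID index behind; dropping
        # it with this statement is required before retrying.
        "rollback": f"DROP INDEX CONCURRENTLY IF EXISTS {INDEX_NAME};",
    }


for name, sql in build_index_statements().items():
    print(f"{name}: {sql}")
```

Running EXPLAIN on the settlement query after the build would show whether the planner actually uses the index.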
Customer impact
Customers are experiencing failed checkouts and errors when trying to complete payments. Estimated recovery time: within 15 minutes after mitigation (rollback or killing cron).
Postmortem draft
Postmortem: Payment-svc Latency Spike and Connection Exhaustion
Date: YYYY-MM-DD
Severity: SEV1
Summary: A new cron job in payment-svc v2.41 executed a full table scan on ledger_entries every 30s, exhausting PostgreSQL connections and causing a 12% error rate and 4.8s p99 latency.
Timeline
- 13:50 UTC - Deploy v2.41 with settlement cron
- 14:02 UTC - Latency spike begins
- 14:04 UTC - Pager triggered
- 14:06 UTC - Initial misdiagnosis (DB CPU fine)
- [Mitigation time] - Rollback executed / cron killed
- [Recovery time] - Service restored
Impact
- Error rate reached 12% (mostly HTTP 500), up from 0.1%
- p99 latency 4.8s (up from 120ms)
- Estimated affected users: [number]
Root Cause
- New cron job ran unindexed query SELECT * FROM ledger_entries WHERE status='pending' every 30s
- Query took ~2.4s and held connections, exhausting max_connections (500)
- Legitimate checkout queries blocked, causing timeouts and errors
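A quick sanity check on the root cause uses Little's law (mean concurrency equals arrival rate times mean duration). The sketch below assumes each run of the query holds exactly one connection for the ~2.4s it takes and that one job instance starts a run every 30s. Both are assumptions, since the source does not give the number of job instances or say how connections are acquired. `avg_connections_in_use` is a hypothetical helper.

```python
def avg_connections_in_use(interval_seconds: float, seconds_per_run: float) -> float:
    """Little's law: mean concurrency = (1 / interval) * mean time per run."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return seconds_per_run / interval_seconds


MAX_CONNECTIONS = 500  # from the Postgres metrics

per_instance = avg_connections_in_use(interval_seconds=30, seconds_per_run=2.4)
print(round(per_instance, 2))                    # 0.08
print(round(MAX_CONNECTIONS / per_instance))     # 6250
```

Under these assumptions a single job instance occupies about 0.08 connections on average, so the query cadence alone does not account for 500 held connections. Something else would have to multiply it, for example many workers running the job, runs that fan out, or connections that are not released after the query. The pg_stat_activity check is what would distinguish these.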
What Went Well
- Monitoring paged on-call within about 2 minutes of the latency spike (14:02 to 14:04)
- Rollback was straightforward
What Went Poorly
- No load testing or query review for new cron job
- Initial misdiagnosis delayed mitigation
- Missing index on status column
Action Items
- [ ] Add index on ledger_entries.status (P0)
- [ ] Add connection pool monitoring alert (P1)
- [ ] Require query review for any new cron job (P1)
- [ ] Improve runbook for connection exhaustion (P2)
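The connection alert in the action items could classify usage against max_connections along these lines. The 80% and 95% thresholds are placeholders chosen for illustration, not values from the source, and `connection_alert_level` is a hypothetical helper rather than part of any monitoring product.

```python
def connection_alert_level(
    active: int,
    max_connections: int,
    warn_ratio: float = 0.80,  # assumed threshold, tune per environment
    page_ratio: float = 0.95,  # assumed threshold, tune per environment
) -> str:
    """Map connection usage to 'ok', 'warn', or 'page'."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage = active / max_connections
    if usage >= page_ratio:
        return "page"
    if usage >= warn_ratio:
        return "warn"
    return "ok"


print(connection_alert_level(500, 500))  # page
print(connection_alert_level(420, 500))  # warn
print(connection_alert_level(100, 500))  # ok
```

An alert of this shape would have fired on the 500 / 500 reading directly, instead of relying on the downstream error-rate pager while CPU still looked normal.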
Follow-ups
- P0: Add index on ledger_entries.status to prevent full table scan. (service owner)
- P1: Set up alert on active_connections approaching max_connections. (platform team)
- P1: Implement connection pool limits per service to prevent one service from exhausting all connections. (platform team)
- P1: Add pre-deploy query performance review for new cron jobs. (service owner)
- P2: Update runbook for PostgreSQL connection exhaustion with steps to kill blocking queries. (on-call SRE)
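One way to approach the per-service connection limit is a client-side budget that a batch job must acquire before opening a database connection. The sketch below uses a bounded semaphore for this. `ConnectionBudget` is a hypothetical class, the limit of 2 is arbitrary, and a real service would more likely set this through its connection pool's maximum size.

```python
import threading
from contextlib import contextmanager


class ConnectionBudget:
    """Caps how many connections one component may hold at once."""

    def __init__(self, limit: int):
        if limit <= 0:
            raise ValueError("limit must be positive")
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def slot(self, timeout: float = 1.0):
        # Fail fast instead of queueing forever when the budget is spent.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("connection budget exhausted")
        try:
            yield
        finally:
            self._slots.release()


budget = ConnectionBudget(limit=2)  # arbitrary illustrative limit
with budget.slot(), budget.slot():
    try:
        with budget.slot(timeout=0.01):
            pass
    except TimeoutError as exc:
        print(exc)  # connection budget exhausted
```

A client-side cap only protects against well-behaved callers. PostgreSQL can also enforce a ceiling on the server side with a per-role CONNECTION LIMIT, which would hold even if a service's pool is misconfigured.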
Similar past incidents
lexical match (pg_trgm)
- 77% [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 73% [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 50% [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45% [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 38% [datadog] checkout p99 8s
checkout p99 8s