[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:46:55 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
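payment-svc p99 latency spiked from 120ms to 4.8s with a 12% error rate (mostly HTTP 500) because Postgres reached its max_connections limit (500/500). The most likely cause is the settlement batch added in v2.41, which runs an unindexed full table scan on ledger_entries (~12M rows, ~2.4s per execution) and leaves no connections for legitimate checkout queries. The slow query log shows the scan every 30s, whereas the changelog's cron expression */30 * * * * fires every 30 minutes and the job is described as "nightly"; the intended schedule still needs to be confirmed.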
Severity reasoning: User-facing outage: customers cannot complete checkouts (revenue impact). Scope is all payment-svc traffic. Reversibility: can be mitigated by rolling back the deploy or killing the cron job.
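The changelog quotes the schedule as `*/30 * * * *`. As a sanity check on what that expression means, here is a minimal sketch that expands the minute field of a five-field cron expression. The function names are illustrative and not part of payment-svc, and the sketch handles only `*`, `*/n`, single values, ranges, and comma lists.

```python
def expand_minute_field(field):
    """Return the sorted minutes (0-59) matched by a cron minute field.

    Supports '*', '*/n', 'a', 'a-b', 'a-b/n' and comma-separated lists of those.
    """
    minutes = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if base == "*":
            low, high = 0, 59
        elif "-" in base:
            low, high = (int(x) for x in base.split("-"))
        else:
            low = high = int(base)
        minutes.update(range(low, high + 1, step_n))
    return sorted(minutes)


def runs_per_day(expr):
    """Count daily runs, assuming every field except the minute is '*'."""
    minute, hour, day_of_month, month, day_of_week = expr.split()
    if (hour, day_of_month, month, day_of_week) != ("*", "*", "*", "*"):
        raise ValueError("sketch only handles schedules that vary the minute field")
    return len(expand_minute_field(minute)) * 24


print(expand_minute_field("*/30"))   # [0, 30]
print(runs_per_day("*/30 * * * *"))  # 48
```

The expression therefore fires at minute 0 and minute 30 of every hour. That is every 30 minutes, which matches neither "nightly" nor the 30s cadence in the slow query log (standard five-field cron has no seconds field).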
Root cause hypotheses
- high: New settlement batch in payment-svc v2.41 runs a full table scan on ledger_entries (seen every 30s in the slow query log), consuming DB connections until Postgres reaches max_connections.
Evidence: Deploy at 13:50 UTC added "nightly settlement batch (cron: */30 * * * *)". Slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index, ~2.4s per execution). Active connections at 500/500.
- medium: Connection leak in payment-svc v2.41, so connections are not returned to the pool.
Evidence: Active connections are at the maximum with 87 queries waiting, and the new release is the only recent change. There is no direct evidence of a leak yet.
- low: External traffic spike overwhelming payment-svc, leading to connection pool exhaustion.
Evidence: None in favour. No traffic increase was reported, and the error rate correlates with the deploy time.
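A back-of-envelope check can help separate the first two hypotheses. The sketch below applies Little's law (mean busy connections = arrival rate × hold time) to the figures in the incident context. The helper names and the `baseline` parameter are assumptions for illustration, since the pre-incident connection count is not recorded.

```python
def mean_busy_connections(interval_s, hold_s, conns_per_run=1):
    """Little's law: average connections held by a job that starts every
    interval_s seconds and holds conns_per_run connections for hold_s seconds."""
    return conns_per_run * hold_s / interval_s


def retained_per_run_to_saturate(max_conns, baseline, elapsed_s, interval_s):
    """Connections each run would have to keep open (never release) for the
    pool to fill within elapsed_s seconds, starting from `baseline`."""
    runs = elapsed_s // interval_s
    return (max_conns - baseline) / runs


# Figures from the incident context: scan every 30s, ~2.4s per execution.
print(round(mean_busy_connections(30, 2.4), 2))  # 0.08

# Deploy at 13:50, spike at 14:02 -> 720s, i.e. 24 runs at a 30s cadence.
# baseline=0 is a placeholder; the real pre-incident count is unknown.
print(round(retained_per_run_to_saturate(500, 0, 720, 30), 1))  # 20.8
```

If each run used one connection and released it after ~2.4s, the scan would occupy well under one connection on average, which does not by itself explain 500/500. The numbers fit better if each run opens many connections or fails to release them, so the connection-leak hypothesis deserves a check alongside the scan. This is an inference from the reported figures, not something the logs confirm.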
Investigation checklist
- Check if the new cron job is running and consuming connections.
kubectl exec -n production deploy/payment-svc -- ps aux | grep settlement
Expected: If cron is running, you'll see a process matching the settlement batch.
- Identify the specific queries consuming connections on Postgres.
psql -h postgres-primary -c "SELECT pid, state, query_start, query FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;"
Expected: Multiple active queries showing SELECT * FROM ledger_entries WHERE status='pending'.
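- Terminate the slow queries to free connections. This changes state, so run it only after step 2 confirms the sessions belong to the settlement scan.
psql -h postgres-primary -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE 'SELECT * FROM ledger_entries WHERE status%' AND state = 'active' AND pid <> pg_backend_pid();"
Expected: Terminates the slow queries, freeing connections. The pattern is anchored to the logged scan so other queries on ledger_entries are left alone, and pg_backend_pid() excludes the session issuing the command. Confirm by re-running step 2.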
- Verify the deploy version and rollback option.
kubectl -n production rollout history deploy/payment-svc
Expected: Shows revision history; revision 2 (v2.41) is current.
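Step 2 filters on state = 'active', which hides sessions that hold a connection without running anything. In pg_stat_activity, the query column of a non-active session shows the last statement it ran, so grouping by state can reveal connections left open after the scan. The sketch below is a pure-Python grouping over (state, query) rows. The function name is illustrative, and the sample rows are hypothetical, shown only to illustrate the shape of the output.

```python
from collections import Counter


def summarize_sessions(rows):
    """Count sessions by (state, kind) from (state, query) rows, e.g. the
    result of: SELECT state, query FROM pg_stat_activity;"""
    counts = Counter()
    for state, query in rows:
        text = (query or "").lower()
        is_scan = "from ledger_entries" in text and "status" in text
        kind = "settlement scan" if is_scan else "other"
        counts[(state or "unknown", kind)] += 1
    return counts


# Hypothetical rows, for illustration only (not taken from the incident).
sample_rows = [
    ("active", "SELECT * FROM ledger_entries WHERE status='pending'"),
    ("idle in transaction", "SELECT * FROM ledger_entries WHERE status='pending'"),
    ("idle in transaction", "SELECT * FROM ledger_entries WHERE status='pending'"),
    ("idle", "SELECT 1"),
]

summary = summarize_sessions(sample_rows)
for (state, kind), n in summary.most_common():
    print(f"{n:4d}  {state:<22}  {kind}")
```

Many idle in transaction rows whose last query is the settlement scan would point towards the connection-leak hypothesis. Mostly active rows would point towards the scan itself.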
Mitigation plan
Immediately kill the slow queries on Postgres to free connections.
Risk: May cause incomplete settlement batch, but no data loss as it's a read-only query.
Rollback: If needed, the cron job will restart the query on next schedule.
Roll back payment-svc to previous version (v2.40) to remove the cron job.
Risk: Rollback may cause brief downtime during pod restart. Safe as it's a stateless service.
Rollback: Re-deploy v2.41 after fixing the issue.
Add an index on ledger_entries.status to prevent full table scans.
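Risk: A plain CREATE INDEX blocks writes to ledger_entries until the build finishes, which will not be brief on ~12M rows. Use CREATE INDEX CONCURRENTLY instead; it does not block writes but takes longer and cannot run inside a transaction block.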
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
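To illustrate why the index matters, the sketch below reproduces the query shape in an in-memory SQLite database and compares the query plan before and after adding the index. SQLite stands in for Postgres only so the example runs anywhere. Its planner and plan output differ from Postgres, and every column other than status is a made-up placeholder. On the real database the equivalent check would be EXPLAIN against postgres-primary.

```python
import sqlite3


def query_plan(conn, sql):
    """Return SQLite's plan for sql as one string (detail is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the table name and `status` come from the incident.
conn.execute(
    "CREATE TABLE ledger_entries ("
    "id INTEGER PRIMARY KEY, status TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO ledger_entries (status, amount_cents) VALUES (?, ?)",
    [("pending" if i % 100 == 0 else "settled", i) for i in range(10_000)],
)

SCAN_SQL = "SELECT * FROM ledger_entries WHERE status = 'pending'"

plan_before = query_plan(conn, SCAN_SQL)
conn.execute("CREATE INDEX idx_ledger_entries_status ON ledger_entries (status)")
plan_after = query_plan(conn, SCAN_SQL)

print("before:", plan_before)  # a full SCAN of ledger_entries
print("after: ", plan_after)   # a SEARCH using idx_ledger_entries_status
```

In Postgres, whether the planner uses the index depends on how selective status='pending' is. If pending rows are a small fraction of the table, a partial index restricted to pending rows is another option worth evaluating.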
Customer impact
Customers are experiencing failed checkouts with HTTP 500 errors. Approximately 12% of checkout attempts are failing. No ETA yet; mitigation in progress.
Postmortem draft
Postmortem: Payment-svc Latency Spike and Connection Pool Exhaustion
Summary
On [date] at 14:02 UTC, payment-svc p99 latency increased from 120ms to 4.8s and the error rate rose from 0.1% to 12%. Root cause was the new settlement batch in v2.41, which ran a full table scan on ledger_entries (observed every 30s in the slow query log) and exhausted Postgres connections (500/500).
Timeline
- 13:50 UTC - Deploy v2.41 with new cron job
- 14:02 UTC - Latency spike detected
- 14:03 UTC - CS reports failed checkout volume rising
- 14:04 UTC - Pager triggered
- 14:06 UTC - Initial misdiagnosis (DB CPU fine)
- [Add mitigation times]
Impact
- ~12% of payment-svc requests failed (mostly HTTP 500), surfacing as failed checkouts
- Revenue loss estimated at $X
Root Cause
New cron job executing SELECT * FROM ledger_entries WHERE status='pending' without index, causing full table scan and connection pool exhaustion.
What Went Well
- Monitoring detected the issue quickly
- Rollback to v2.40 was available as a mitigation [confirm once mitigation is complete]
What Went Poorly
- No index on status column
- Cron job not tested under load
- Initial misdiagnosis delayed mitigation
Action Items
- [ ] Add index on ledger_entries.status
- [ ] Add connection pool monitoring alert
- [ ] Add load testing for new cron jobs
- [ ] Review deployment process for cron jobs
Follow-ups
- P0: Add index on ledger_entries.status column (owner: DB admin)
- P1: Add monitoring alert for Postgres connection pool usage >80% (owner: platform team)
- P0: Review and fix the cron job in payment-svc to use index or batch processing (owner: service owner)
- P1: Add load testing for new cron jobs before deployment (owner: QA team)
- P2: Update deployment checklist to include DB query review for new features (owner: platform team)
Similar past incidents
lexical match (pg_trgm)
- 94%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 76%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 71%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 47%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 42%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts