[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:47:11 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC: payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03: CS reports failed checkout volume rising
- 14:04: Pager: PaymentSvcErrorRateHigh
- 14:06: "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment-svc p99 latency spiked from 120ms to 4.8s with 12% HTTP 500 errors due to PostgreSQL connection exhaustion. A new cron job in v2.41 runs a full table scan on ledger_entries every 30s, consuming all 500 connections and blocking legitimate checkout queries.
Severity reasoning: User-facing outage: customers cannot complete checkouts (failed transactions), revenue impact, and high error rate. Scope is all users hitting payment-svc. Reversibility: rolling back the deploy and killing the cron job can restore service, but data integrity of pending entries must be verified.
Root cause hypotheses
- high: New cron job in v2.41 runs a full table scan on ledger_entries every 30s, consuming all DB connections and causing connection exhaustion.
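Evidence: Deploy at 13:50 UTC added 'nightly settlement batch (cron: */30 * * * *)'. Slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index, 2.4s per execution). Active connections at 500/500. Note that the cron expression */30 * * * * fires every 30 minutes, which matches neither "nightly" in the changelog nor the 30s interval in the slow query log, so the job's actual schedule should be confirmed.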
- low: A sudden traffic spike from external clients overwhelmed the DB connection pool.
Evidence: No evidence of traffic spike in provided context; error rate and latency correlate with deploy time, not gradual increase.
- medium: A deadlock or long-running transaction held connections, causing new queries to queue and time out.
Evidence: No deadlock logs mentioned; waiting_queries=87 suggests queuing, but root cause is connection exhaustion from slow query.
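A rough sanity check of the first hypothesis, using Little's law (average concurrency = query duration / interval between starts) and the figures from the raw context. The function name is illustrative, and the calculation assumes each run uses one connection and releases it when the query finishes. Under that assumption the scan alone holds far less than one connection on average, so reaching 500/500 probably involves something more, for example connections that are not released or checkout queries slowing down and piling up behind the scan. This is a sketch, not a measurement.

```shell
# avg_connections INTERVAL_S DURATION_S
# Little's law: average number of connections held by a job that starts a
# DURATION_S-long query every INTERVAL_S seconds and then releases the connection.
avg_connections() {
  awk -v interval="$1" -v duration="$2" 'BEGIN { printf "%.4f\n", duration / interval }'
}

avg_connections 30 2.4     # every 30s, as reported by the slow query log
avg_connections 1800 2.4   # every 30 minutes, as the cron expression */30 * * * * implies
```

Either reading of the schedule gives a value well below 1, which is why the per-user connection counts in the checklist below are worth collecting before settling on a root cause.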
Investigation checklist
- Confirm the cron job process is running and consuming connections.
kubectl exec -n production deploy/payment-svc -- ps aux | grep -iE 'cron|settlement|batch'
Expected: A process matching the cron schedule (e.g., 'settlement' or 'batch') should be visible.
- Check the number of active connections per user/application in PostgreSQL.
psql -h postgres-primary -c "SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename ORDER BY count DESC;"
Expected: payment-svc user likely has many connections (near 500).
- Identify the slow query and its execution plan.
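psql -h postgres-primary -c "EXPLAIN ANALYZE SELECT * FROM ledger_entries WHERE status='pending';"
Expected: Seq Scan on ledger_entries (cost=... rows=... actual time=...). No index used.
Note: EXPLAIN ANALYZE executes the query, so it adds another ~2.4s full table scan to an already saturated database. Plain EXPLAIN shows the plan (without actual time) and does not run the query.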
- Check if the cron job is the only source of the slow query.
grep -r 'ledger_entries' /etc/cron.d/ /var/spool/cron/ 2>/dev/null || kubectl exec -n production deploy/payment-svc -- cat /proc/1/cmdline | tr '\0' ' '
Expected: Cron entry or command line referencing settlement batch.
- Verify the deploy manifest for v2.41 to confirm the cron addition.
kubectl get deploy payment-svc -n production -o yaml | grep -A5 'cron'
Expected: A cron schedule definition (e.g., '*/30 * * * *') in the container args or config.
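The per-user connection count from the checklist can be reduced to "who holds most of the pool" with a small filter. The sketch below assumes the counts arrive as user|count lines, which is what psql prints with -A (unaligned) and -t (tuples only). The function name and the role names in the demo input are hypothetical.

```shell
# flag_heavy_users MAX_CONNECTIONS THRESHOLD_PCT
# Reads "user|count" lines on stdin and prints every user holding at least
# THRESHOLD_PCT percent of MAX_CONNECTIONS.
flag_heavy_users() {
  awk -F'|' -v max="$1" -v pct="$2" '$2 * 100 >= max * pct { print $1 }'
}

# Demo with made-up counts; in practice the input would come from:
#   psql -h postgres-primary -At -c \
#     "SELECT usename, count(*) FROM pg_stat_activity GROUP BY usename"
printf 'payment_svc|480\nreporting|12\n' | flag_heavy_users 500 80
```

If a single role shows up here, the next step is to split its connections by state and query text to see whether they belong to the batch job or to checkout traffic waiting behind it.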
Mitigation plan
Immediately kill the cron job process to free connections. Use SIGTERM first, then SIGKILL if needed.
Risk: The settlement batch may be interrupted mid-run. The query seen in the slow query log is a read-only SELECT, so killing it cannot corrupt data by itself; if the batch also writes to ledger_entries after the SELECT, verify the pending entries afterwards.
Rollback: If killing the process does not free connections, restart the payment-svc pod: kubectl rollout restart deploy/payment-svc -n production
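The "SIGTERM first, then SIGKILL" step can be sketched as a small helper. It assumes the batch runs as a separate process whose PID is already known (for example from the ps output in the checklist); the function name and the 10-second default grace period are illustrative choices, not values from the incident.

```shell
# stop_gracefully PID [GRACE_SECONDS]
# Sends SIGTERM, polls once per second for up to GRACE_SECONDS, and sends
# SIGKILL only if the process is still present after the grace period.
stop_gracefully() {
  pid=$1
  grace=${2:-10}
  # If SIGTERM cannot be delivered, the process is gone or not ours to signal.
  kill -TERM "$pid" 2>/dev/null || return 0
  while [ "$grace" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
    sleep 1
    grace=$((grace - 1))
  done
  kill -KILL "$pid" 2>/dev/null || true      # last resort
}

# Usage (PID is a placeholder): stop_gracefully 12345 15
```

Inside a pod this would be run through kubectl exec, and only helps if the batch really is a separate process rather than a scheduler thread inside the main service.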
Add an index on ledger_entries(status) to prevent full table scans.
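Risk: A plain CREATE INDEX on a 12M row table blocks writes to the table until the build finishes. Safer to create it concurrently, which avoids the write lock but takes longer and cannot run inside a transaction block: CREATE INDEX CONCURRENTLY idx_ledger_entries_status ON ledger_entries(status);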
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
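Reduce max_connections temporarily to prevent overload, or adjust connection pool limits in payment-svc.
Risk: Reducing max_connections may cause legitimate requests to fail if the pool is too small, and changing it requires a PostgreSQL restart, which drops every existing connection. Raising pool limits in payment-svc does not help while the server is already at 500/500; it only produces more refused connections. Adjust gradually after the index is created.
Rollback: Set max_connections back to the original value (500) in postgresql.conf and restart PostgreSQL (a reload does not apply this setting).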
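When adjusting pool limits, one way to keep the total under the server limit is to budget connections per replica. The sketch below assumes every payment-svc replica uses the same pool size; the replica count of 10 is hypothetical, and 3 is PostgreSQL's default for superuser_reserved_connections.

```shell
# pool_size_per_replica MAX_CONNECTIONS RESERVED REPLICAS
# Largest per-replica pool size such that all replicas together stay within
# MAX_CONNECTIONS minus the connections reserved for superusers.
pool_size_per_replica() {
  echo $(( ($1 - $2) / $3 ))
}

pool_size_per_replica 500 3 10   # prints 49
```

Any other clients of postgres-primary (migrations, reporting, the batch job itself) would need to be subtracted from the budget as well.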
Customer impact
Customers are unable to complete checkouts, receiving HTTP 500 errors. Approximately 12% of all payment requests are failing. No ETA yet; mitigation in progress.
Postmortem draft
Postmortem: Payment-svc Latency Spike and Connection Exhaustion
Date: YYYY-MM-DD
Severity: SEV1
Duration: [start] - [end]
Summary
Brief description of the incident.
Timeline
- 13:50 UTC: Deploy v2.41 with new cron job
- 14:02 UTC: Latency spike detected
- 14:04 UTC: Pager triggered
- [Add mitigation steps and resolution time]
Impact
- p99 latency 4.8s (baseline 120ms)
- 12% error rate
- Failed checkouts for customers
Root Cause
New cron job performing full table scan on ledger_entries without index, exhausting DB connections.
What Went Well
- Monitoring detected the issue quickly
- On-call responded within minutes
What Went Poorly
- Initial misdiagnosis (DB looked healthy)
- No pre-deploy review of cron job performance
Action Items
- [ ] Add index on ledger_entries(status)
- [ ] Add connection pool monitoring alert
- [ ] Implement pre-deploy performance testing for new cron jobs
- [ ] Review deploy process for adding background jobs
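The pre-deploy performance test in the action items could start as a simple plan check: run EXPLAIN on each query a new job issues and fail the check if the plan contains a sequential scan on a large table. The sketch below only inspects EXPLAIN text on stdin; the function name is hypothetical, and how the check is wired into the deploy pipeline is left open.

```shell
# has_seq_scan TABLE
# Reads EXPLAIN output on stdin and succeeds (exit 0) if the plan contains a
# sequential scan on TABLE. "Parallel Seq Scan on TABLE" matches as well.
has_seq_scan() {
  grep -q "Seq Scan on $1"
}

# Usage sketch (staging host and query are placeholders):
#   psql -h <staging-db> -At -c "EXPLAIN SELECT * FROM ledger_entries WHERE status='pending'" \
#     | has_seq_scan ledger_entries && echo "full table scan: add an index before deploy"
```

Plain EXPLAIN is enough here because only the plan shape matters, and it avoids running the query against the database.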
Follow-ups
- P0: Add index on ledger_entries(status) to prevent full table scans. (owner: service owner)
- P1: Implement alert on DB connection utilization >80%. (owner: platform team)
- P1: Add pre-deploy performance testing for new cron jobs (e.g., run EXPLAIN ANALYZE on queries). (owner: service owner)
- P2: Review deploy process to require approval for changes that add background jobs. (owner: platform team)
- P2: Update runbook for payment-svc with steps to handle connection exhaustion. (owner: on-call SRE)
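The connection utilization alert in the follow-ups comes down to a threshold comparison. A minimal sketch, assuming the active and maximum connection counts have already been fetched (for example from pg_stat_activity and max_connections); the function name and message format are illustrative.

```shell
# check_connection_utilization ACTIVE MAX [THRESHOLD_PCT]
# Prints ALERT when utilization is strictly above THRESHOLD_PCT (default 80),
# otherwise OK. Uses integer percentages.
check_connection_utilization() {
  active=$1 max=$2 threshold=${3:-80}
  pct=$(( active * 100 / max ))
  if [ "$pct" -gt "$threshold" ]; then
    echo "ALERT: connections at ${pct}% (${active}/${max})"
  else
    echo "OK: connections at ${pct}% (${active}/${max})"
  fi
}

check_connection_utilization 500 500   # the state during this incident
```

With the figures from this incident the check prints an ALERT line; an alert of this kind would have pointed at connection exhaustion even while CPU looked normal.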
Similar past incidents
lexical match (pg_trgm)
- 100%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 77%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 73%: [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 50%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts