[Eval][v3][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:50:04 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment service p99 latency spiked from 120ms to 4.8s and the error rate rose to 12% after postgres-primary reached its connection limit (500/500 max_connections). The most likely trigger is the settlement batch added in v2.41, whose unindexed query on ledger_entries appears in the slow query log every 30s as a ~2.4s full table scan.
Severity reasoning: User-facing outage: error rate > 1% for >5 min (12% for ~10 min), revenue path broken (failed checkouts). Matches SEV1 criteria.
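The severity rule quoted above can be written down as a small check. This is only a sketch: the thresholds (error rate above 1% for more than 5 minutes, revenue path broken) come from the reasoning line, while the `IncidentSignal` type, the `is_sev1` name, and the choice to require all three conditions together are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignal:
    error_rate: float          # fraction of failing requests, e.g. 0.12 for 12%
    duration_min: float        # minutes the error rate has stayed elevated
    revenue_path_broken: bool  # e.g. checkouts are failing


def is_sev1(sig: IncidentSignal,
            error_rate_threshold: float = 0.01,
            duration_threshold_min: float = 5.0) -> bool:
    """Return True when all three SEV1 conditions hold (assumed to be an AND)."""
    return (sig.error_rate > error_rate_threshold
            and sig.duration_min > duration_threshold_min
            and sig.revenue_path_broken)


# Values taken from the severity reasoning: 12% errors for ~10 minutes.
print(is_sev1(IncidentSignal(error_rate=0.12, duration_min=10, revenue_path_broken=True)))
```

If the real SEV1 policy treats a broken revenue path as sufficient on its own, the `and` on the last condition would need to change.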
deepseek-chat·prompt v3·output: en·10682ms·1990↑ / 1615↓ tok·$0.00231
Root cause hypotheses
- high: The settlement batch added in payment-svc v2.41 runs a full table scan on ledger_entries about every 30s, holding connections open until postgres-primary runs out of connection slots.
Evidence: Deploy at 13:50 UTC added the settlement batch (cron '*/30 * * * *'). Slow query log shows SELECT * FROM ledger_entries WHERE status='pending' every 30s (no index, ~2.4s, ~12M rows). Active connections at 500/500. The cron expression as written fires every 30 minutes, not every 30s, so the actual trigger frequency still needs to be confirmed.
- medium: Connection leak introduced by the payment-svc v2.41 code changes.
Evidence: No direct evidence of a leak. A connection count sitting exactly at max_connections shows saturation but does not by itself distinguish a leak from heavy legitimate use, so the v2.41 change should still be checked for connections that are not returned to the pool.
- low: External traffic spike causing more concurrent requests.
Evidence: No traffic spike reported; latency and errors correlate with the deploy time, not with traffic.
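A quick back-of-the-envelope check helps rank these hypotheses. The sketch below applies Little's law (average connections held = arrival rate × hold time) to the figures in the evidence: one ~2.4s query every 30s. It assumes, hypothetically, that each run uses a single connection and releases it when the query finishes; the function name is illustrative.

```python
def mean_connections_held(query_seconds: float,
                          interval_seconds: float,
                          concurrent_runs: int = 1) -> float:
    """Little's law: average connections occupied = arrival rate * hold time."""
    arrival_rate = concurrent_runs / interval_seconds  # query starts per second
    return arrival_rate * query_seconds


held = mean_connections_held(query_seconds=2.4, interval_seconds=30)
pool_share = held / 500  # fraction of max_connections (500) used on average

print(f"average connections held: {held:.2f}")
print(f"share of max_connections: {pool_share:.4%}")
```

Under those assumptions a single scan occupies well under one connection on average, which is far from 500. That suggests the batch does more than the one slow query implies: many concurrent runs, connections that are not released, or request threads queuing behind it. It is one reason to keep the medium hypothesis open until the batch code has been read.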
Investigation checklist
- Check active connections and waiting queries on PostgreSQL.
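SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend'; SELECT state, count(*) FROM pg_stat_activity WHERE backend_type = 'client backend' GROUP BY state;
Expected: Total client connections near 500 (max_connections). Counting only state = 'active' can understate this, because idle and 'idle in transaction' sessions also occupy slots. With the limit reached, this check may only be possible from a superuser session using the reserved connection slots.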
- Identify the query consuming resources and its source.
SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;
Expected: Multiple entries of SELECT * FROM ledger_entries WHERE status='pending' with old query_start timestamps.
- Check payment-svc logs for connection errors.
kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE 'too many clients|connection refused' | tail -20
Expected: Repeated 'FATAL: sorry, too many clients already' and 'connection refused'.
- Verify the new cron job in the deployment.
kubectl get cronjob -n prod -l app=payment-svc -o yaml | grep -A5 schedule
Expected: Schedule: */30 * * * * for settlement-batch.
- Check if the query can be optimized with an index.
SELECT tablename, indexname, indexdef FROM pg_indexes WHERE tablename = 'ledger_entries';
Expected: No index on status column.
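When verifying the schedule, it is worth reading the cron expression closely. In standard five-field cron syntax the first field is minutes, so `*/30 * * * *` fires twice an hour, while the slow query log reports the query every 30 seconds. The sketch below expands a cron minute field to make that concrete; `expand_minute_field` is a hypothetical helper written for this note, not part of any cron library.

```python
def expand_minute_field(field: str) -> list[int]:
    """Expand a cron minute field ('*', '*/n', 'a', 'a-b', 'a-b/n', comma lists)."""
    minutes: set[int] = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if base == "*":
            lo, hi = 0, 59
        elif "-" in base:
            lo_s, hi_s = base.split("-")
            lo, hi = int(lo_s), int(hi_s)
        else:
            lo = int(base)
            hi = 59 if step else lo  # 'a/n' is read as "from a to 59, step n"
        minutes.update(range(lo, hi + 1, step_n))
    return sorted(minutes)


fires_at = expand_minute_field("*/30")
print(fires_at)                  # [0, 30]
print(60 // len(fires_at))       # 30 (minutes between runs)
```

A 30-second cadence therefore cannot come from this cron entry alone. Possible explanations to check include an in-process scheduler or retry loop inside the batch, or several replicas each running the job.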
Mitigation plan
Immediately kill the slow queries to free up connections.
Risk: May cause incomplete settlement batch; no data loss.
Rollback: Re-run the batch manually after mitigation.
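Temporarily increase max_connections to 1000 to relieve pressure. max_connections only takes effect at server start, so this step requires a restart of postgres-primary.
Risk: The restart drops all existing connections. A higher limit also increases DB memory usage; monitor CPU and memory.
Rollback: ALTER SYSTEM SET max_connections = 500; then restart PostgreSQL (SELECT pg_reload_conf() does not apply this parameter).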
Rollback payment-svc to v2.40 to remove the cron job.
Risk: The rollback may cause brief downtime, but v2.40 was stable before the deploy.
Rollback: Re-deploy v2.41 after fixing the issue.
Add an index on ledger_entries.status to speed up the query.
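Risk: A plain CREATE INDEX blocks writes to ledger_entries for the duration of the build. Use CREATE INDEX CONCURRENTLY, which avoids blocking writes but takes longer and cannot run inside a transaction block.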
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
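For the first mitigation step, the sketch below shows one way to choose which backends to terminate from rows shaped like the pg_stat_activity output used in the checklist, so that only the batch query is targeted and checkout traffic is left alone. The function name, the 2-second cutoff, and the sample rows are all made up for illustration; each selected pid would then be passed to PostgreSQL's `pg_terminate_backend(pid)` (or `pg_cancel_backend(pid)` to cancel the query without closing the session).

```python
from datetime import datetime, timedelta, timezone

# Fragment of the batch query as it appears in the slow query log.
BATCH_QUERY_FRAGMENT = "FROM ledger_entries WHERE status='pending'"


def select_pids_to_terminate(rows, now,
                             fragment=BATCH_QUERY_FRAGMENT,
                             min_runtime=timedelta(seconds=2)):
    """Return pids of active backends running the batch query for at least min_runtime."""
    pids = []
    for row in rows:
        running_for = now - row["query_start"]
        if (row["state"] == "active"
                and fragment in row["query"]
                and running_for >= min_runtime):
            pids.append(row["pid"])
    return pids


# Made-up sample rows mimicking pg_stat_activity columns.
now = datetime(2030, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
rows = [
    {"pid": 101, "state": "active", "query_start": now - timedelta(seconds=5),
     "query": "SELECT * FROM ledger_entries WHERE status='pending'"},
    {"pid": 102, "state": "active", "query_start": now - timedelta(seconds=10),
     "query": "INSERT INTO payments (id) VALUES (1)"},
    {"pid": 103, "state": "active", "query_start": now - timedelta(seconds=1),
     "query": "SELECT * FROM ledger_entries WHERE status='pending'"},
    {"pid": 104, "state": "idle", "query_start": now - timedelta(seconds=60),
     "query": "SELECT * FROM ledger_entries WHERE status='pending'"},
]
print(select_pids_to_terminate(rows, now))  # [101]
```

Matching on a query fragment is deliberately narrow: terminating every long-running session would also kill in-flight payment writes.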
Customer impact
Customers are experiencing failed checkouts and errors when trying to complete payments. Approximately 12% of requests are failing, mostly with HTTP 500 errors. The issue started at 14:02 UTC and is ongoing.
Postmortem draft
Summary
Payment service experienced a SEV1 outage from 14:02 UTC to [FILL IN] due to a new cron job in v2.41 that exhausted PostgreSQL connections.
Timeline (UTC)
- 13:50 - Deploy payment-svc v2.41 with settlement batch cron
- 14:02 - p99 latency spikes to 4.8s, error rate climbs
- 14:03 - CS reports failed checkouts
- 14:04 - Pager alert
- 14:06 - Initial mis-diagnosis: 'DB looks healthy'
- [FILL IN] - Mitigation applied
Impact
- p99 latency: 120ms → 4.8s
- Error rate: 0.1% → 12%
- Affected users: customers attempting checkout (about 12% of requests failed)
Root Cause
The settlement batch added in v2.41 runs a full table scan on ledger_entries (~12M rows, ~2.4s per run) about every 30s according to the slow query log, holding connections until max_connections (500) was reached.
Detection
Pager alert for PaymentSvcErrorRateHigh. CS reports via Slack.
Response
- Killed slow queries
- Increased max_connections
- Rolled back to v2.40
- Added index on status
What Went Well
- Quick identification of slow query
- Rollback was straightforward
What Went Poorly
- Initial mis-diagnosis delayed action
- No index on status column
- No load testing for new cron
Action Items
- Add index on ledger_entries.status
- Add connection pool monitoring alert
- Add query performance regression tests in CI
- Review cron job implementation for efficiency
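The connection monitoring item can be expressed as a simple saturation check. This sketch uses the 80% of max_connections threshold proposed in the follow-ups; the function names are hypothetical, and in practice the rule would live in the monitoring system rather than in application code.

```python
def connection_saturation(active: int, max_connections: int) -> float:
    """Fraction of max_connections currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def should_alert(active: int, max_connections: int, threshold: float = 0.8) -> bool:
    """Alert when usage is strictly above the threshold."""
    return connection_saturation(active, max_connections) > threshold


print(should_alert(500, 500))  # True: the state observed during this incident
print(should_alert(100, 500))  # False
```

An alert at this level would have fired before the 500/500 state, when CPU still looked normal.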
Follow-ups
- P0: Add index on ledger_entries.status column (owner: database-administrator)
- P1: Set up alert on PostgreSQL active connections > 80% of max_connections (owner: on-call SRE)
- P1: Add query performance regression tests in CI for new cron jobs (owner: platform-team)
- P2: Review cron job implementation to use batch processing with limits (owner: payments-platform)
- P2: Update runbook for connection pool exhaustion with specific commands (owner: on-call SRE)
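As a starting point for the batch-processing follow-up, the sketch below reads pending rows in bounded batches using keyset pagination (`id > last_id ORDER BY id LIMIT n`) instead of one unbounded SELECT *. It runs against an in-memory SQLite table as a stand-in for PostgreSQL, and the `id` and `amount` columns are assumptions, since the real ledger_entries schema is not shown in this report.

```python
import sqlite3


def iter_pending_batches(conn, batch_size=1000):
    """Yield pending ledger rows in batches of at most batch_size, ordered by id."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM ledger_entries "
            "WHERE status = 'pending' AND id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last id seen


# Stand-in data: 10 rows, every other one pending (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger_entries (id INTEGER PRIMARY KEY, status TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO ledger_entries (status, amount) VALUES (?, ?)",
    [("pending" if i % 2 == 0 else "settled", i) for i in range(10)],
)

batches = list(iter_pending_batches(conn, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Each batch is a short query that holds its connection only briefly, and with an index covering status the WHERE clause no longer needs a full table scan.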
Similar past incidents
lexical match (pg_trgm)
- 82% [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 76% [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 76% [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 49% [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46% [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts