[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:48:11 PM
Raw incident context
Time: 14:02 UTC.
payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment service (payment-svc) is in a severe outage: p99 latency has spiked from 120ms to 4.8s and the error rate has reached 12%. PostgreSQL has hit its connection limit (500/500), most likely because of a new unindexed query issued by the scheduled job shipped in v2.41.
Severity reasoning: User-facing outage: customers report failed checkouts, error rate >1% for >5 min (12% for ~10 min), revenue path broken. This meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·12715ms·1739↑ / 1860↓ tok·$0.00252
Root cause hypotheses
- high: New cron job in payment-svc v2.41 runs an unindexed query (SELECT * FROM ledger_entries WHERE status='pending') every 30s, consuming DB connections and causing connection pool exhaustion.
Evidence: Deploy at 13:50 UTC added cron job; slow query log shows the query taking ~2.4s; active_connections at max (500); app logs show 'too many clients already'.
- medium: Connection pool leak in payment-svc due to a bug in v2.41, causing connections not to be released.
Evidence: Active connections at max; no other changes; but CPU is normal, suggesting connections are idle or waiting.
- low: External traffic spike overwhelming payment-svc, causing cascading connection exhaustion.
Evidence: No traffic spike reported; error rate correlates with deploy time; latency increase is consistent with DB bottleneck.
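A quick way to weigh the first two hypotheses is a back-of-envelope estimate using Little's law (average connections held = query rate × time each query holds a connection). The sketch below uses only the figures from the incident context; the `schedulers` parameter is a hypothetical stand-in for anything that multiplies the job (for example one in-process scheduler per replica), and the model assumes each run releases its connection when the query finishes.

```python
def avg_connections_held(interval_s: float, duration_s: float, schedulers: int = 1) -> float:
    """Average number of connections occupied by a periodic query (Little's law).

    schedulers is a hypothetical count of processes that each run the job
    independently; it is not a figure taken from the incident data.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return schedulers * duration_s / interval_s


def schedulers_needed(max_connections: int, interval_s: float, duration_s: float) -> float:
    """How many independent schedulers it would take to fill max_connections."""
    return max_connections / avg_connections_held(interval_s, duration_s)


if __name__ == "__main__":
    # Incident figures: one run every 30s, ~2.4s per run, max_connections = 500.
    print(f"avg connections held by one scheduler: {avg_connections_held(30, 2.4):.2f}")
    print(f"schedulers needed to reach 500: {schedulers_needed(500, 30, 2.4):.0f}")
```

Under these assumptions a single well-behaved scheduler holds far less than one connection on average, so if the first hypothesis is correct, something must be amplifying it, such as overlapping runs, many concurrent schedulers, or connections that are not returned after each run. That makes the pg_stat_activity check in the investigation checklist the key discriminator.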
Investigation checklist
- Check the cron job configuration in the new deployment to confirm the query and schedule.
kubectl get cronjob -n prod -l app=payment-svc -o yaml | grep -A10 'schedule\|command\|args'
Expected: Should show schedule */30 * * * * and a command containing "SELECT * FROM ledger_entries WHERE status='pending'".
- Verify the query plan and index status on ledger_entries.
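EXPLAIN SELECT * FROM ledger_entries WHERE status='pending';
Expected: Should show Seq Scan on ledger_entries with a filter on status and no index usage. Plain EXPLAIN is preferred here because EXPLAIN ANALYZE actually executes the query, adding another ~2.4s full table scan to a database that is already saturated.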
- Check current active connections and their states on PostgreSQL.
SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;
Expected: Multiple connections showing the slow query or waiting state.
- Check if the cron job is still running and consuming connections.
kubectl get pods -n prod -l app=payment-svc --field-selector=status.phase=Running | grep -i cron
Expected: Pods named payment-svc-cron-* should be running.
- Review payment-svc logs for connection pool errors.
kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE 'connection pool|too many clients|connection refused' | tail -50
Expected: Repeated 'FATAL: sorry, too many clients already' and 'connection refused' errors.
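The pg_stat_activity check above returns individual rows, but telling a slow-query pile-up apart from a connection leak is easier with counts per state. The following is a minimal sketch that summarizes rows already fetched from pg_stat_activity; the `(pid, state, query)` row shape, the function name, and the sample rows are illustrative assumptions, not output from the incident.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Row = Tuple[int, str, str]  # (pid, state, query) as selected from pg_stat_activity


def summarize_activity(rows: Iterable[Row], needle: str = "ledger_entries") -> Dict[str, object]:
    """Count connections per state and how many are running a query containing needle."""
    rows = list(rows)
    by_state = Counter(state for _, state, _ in rows)
    matching = sum(1 for _, _, query in rows if needle in query)
    return {"total": len(rows), "by_state": dict(by_state), "matching": matching}


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only.
    sample = [
        (101, "active", "SELECT * FROM ledger_entries WHERE status='pending'"),
        (102, "idle in transaction", "UPDATE payments SET state='captured' WHERE id=$1"),
        (103, "active", "SELECT 1"),
    ]
    print(summarize_activity(sample))
```

A large `matching` count would support the unindexed-query hypothesis, while many connections in "idle in transaction" would point toward connections not being released. To count all connections rather than a sample, the checklist query would need to run without its LIMIT 20.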
Mitigation plan
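Immediately suspend the cron job to stop the unindexed query (kubectl scale does not support CronJob resources, so the job is suspended instead): kubectl patch cronjob payment-svc-cron -n prod -p '{"spec":{"suspend":true}}'
Risk: Stops future runs of the nightly settlement batch; no revenue impact as it's non-critical. Runs that have already started continue until they finish or are terminated. No destructive ops.
Rollback: Resume the schedule: kubectl patch cronjob payment-svc-cron -n prod -p '{"spec":{"suspend":false}}'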
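Kill the slow queries currently running to free up connections, excluding the session that issues the command and matching only the batch query so that other ledger_entries traffic is left alone: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE 'SELECT * FROM ledger_entries WHERE status=%' AND state = 'active' AND pid <> pg_backend_pid();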
Risk: Terminates active queries; may cause incomplete batch processing but safe as batch can be re-run. No data loss.
Rollback: No rollback needed; queries will be re-issued by cron if re-enabled.
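Increase max_connections temporarily to allow recovery, only if the first two steps do not free connections: ALTER SYSTEM SET max_connections = 800; followed by a PostgreSQL restart. max_connections can only be changed at server start, so SELECT pg_reload_conf() alone does not apply it.
Risk: The restart drops all existing connections and briefly interrupts the database. A higher limit may also increase memory pressure on DB; monitor memory usage. Safe only if memory headroom exists.
Rollback: Set back to 500 and restart again in a low-traffic window: ALTER SYSTEM SET max_connections = 500;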
Customer impact
Customers are experiencing failed checkouts and errors when trying to complete payments. Approximately 12% of payment attempts are failing with HTTP 500 errors. The issue started around 14:02 UTC and is ongoing. No ETA yet.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 13:50 - Deployment of payment-svc v2.41 with new cron job
- 14:02 - p99 latency spikes to 4.8s, error rate climbs
- 14:03 - CS reports failed checkouts
- 14:04 - Pager alert triggered
- 14:06 - Initial mis-diagnosis: DB CPU looks fine
- [FILL IN] - Root cause identified: unindexed query from cron job
- [FILL IN] - Mitigation applied: cron job stopped, slow queries killed
Impact
- 12% error rate on payment-svc for ~[FILL IN] minutes
- p99 latency 4.8s (normal 120ms)
- Customer-facing checkout failures
Root Cause
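A new cron job added in v2.41 runs the query SELECT * FROM ledger_entries WHERE status='pending'. The slow query log shows it executing every 30 seconds, whereas the changelog's cron expression */30 * * * * would normally mean every 30 minutes, so the effective schedule still needs to be confirmed. With no index on status, each run performs a full table scan over ~12M rows and takes ~2.4s. PostgreSQL reached its connection limit (500/500 max_connections) shortly after the deploy, producing "too many clients already" and connection refused errors and cascading failures in payment-svc. How the job's connections accumulated up to the limit (for example overlapping runs or connections not being released): [FILL IN]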
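The incident context gives two different cadences for the job: the changelog's cron expression `*/30 * * * *` and the slow query log's "every 30s". A small sketch makes the gap concrete by expanding the minute field of a standard five-field cron expression. The helper names are hypothetical, and the sketch deliberately handles only `*`, `*/N`, and a single number in the minute field with wildcards everywhere else.

```python
from typing import List


def expand_minute_field(field: str) -> List[int]:
    """Expand a cron minute field of the form '*', '*/N', or 'M' into minute values."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    minute = int(field)
    if not 0 <= minute < 60:
        raise ValueError("minute out of range")
    return [minute]


def runs_per_hour(expr: str) -> int:
    """Runs per hour for a five-field cron expression with wildcard hour/day/month/weekday."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected five fields")
    if fields[1:] != ["*"] * 4:
        raise ValueError("sketch only handles '*' in the last four fields")
    return len(expand_minute_field(fields[0]))


if __name__ == "__main__":
    print(expand_minute_field("*/30"))      # minutes of each hour at which the job fires
    print(runs_per_hour("*/30 * * * *"))    # runs per hour under standard cron semantics
    print(3600 // 30)                       # runs per hour if the job really fires every 30s
```

Under standard cron semantics the expression fires twice an hour, while a 30-second cadence means 120 runs an hour. One of the two sources is describing something other than the cron schedule (for example an in-process timer or retries), which is worth resolving before the root cause section is finalized.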
Detection
Pager alert PaymentSvcErrorRateHigh at 14:04 UTC. Customer reports via CS at 14:03.
Response
- [FILL IN] - Stopped cron job
- [FILL IN] - Killed slow queries
- [FILL IN] - Increased max_connections temporarily
What Went Well
- Alerting worked
- On-call responded quickly
- Logs and metrics clearly pointed to DB connection exhaustion
What Went Poorly
- Initial mis-diagnosis: DB CPU looked fine, so DB was initially ruled out
- No pre-deployment review of the cron job query
- Missing index on ledger_entries.status
Action Items
- [FILL IN] - Add index on ledger_entries.status
- [FILL IN] - Add query performance regression tests in CI
- [FILL IN] - Add monitoring for connection pool utilization and slow queries
- [FILL IN] - Review deployment process for cron jobs
Follow-ups
- P0: Add index on ledger_entries.status to prevent full table scans. (Database team)
- P1: Add monitoring alert for DB connection pool utilization > 80%. (Platform team)
- P1: Add slow query monitoring and alerting on PostgreSQL. (Platform team)
- P1: Implement pre-deployment query performance review for cron jobs. (Service owner)
- P2: Add integration test that runs the cron job query against a staging database with representative data. (QA team)
- P2: Review deployment process to require performance testing for new cron jobs. (Service owner)
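The connection utilization alert in the follow-ups can be sketched as a simple threshold rule. The 80% threshold comes from the follow-up item; requiring several consecutive samples above the threshold is an added assumption to avoid paging on a single spike, and the function name and sample values are hypothetical.

```python
from typing import Sequence


def should_alert(
    samples: Sequence[int],
    max_connections: int,
    threshold: float = 0.80,
    min_consecutive: int = 3,
) -> bool:
    """Return True if the last min_consecutive samples all exceed the utilization threshold.

    samples are active connection counts, oldest first.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    if len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(count / max_connections > threshold for count in recent)


if __name__ == "__main__":
    # Hypothetical scrape values leading up to 500/500.
    print(should_alert([300, 450, 480, 500], max_connections=500))
```

With max_connections at 500, this rule would start firing once utilization stays above 400 connections, which in this incident would have paged before the database began refusing clients.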
Similar past incidents
lexical match (pg_trgm)
- 71%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 66%: [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 66%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
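For readers unfamiliar with the "lexical match (pg_trgm)" label, the percentages are trigram similarity scores. The sketch below approximates how pg_trgm scores two strings: lowercase the text, split it into alphanumeric words, pad each word with two leading spaces and one trailing space, and divide the number of shared trigrams by the number of distinct trigrams across both strings. It is an ASCII-only approximation, not the extension's implementation, and since the fields that were actually compared are not stated, it is not expected to reproduce the scores listed above.

```python
import re
from typing import Set


def trigrams(text: str) -> Set[str]:
    """Extract the trigram set of a string, approximating pg_trgm's word padding."""
    grams: Set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    print(round(similarity("word", "two words"), 4))
```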