[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:46:42 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
At 14:02 UTC, payment-svc p99 latency rose from 120ms to 4.8s and the error rate reached 12% (mostly HTTP 500). PostgreSQL was at its connection limit (500/500 max_connections). The likely cause is the settlement batch job shipped in v2.41 at 13:50 UTC, which runs an unindexed full table scan on ledger_entries every 30s.
Severity reasoning: User-facing outage: customers cannot complete checkouts (revenue impact). 12% error rate and high latency affect a large subset of users. Reversible by disabling the cron job or adding an index, but immediate mitigation required.
Root cause hypotheses
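- high: New settlement batch job in payment-svc v2.41 runs a full table scan on ledger_entries every 30s, holding connections open and exhausting the pool.
Evidence: The 13:50 UTC deploy precedes symptom onset (14:02 UTC) by 12 minutes. The slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index, ~2.4s per execution) running every 30s. active_connections is at max_connections (500/500). Note: the changelog's cron expression */30 * * * * means every 30 minutes, which matches neither "nightly" nor the 30s interval in the slow query log, so the job's actual schedule should be confirmed.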
- medium: Connection pool leak in payment-svc v2.41 causes connections not to be returned to the pool after query execution.
Evidence: Connection count hit max_connections rapidly; application logs show 'too many clients already'. However, the periodic query in the slow query log is explained more directly by the new batch job than by a leak.
- low: External DDoS or traffic spike overwhelming payment-svc and DB connections.
Evidence: No evidence of traffic spike in provided context; error rate correlates with deploy time, not gradual increase.
Investigation checklist
- Check if the new cron job is running and consuming connections.
kubectl exec -n prod deploy/payment-svc -- ps aux | grep settlement
Expected: A process running the settlement batch (e.g., 'settlement-cron') should appear.
- Confirm the slow query and its frequency.
kubectl exec -n prod svc/postgres-primary -- psql -c "SELECT query, calls, total_time, rows FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;"
Expected: The query SELECT * FROM ledger_entries WHERE status='pending' should be near the top with high total_time. This requires the pg_stat_statements extension; on PostgreSQL 13 and later the column is named total_exec_time instead of total_time.
- Check current active connections and which application is using them.
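kubectl exec -n prod svc/postgres-primary -- psql -c "SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY count(*) DESC;"
Expected: Many connections in 'active' state, most of them attributed to payment-svc (assuming the service sets application_name on its connections).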
- Verify the deploy rollout status.
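kubectl rollout status -n prod deploy/payment-svc
Expected: The deployment is reported as successfully rolled out. This command does not print timestamps; the 13:50 UTC deploy time comes from the deploy history.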
- Check if there is a feature flag to disable the cron job.
kubectl exec -n prod deploy/payment-svc -- env | grep SETTLEMENT
Expected: An environment variable like SETTLEMENT_CRON_ENABLED=true might exist.
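The checklist above counts connections but does not show which sessions have been running longest. A minimal sketch of that check follows, assuming the same prod namespace and postgres-primary service used in the commands above. The function names and the 2-second default threshold are hypothetical, chosen only to sit just under the ~2.4s scan time.

```shell
#!/bin/sh
# Sketch only: function names and the 2-second default threshold are assumptions.

# Print SQL that lists active queries running longer than $1 seconds (default 2),
# with the owning application and client address for each session.
long_running_sql() {
  threshold_s="${1:-2}"
  cat <<EOF
SELECT pid,
       application_name,
       client_addr,
       now() - query_start AS runtime,
       left(query, 80) AS query_head
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '${threshold_s} seconds'
ORDER BY runtime DESC;
EOF
}

# Run the query on the primary, using the same kubectl pattern as the checklist.
show_long_running() {
  kubectl exec -n prod svc/postgres-primary -- psql -c "$(long_running_sql "$1")"
}
```

Calling show_long_running 5 would list sessions active for more than five seconds. If the ledger_entries scan dominates the output, that supports the high-confidence hypothesis; many unrelated or idle sessions would point toward the pool-leak hypothesis instead.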
Mitigation plan
Immediately disable the settlement cron job by scaling down the cron workload or toggling a feature flag.
Risk: Settlement batch will not run, potentially delaying financial reconciliation. No data loss.
Rollback: Re-enable the cron job after adding an index on ledger_entries.status.
Kill existing slow queries to free up connections.
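Risk: Terminating a query rolls back its open transaction, so in-flight work is lost but data is not corrupted. Confirm the target sessions belong to the settlement job before terminating them.
Rollback: None needed; the batch job issues the query again on its next run.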
Add an index on ledger_entries.status to prevent full table scans.
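Risk: A plain CREATE INDEX blocks writes to the table for the duration of the build. Use CREATE INDEX CONCURRENTLY, which avoids blocking writes but takes longer and cannot run inside a transaction block.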
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
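The second and third mitigation steps can be sketched as SQL wrapped in small shell helpers. This is an illustration under assumptions, not a vetted runbook: the helper names are hypothetical, the LIKE pattern assumes the query text appears in pg_stat_activity as it does in the slow query log, and the index name is taken from the rollback line above.

```shell
#!/bin/sh
# Sketch only: helper names and the LIKE pattern are assumptions.

# SQL to terminate backends currently running the settlement scan.
terminate_scan_sql() {
  cat <<'EOF'
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND query LIKE 'SELECT * FROM ledger_entries WHERE status=%';
EOF
}

# SQL to build the index without blocking writes.
create_index_sql() {
  cat <<'EOF'
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_ledger_entries_status
  ON ledger_entries (status);
EOF
}

# SQL to check which plan the settlement query gets after the index exists.
verify_plan_sql() {
  cat <<'EOF'
EXPLAIN SELECT * FROM ledger_entries WHERE status = 'pending';
EOF
}

# Run one statement per psql invocation so CREATE INDEX CONCURRENTLY
# is never sent inside a multi-statement transaction.
run_sql() {
  kubectl exec -n prod svc/postgres-primary -- psql -c "$1"
}
```

For example, run_sql "$(terminate_scan_sql)" followed by run_sql "$(create_index_sql)". pg_cancel_backend could be used instead of pg_terminate_backend to stop the query while keeping the session open. Even with the index in place, the planner may still choose a sequential scan if a large share of rows are pending, so the EXPLAIN output is worth checking before re-enabling the job.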
Customer impact
Customers are experiencing failed checkouts with HTTP 500 errors. Approximately 12% of checkout attempts are failing. No ETA yet; mitigation in progress.
Postmortem draft
Postmortem: Payment-svc Latency Spike and Error Rate
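Date: YYYY-MM-DD
Severity: SEV1
Summary: After payment-svc v2.41 shipped at 13:50 UTC with a new settlement batch job, PostgreSQL reached its 500-connection limit. From 14:02 UTC, p99 latency rose from 120ms to 4.8s and 12% of requests failed, mostly with HTTP 500, causing failed checkouts.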
Timeline
- 13:50 UTC - Deploy payment-svc v2.41 with settlement cron
- 14:02 UTC - Latency spike detected
- 14:04 UTC - Pager triggered
- 14:06 UTC - Initial misdiagnosis (DB CPU fine)
- [Add mitigation times]
Impact
- p99 latency 4.8s, 12% error rate
- Failed checkouts for customers
Root Cause
New cron job running full table scan every 30s exhausted DB connections.
What Went Well
- Monitoring detected issue quickly
What Went Poorly
- Initial misdiagnosis delayed mitigation
- No index on status column
- Cron job not tested under load
Action Items
- [ ] Add index on ledger_entries.status
- [ ] Add connection pool monitoring alert
- [ ] Add load test for cron jobs
- [ ] Review deploy process for cron changes
Follow-ups
- P0: Add index on ledger_entries.status to prevent full table scans. (service owner)
- P1: Set up alert on active_connections approaching max_connections. (platform team)
- P1: Add load testing for new cron jobs before production deploy. (service owner)
- P2: Review deploy process to require performance review for cron changes. (platform team)
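The connection alert in the follow-ups comes down to comparing active_connections against max_connections. A minimal sketch of that comparison is below. The 80% and 95% thresholds and the function name are assumptions for illustration; the real values would be chosen by the platform team.

```shell
#!/bin/sh
# Sketch only: thresholds (80% warn, 95% crit) and the function name are assumptions.

# Classify connection utilization from two integers: active and max connections.
conn_alert_level() {
  active="$1"
  max="$2"
  if [ "$max" -le 0 ]; then
    echo "invalid"
    return 1
  fi
  # Integer percentage, rounded down.
  pct=$(( active * 100 / max ))
  if [ "$pct" -ge 95 ]; then
    echo "crit"
  elif [ "$pct" -ge 80 ]; then
    echo "warn"
  else
    echo "ok"
  fi
}
```

With the figures from this incident, conn_alert_level 500 500 prints crit. A warning at 80% (400 of 500 connections) would have fired before the pool was fully exhausted.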
Similar past incidents
lexical match (pg_trgm)
- 80%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 71%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 67%: [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 50%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts