[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:35:28 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC: payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03: CS reports failed checkout volume rising
- 14:04: Pager: PaymentSvcErrorRateHigh
- 14:06: "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment service p99 latency spiked from 120ms to 4.8s and error rate rose to 12% due to a new cron job in v2.41 exhausting the Postgres connection pool (max 500) with a full table scan query on ledger_entries. Customers are experiencing failed checkouts.
Severity reasoning: User-facing outage with >1% error rate (12%) for >5 minutes, revenue path broken (failed checkouts).
deepseek-chat · prompt v2 · output: en · 10567ms · 1739↑ / 1480↓ tok · $0.00210
Root cause hypotheses
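- high: New cron job in payment-svc v2.41 runs a slow query every 30s, consuming connections and blocking others.
Evidence: Deploy at 13:50 UTC added the settlement batch with cron '*/30 * * * *'; slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index, ~2.4s) running every 30s; active_connections at max 500. Note that '*/30 * * * *' fires every 30 minutes, so neither the changelog's "nightly" label nor the observed 30s cadence matches the schedule as written; confirm what actually triggers the query.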
- medium: Connection pool leak in payment-svc due to unclosed connections from the new code.
Evidence: Application logs show 'too many clients already' but no explicit leak evidence; connection count is exactly max_connections.
- low: External traffic spike causing increased connection demand.
Evidence: No traffic spike observed; error rate correlates with deploy time, not traffic patterns.
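A rough sanity check on the first two hypotheses, using only the figures from the slow query log. It applies Little's law (average connections held = start rate × time held) and assumes, as a simplification, that each connection is released as soon as its query finishes. The function name and the `schedulers` parameter are illustrative, not taken from payment-svc.

```python
def steady_state_connections(interval_s: float, duration_s: float, schedulers: int = 1) -> float:
    """Average connections held by a periodic query (Little's law: L = rate * duration)."""
    arrival_rate = schedulers / interval_s  # query starts per second across all schedulers
    return arrival_rate * duration_s


# Figures from the slow query log: one run every 30s, ~2.4s per run.
single = steady_state_connections(interval_s=30, duration_s=2.4)
print(f"one scheduler holds {single:.2f} connections on average")

# Hypothetical: independent schedulers needed to occupy all 500 slots
# if every connection were released promptly after its query.
needed = 500 / single
print(f"schedulers needed to fill 500 slots: {needed:.0f}")
```

Under that assumption a single scheduler holds well under one connection on average, so the query cadence alone does not explain 500 occupied slots. Connections are probably being held longer than the query itself (unclosed connections, open transactions, or retries piling up), which is why the medium hypothesis is worth checking alongside the first.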
Investigation checklist
- Check if the slow query is from the new cron job by examining the application logs for the cron execution.
kubectl logs -n prod -l app=payment-svc --since=30m | grep -E 'settlement|cron|ledger_entries' | head -20
Expected: Lines showing cron job execution and the slow query being logged.
- Identify the source of connections consuming the pool.
SELECT pid, application_name, client_addr, query, state, query_start FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;
Expected: Multiple connections running the slow SELECT * FROM ledger_entries query, all from payment-svc pods.
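- Terminate the slow queries to free connections (this step changes state; it is not a read-only check).
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE '%ledger_entries%' AND state = 'active' AND pid <> pg_backend_pid();
Expected: Terminates the blocking queries; connections freed. The pid <> pg_backend_pid() filter keeps the statement from terminating its own session, whose query text also contains 'ledger_entries'. The LIKE filter matches any active query that mentions ledger_entries, not only the batch scan, so review the pg_stat_activity output from the previous step first.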
- Verify the deploy manifest for the cron job configuration.
kubectl get deployment payment-svc -n prod -o yaml | grep -A 10 'cron'
Expected: Cron schedule and command for the settlement batch.
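When reading the schedule from the manifest, it helps to expand the minute field to see when the job actually fires. The sketch below handles only the simple forms '*', '*/N' and a single number; it is not a full cron parser, and `expand_minute_field` is an illustrative helper, not part of payment-svc.

```python
from typing import List


def expand_minute_field(field: str) -> List[int]:
    """Expand a cron minute field of the form '*', '*/N', or 'M' into matching minutes."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(0, 60, step))
    return [int(field)]


schedule = "*/30 * * * *"  # schedule quoted in the v2.41 changelog
minute_field = schedule.split()[0]
minutes = expand_minute_field(minute_field)
print(minutes)  # minutes of each hour at which the job fires
```

For the schedule in the changelog this gives minutes 0 and 30 of every hour. A standard five-field cron expression cannot fire more often than once per minute, so the 30s cadence in the slow query log must come from something other than this schedule alone, for example an in-process timer or a retry loop.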
Mitigation plan
Immediately terminate the slow queries to free connections and restore service.
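Risk: Postgres rolls back the transaction of a terminated backend, so a single query cannot leave ledger_entries half-written, but a batch that commits in several steps may stop partway; the batch may need to be re-run.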
Rollback: If service does not recover, restart payment-svc pods to reset connection pools.
Roll back payment-svc to the previous version, v2.40, to remove the cron job.
Risk: Rollback may cause brief downtime during deployment; ensure no data loss from the batch.
Rollback: Re-deploy v2.41 if rollback causes issues, but first disable the cron via feature flag.
Add an index on ledger_entries.status to prevent full table scans.
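Risk: A plain CREATE INDEX blocks writes to ledger_entries for the whole build, which is unlikely to be brief on ~12M rows; CREATE INDEX CONCURRENTLY avoids blocking writes at the cost of a slower build.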
Rollback: Drop the index if it causes performance degradation.
Customer impact
Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of requests are failing with HTTP 500. Affected users see 'payment failed' messages. Estimated impact: thousands of users over the last 15 minutes.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 13:50: Deploy payment-svc v2.41 with new cron job
- 14:02: p99 latency spikes to 4.8s, error rate rises
- 14:03: CS reports failed checkouts
- 14:04: Pager alert
- 14:06: Initial mis-diagnosis (DB CPU fine)
- [FILL IN] Mitigation actions
Impact
- 12% error rate on payment-svc for ~15 minutes
- Failed checkouts for customers
Root Cause
New cron job in v2.41 runs a full table scan query every 30s on ledger_entries (no index on status), exhausting Postgres connection pool.
Detection
Pager alert and customer reports.
Response
[FILL IN]
What Went Well
[FILL IN]
What Went Poorly
- Initial mis-diagnosis (CPU fine ≠ DB healthy)
- No pre-deploy review of new cron job
- Missing index on ledger_entries.status
Action Items
- Add index on ledger_entries.status
- Add connection pool monitoring alert
- Add slow query alert for queries >1s
- Review deploy process for cron jobs
- Add feature flag for new batch jobs
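A minimal sketch of the two alert rules from the action items, evaluated against the figures seen during this incident. The 80% pool ratio is the value proposed in the follow-ups and the 1s limit comes from the slow query item; the `PoolSample` type, the function name and the alert labels are illustrative assumptions, not an existing monitoring API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolSample:
    active_connections: int
    max_connections: int
    slowest_query_s: float


def pool_alerts(sample: PoolSample, pool_ratio: float = 0.8, slow_query_s: float = 1.0) -> List[str]:
    """Return the names of the alert rules that the sample violates."""
    alerts = []
    if sample.active_connections > pool_ratio * sample.max_connections:
        alerts.append("connection_pool_high")
    if sample.slowest_query_s > slow_query_s:
        alerts.append("slow_query")
    return alerts


# Values from the raw incident context: 500 / 500 connections, ~2.4s query.
incident = PoolSample(active_connections=500, max_connections=500, slowest_query_s=2.4)
print(pool_alerts(incident))
```

Both rules fire on the incident sample, while the CPU metric (35%) looked normal. With a ratio below 100%, the pool rule would also fire before max_connections is reached, at 400 connections for the default here.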
Follow-ups
- P0: Add index on ledger_entries.status column (owner: Database team)
- P1: Set up alert on Postgres active_connections > 80% max (owner: Platform team)
- P1: Add slow query monitoring and alert for queries >1s (owner: Observability team)
- P2: Review deploy process to require approval for new cron jobs (owner: Service owner)
- P1: Implement feature flag for settlement batch to disable without rollback (owner: Payment-svc team)
- P2: Update runbook for connection pool exhaustion (owner: On-call SRE)
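One possible shape for the settlement batch feature flag, sketched with an environment variable. The variable name SETTLEMENT_BATCH_ENABLED and both function names are hypothetical; payment-svc may well use a dedicated flag service instead. The flag defaults to off, so a missing or malformed value disables the batch rather than running it.

```python
import os
from typing import Mapping, Optional


def settlement_batch_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the (hypothetical) SETTLEMENT_BATCH_ENABLED flag; anything but a truthy value is off."""
    env = os.environ if env is None else env
    value = env.get("SETTLEMENT_BATCH_ENABLED", "false")
    return value.strip().lower() in {"1", "true", "yes"}


def run_settlement_batch(env: Optional[Mapping[str, str]] = None) -> str:
    """Entry point the scheduler would call; returns what happened for logging."""
    if not settlement_batch_enabled(env):
        return "skipped"
    # The real settlement work would go here.
    return "ran"


print(run_settlement_batch({}))                                    # flag unset
print(run_settlement_batch({"SETTLEMENT_BATCH_ENABLED": "true"}))  # flag on
```

Passing the environment as a parameter keeps the gate testable without touching the process environment. With a gate like this, on-call could stop the batch by changing configuration instead of rolling back the whole release.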
Similar past incidents
lexical match (pg_trgm). All five matches share the summary: p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 98%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
- 85%: [Scenario] Payment service connection pool exhaustion after batch job deploy
- 78%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
- 49%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
- 47%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy