[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:48:25 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC: payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03: CS reports failed checkout volume rising
- 14:04: Pager: PaymentSvcErrorRateHigh
- 14:06: "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment service p99 latency spiked from 120ms to 4.8s and error rate rose to 12% due to a new cron job in v2.41 that runs a full table scan on `ledger_entries` without an index, exhausting the Postgres connection pool. Customers are experiencing failed checkouts.
Severity reasoning: User-facing outage with error rate >1% for >5 minutes (12% for ~10 minutes) and revenue path broken (failed checkouts).
Root cause hypotheses
- high: New cron job in payment-svc v2.41 runs SELECT * FROM ledger_entries WHERE status='pending' every 30s, causing full table scans and long-running queries that exhaust DB connections.
Evidence: Deploy at 13:50 UTC added the cron job; slow query log shows this query running every 30s with ~2.4s execution; active_connections at max (500) with 87 waiting queries.
- medium: Connection pool leak in payment-svc due to a bug in v2.41, causing connections not to be returned to the pool.
Evidence: Active connections are at max, but the logs show no evidence of leaked connections (no 'connection not closed' warnings).
- low: External traffic spike causing increased load on payment-svc and the DB.
Evidence: No traffic spike observed in metrics; latency and error rate correlate with deploy time, not traffic.
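As a rough sanity check on the first hypothesis, the average number of connections a periodic query holds is its duration divided by its interval. The sketch below plugs in the figures from the incident context (~2.4s per execution, one run every 30s). It assumes each run uses a single connection and releases it when the query ends, which the logs do not confirm.

```shell
# Average connections held by a periodic query = duration / interval.
# Assumption: one connection per run, released as soon as the query finishes.
avg_connections() {
  local duration_s="$1" interval_s="$2"
  awk -v d="$duration_s" -v i="$interval_s" 'BEGIN { printf "%.2f\n", d / i }'
}

# Figures from the incident context: ~2.4s per execution, one run every 30s.
avg_connections 2.4 30   # prints 0.08
```

A result this far below the 500-connection limit suggests that a single stream of this query may not explain saturation by itself. It is worth checking how many instances run the job concurrently and whether connections are released after each run, which overlaps with the medium hypothesis.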
Investigation checklist
- Check if the cron job is the source of the slow query by examining its SQL and execution plan.
kubectl exec -n prod deploy/payment-svc -- cat /app/cron/settlement.sh | grep -i 'SELECT.*ledger_entries'
Expected: The SELECT query with WHERE status='pending'.
- Confirm the slow query is consuming connections by checking pg_stat_activity for the query.
psql -h postgres-primary -U postgres -c "SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE query LIKE '%ledger_entries%' AND state != 'idle' ORDER BY query_start;"
Expected: Multiple rows with the same query, old query_start timestamps, and state 'active'.
- Check if the cron job is scheduled to run every 30 minutes (as per changelog) and if it's currently running.
kubectl get cronjob -n prod -l app=payment-svc -o jsonpath='{.items[*].spec.schedule}'
Expected: Should show '*/30 * * * *' or similar.
- Verify the DB connection pool settings in payment-svc config.
kubectl exec -n prod deploy/payment-svc -- cat /app/config.yaml | grep -A5 'connection_pool'
Expected: Should show max pool size (likely 500) and timeout settings.
- Check if there are any other long-running queries or locks.
psql -h postgres-primary -U postgres -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;"
Expected: The slow query should be at the top with duration >2s.
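The first checklist item mentions the execution plan, and the exhaustion question benefits from a per-state connection count. The following is a minimal sketch of both, assuming the same host and user as the commands above. The `PSQL` override variable is an addition for illustration so the commands can be dry-run (for example with `PSQL=echo`).

```shell
# PSQL can be overridden (e.g. PSQL=echo) to print the command instead of running it.
PSQL="${PSQL:-psql}"

# Show the planner's chosen plan without executing the query.
# A "Seq Scan on ledger_entries" node would confirm the full table scan.
explain_pending_scan() {
  "$PSQL" -h postgres-primary -U postgres -c \
    "EXPLAIN SELECT * FROM ledger_entries WHERE status='pending';"
}

# Count connections per application and state to see what holds the slots.
connections_by_state() {
  "$PSQL" -h postgres-primary -U postgres -c \
    "SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY count(*) DESC;"
}
```

A large number of rows in state `idle in transaction` would point toward the connection-leak hypothesis rather than the slow query.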
Mitigation plan
Immediately suspend the cron job (set spec.suspend to true on the CronJob) so that no new runs start. Suspending does not stop a run that is already in progress.
Risk: Nightly settlement batch will not run; may cause delayed reconciliation but no data loss.
Rollback: Unsuspend the cron job (set spec.suspend back to false).
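Suspending and resuming could be sketched as below, assuming the batch runs as a Kubernetes CronJob in the prod namespace. The CronJob name `settlement-batch` is hypothetical (the source does not name it), and the `KUBECTL` override exists only to allow a dry run.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"
# Hypothetical CronJob name; find the real one with `kubectl get cronjob -n prod`.
CRONJOB_NAME="${CRONJOB_NAME:-settlement-batch}"

# Suspend or resume the CronJob by toggling spec.suspend.
set_cronjob_suspend() {
  local suspend="$1"   # "true" to suspend, "false" to resume
  "$KUBECTL" patch cronjob "$CRONJOB_NAME" -n prod \
    -p "{\"spec\":{\"suspend\":${suspend}}}"
}
```

Calling `set_cronjob_suspend true` applies the mitigation and `set_cronjob_suspend false` is the rollback.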
Kill the long-running queries to free up connections.
Risk: May cause incomplete settlement processing; safe as queries are SELECT only.
Rollback: No rollback needed; queries can be re-run later.
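One way to free the connections is sketched below, assuming the same host and user as the investigation commands. `pg_cancel_backend` stops the running query and keeps the session, while `pg_terminate_backend` closes the session as well. The function names and the `PSQL` override are illustrative additions.

```shell
# PSQL can be overridden (e.g. PSQL=echo) for a dry run.
PSQL="${PSQL:-psql}"

# Build the SQL that cancels or terminates active ledger_entries queries.
# mode=cancel    -> pg_cancel_backend (stops the query, keeps the connection)
# mode=terminate -> pg_terminate_backend (closes the connection too)
kill_sql() {
  local mode="${1:-cancel}"
  case "$mode" in
    cancel|terminate) ;;
    *) echo "mode must be cancel or terminate" >&2; return 1 ;;
  esac
  printf "SELECT pid, pg_%s_backend(pid) FROM pg_stat_activity WHERE query LIKE '%%ledger_entries%%' AND state = 'active' AND pid <> pg_backend_pid();" "$mode"
}

# Run the statement; excludes the current session via pg_backend_pid().
kill_ledger_queries() {
  local sql
  sql="$(kill_sql "$1")" || return 1
  "$PSQL" -h postgres-primary -U postgres -c "$sql"
}
```

Starting with `kill_ledger_queries cancel` is the gentler option; `kill_ledger_queries terminate` may be needed if sessions stay connected and keep holding slots.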
Add an index on ledger_entries.status to prevent full table scans.
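Risk: A plain CREATE INDEX blocks writes to the table for the duration of the build; use CREATE INDEX CONCURRENTLY to avoid blocking writes. A concurrent build takes longer, cannot run inside a transaction block, and leaves an invalid index behind if it fails.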
Rollback: Drop the index if it causes issues.
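The index step and its rollback might look like the sketch below, assuming the same host and user as the investigation commands. The index name `idx_ledger_entries_status` is hypothetical, and the `PSQL` override is only there to allow a dry run.

```shell
# PSQL can be overridden (e.g. PSQL=echo) for a dry run.
PSQL="${PSQL:-psql}"
# Hypothetical index name.
INDEX_NAME="${INDEX_NAME:-idx_ledger_entries_status}"

# Build the index without blocking writes. CONCURRENTLY cannot run inside a
# transaction block, so it is sent as a single standalone statement.
create_status_index() {
  "$PSQL" -h postgres-primary -U postgres -c \
    "CREATE INDEX CONCURRENTLY ${INDEX_NAME} ON ledger_entries (status);"
}

# List invalid indexes, e.g. one left behind by a failed concurrent build.
list_invalid_indexes() {
  "$PSQL" -h postgres-primary -U postgres -c \
    "SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;"
}

# Rollback: drop the index, again without blocking writes.
drop_status_index() {
  "$PSQL" -h postgres-primary -U postgres -c \
    "DROP INDEX CONCURRENTLY IF EXISTS ${INDEX_NAME};"
}
```

If this query only ever filters on pending rows, a partial index (`WHERE status = 'pending'`) may be smaller. That is a design option to evaluate, not something the incident data specifies.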
Customer impact
Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of checkout attempts are failing with HTTP 500 errors. Affected users see an error message and cannot complete purchases. No ETA for full recovery yet.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 13:50: Deploy payment-svc v2.41 with new cron job
- 14:02: p99 latency spikes to 4.8s, error rate rises
- 14:03: CS reports failed checkouts
- 14:04: Pager alert
- 14:06: Initial mis-diagnosis: DB CPU fine
- [FILL IN]: Root cause identified
- [FILL IN]: Mitigation applied
Impact
- 12% error rate on payment-svc for ~10 minutes
- Failed checkouts for customers
- No data loss
Root Cause
New cron job in v2.41 runs a full table scan on ledger_entries without an index, exhausting DB connection pool.
Detection
Pager alert for high error rate; customer reports.
Response
- Disabled cron job
- Killed long-running queries
- Added index
What Went Well
- Quick deploy rollback capability
- DB metrics available
What Went Poorly
- Initial mis-diagnosis (DB CPU fine)
- No pre-deploy review of cron job SQL
- Missing index on status column
Action Items
- [FILL IN]
Follow-ups
- P0: Add index on ledger_entries.status to prevent full table scans. (Database team)
- P1: Review and test all cron jobs in staging before production deploy. (Service owner)
- P1: Add monitoring for long-running queries and connection pool saturation. (Platform team)
- P2: Implement pre-deploy SQL review for any new queries in deployments. (Service owner)
- P2: Update runbook for payment-svc to include steps for connection pool exhaustion. (On-call SRE)
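The connection-saturation monitoring follow-up could start from a check like the sketch below. It is a rough illustration, not a production alert: the 80% threshold, the function names, and the `PSQL` override are all assumptions, and the host and user are taken from the investigation commands.

```shell
# PSQL can be overridden for testing.
PSQL="${PSQL:-psql}"
# Hypothetical alert threshold: percent of max_connections in use.
THRESHOLD_PCT="${THRESHOLD_PCT:-80}"

# Integer percentage of connections in use.
saturation_pct() {
  local used="$1" max="$2"
  echo $(( used * 100 / max ))
}

# Read current usage from Postgres; -A -t gives bare values for parsing.
check_saturation() {
  local used max pct
  used="$("$PSQL" -h postgres-primary -U postgres -At -c "SELECT sum(numbackends) FROM pg_stat_database;")" || return 2
  max="$("$PSQL" -h postgres-primary -U postgres -At -c "SHOW max_connections;")" || return 2
  pct="$(saturation_pct "$used" "$max")"
  if [ "$pct" -ge "$THRESHOLD_PCT" ]; then
    echo "WARN: ${pct}% of max_connections in use (${used}/${max})"
    return 1
  fi
  echo "OK: ${pct}% of max_connections in use (${used}/${max})"
}
```

With the figures from this incident, `saturation_pct 500 500` yields 100, so a threshold well below that would have fired before clients started seeing "too many clients already".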
Similar past incidents
lexical match (pg_trgm)
- 89%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 82%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 76%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 48%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 45%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts