[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svccreated: 6/2/2026, 11:34:53 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500). Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary. Postgres metrics: - CPU: 35% (normal) - active_connections: 500 / 500 (max_connections) - waiting_queries: 87 - slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution) Deploy history: - 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)" - No infra/DB changes in past 7 days. On-call notes (Slack): - 14:03 — CS reports failed checkout volume rising - 14:04 — Pager: PaymentSvcErrorRateHigh - 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment service p99 latency spiked to 4.8s and error rate rose to 12% due to a new cron job in v2.41 that runs a full table scan on ledger_entries every 30s, exhausting the PostgreSQL connection pool (max 500). Customers experience failed checkouts.
Severity reasoning: User-facing outage with revenue impact (failed checkouts), high error rate (12%), and full connection pool exhaustion. SEV1 per rubric.
deepseek-chat·prompt v1·output: en·9341ms·1358↑ / 1364↓ tok·$0.00187
Root cause hypotheses
- highNew cron job in payment-svc v2.41 runs SELECT * FROM ledger_entries WHERE status='pending' every 30s, causing full table scan and long-running queries that hold connections.
Evidence: Deploy at 13:50 UTC introduced cron with */30 * * * *; slow query log shows this query running every 30s with ~2.4s execution; active_connections at max (500) with 87 waiting queries.
- mediumConnection pool leak in payment-svc v2.41, where connections are not released after query completion.
Evidence: Active connections at max and waiting queries; but slow query log shows long-running queries, suggesting connections are held by those queries rather than leaked.
- lowSudden traffic spike causing normal queries to queue up and exhaust connections.
Evidence: No evidence of traffic spike in context; error rate and latency correlate exactly with deploy time.
Investigation checklist
- Check if the new cron job is running and its query plan.
kubectl exec -n prod deploy/payment-svc -- sh -c 'ps aux | grep settlement' && kubectl logs -n prod deploy/payment-svc --tail=50 | grep -i cronExpected: See settlement cron process; logs show cron trigger every 30 min.
- Confirm the slow query and its impact on connections.
psql -h postgres-primary -U postgres -c "SELECT pid, state, query_start, query FROM pg_stat_activity WHERE query LIKE '%ledger_entries%' AND state = 'active';"Expected: Multiple active queries with long query_start times (e.g., >2s).
- Check if an index exists on ledger_entries.status.
psql -h postgres-primary -U postgres -c "\d ledger_entries" | grep -i indexExpected: No index on status column.
- Verify connection pool exhaustion.
psql -h postgres-primary -U postgres -c "SELECT count(*) FROM pg_stat_activity;"Expected: Count near 500.
Mitigation plan
Immediately disable the cron job by scaling payment-svc replicas to 0 temporarily or using a feature flag to disable the settlement batch.
Risk: Settlement batch will not run; manual settlement may be needed later. No customer impact from disabling cron.
Rollback: Re-enable cron by scaling replicas back up or toggling feature flag.
Kill the long-running queries to free connections.
Risk: Queries are read-only; killing them is safe. May cause temporary incomplete results for the batch.
Rollback: None needed; queries will restart on next cron run.
Add an index on ledger_entries.status to prevent full table scans.
Risk: Index creation may lock table briefly; use CONCURRENTLY to avoid downtime.
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
Customer impact
Customers are unable to complete checkout, receiving HTTP 500 errors. Estimated time to resolution: 15 minutes after mitigation actions.
Postmortem draft
Postmortem: Payment Service Latency Spike (SEV1)
Summary: [2-3 sentences]
Timeline:
- 13:50 UTC - Deploy v2.41 with settlement cron
- 14:02 UTC - Latency spike, error rate increase
- 14:04 UTC - Pager triggered
- [Mitigation time] - Cron disabled, queries killed, index added
Impact: 12% error rate, p99 latency 4.8s, failed checkouts
Root Cause: New cron job performing full table scan without index, exhausting connection pool
What Went Well: Quick detection via monitoring, clear logs
What Went Poorly: No pre-deploy review of cron query, no index on status column
Action Items:
- [ ] Add index on ledger_entries.status
- [ ] Add query performance review to deploy checklist
- [ ] Set up connection pool monitoring alert
- [ ] Implement feature flag for cron jobs
Follow-ups
- P0Add index on ledger_entries.status (CONCURRENTLY)— service owner
- P1Review deploy v2.41 changelog and ensure no other hidden changes— service owner
- P1Add monitoring alert for active connections > 80% of max_connections— platform team
- P2Implement feature flag for cron jobs to allow quick disable— service owner
- P2Update deploy checklist to include query plan review for new queries— platform team
Similar past incidents
lexical match (pg_trgm)
- 86%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 86%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 80%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 50%
[Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 48%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts