[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:48:38 PM
Raw incident context
Time: 14:02 UTC.
payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC: payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03: CS reports failed checkout volume rising
- 14:04: Pager: PaymentSvcErrorRateHigh
- 14:06: "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment service (payment-svc) is in a severe outage: p99 latency has spiked from 120ms to 4.8s and 12% of requests fail with HTTP 500. The likely cause is the settlement batch job added in v2.41, which runs a full table scan on ledger_entries every 30s while Postgres sits at its connection limit (500 / 500 max_connections). A significant share of customers cannot complete checkout.
Severity reasoning: User-facing outage with error rate >1% for >5 minutes (12% for ~10 minutes) and revenue path broken (failed checkouts).
Root cause hypotheses
- high: New cron job in payment-svc v2.41 runs SELECT * FROM ledger_entries WHERE status='pending' every 30s, causing full table scan and blocking connections.
Evidence: Deploy at 13:50 UTC added cron job; slow query log shows this query running every 30s with ~2.4s execution; active_connections at max (500).
- medium: Connection pool leak in payment-svc due to bug in v2.41, not releasing connections after query.
Evidence: Application logs show 'too many clients already' but no direct evidence of leak; could be secondary effect.
- low: Postgres primary overloaded by other services due to shared database.
Evidence: No evidence of other services; CPU is low (35%), only waiting queries are from the slow query.
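As a rough sanity check on the first hypothesis, the average number of connections held by a periodic query can be estimated as execution time divided by interval (Little's law). The sketch below uses the figures from the incident context. The function name and the `instances` parameter (how many processes run the schedule independently) are assumptions for illustration, not values taken from the deployment.

```python
def avg_connections_held(interval_s: float, duration_s: float, instances: int = 1) -> float:
    """Average connections occupied by a query that starts every `interval_s`
    seconds on each of `instances` schedulers and runs for `duration_s` seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return instances * duration_s / interval_s


# Figures from the slow query log: one execution every 30s, ~2.4s each.
single = avg_connections_held(interval_s=30, duration_s=2.4)
print(f"one scheduler: {single:.2f} connections on average")

# Hypothetical: how many independent schedulers it would take to fill 500 slots.
needed = 500 / single
print(f"schedulers needed to hold 500 connections: {needed:.0f}")
```

Under these assumptions the query by itself occupies well under one connection on average, so saturation at 500 / 500 probably needs a contributing factor (for example connections that are not released, or requests queueing behind the scan). That is worth checking before ruling out the second hypothesis.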
Investigation checklist
- Check the cron job definition in the new deployment.
kubectl get cronjob -n prod -l app=payment-svc -o yaml | grep -A10 'schedule:'
Expected: Should show schedule */30 * * * * and command containing 'ledger_entries'.
- Confirm the slow query is the one from the cron job.
kubectl logs -n prod -l app=payment-svc --since=15m | grep -E 'SELECT.*ledger_entries.*status' | head -5
Expected: Log lines showing the query execution.
- Check current active connections and waiting queries on Postgres.
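SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';
Expected: Total connections near 500 (max_connections). A large share in 'idle' or 'idle in transaction' points toward the connection leak hypothesis rather than the slow query alone. The second query counts only backends blocked on locks, so it may be low even while new clients are being refused.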
- Verify no other recent deployments or changes.
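kubectl rollout history deployment/payment-svc -n prod
Expected: The latest revision is the v2.41 rollout, with no newer revisions. The output lists revision numbers and change causes, not version tags or timestamps, so confirm the image with --revision=<n>.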
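The first checklist step is also a chance to reconcile the schedule with the observed cadence. In a standard five-field cron expression the first field is minutes, so */30 * * * * fires twice an hour, while the slow query log shows the query every 30s. A minimal sketch of that comparison follows. The helper names are hypothetical, and the parser handles only `*`, `*/N`, and comma-separated minute values.

```python
def expand_minute_field(field: str) -> list[int]:
    """Return the minutes (0-59) matched by a cron minute field.

    Supports only '*', '*/N', and comma-separated integers; ranges are
    deliberately left out of this sketch.
    """
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    minutes = sorted({int(part) for part in field.split(",")})
    if any(m < 0 or m > 59 for m in minutes):
        raise ValueError("minute out of range")
    return minutes


def runs_per_hour(cron_expr: str) -> int:
    """Count firings per hour, assuming the other four fields are all '*'."""
    fields = cron_expr.split()
    if len(fields) != 5:
        raise ValueError("expected a five-field cron expression")
    return len(expand_minute_field(fields[0]))


scheduled = runs_per_hour("*/30 * * * *")  # schedule quoted in the changelog
observed = 3600 // 30                       # one execution every 30s per the slow query log
print(f"scheduled runs/hour: {scheduled}, observed runs/hour: {observed}")
```

A mismatch of this size suggests the query is triggered by something other than the documented schedule, or that the interval is interpreted in seconds somewhere. Either reading should be confirmed from the job definition rather than assumed.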
Mitigation plan
Roll back payment-svc to the previous version (v2.40) to remove the cron job.
Risk: Rollback may cause brief downtime; ensure no data loss from in-flight transactions.
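Rollback: Re-deploy v2.41 by running: kubectl rollout undo deployment/payment-svc -n prod --to-revision=<n>, where <n> is the v2.41 revision number shown by kubectl rollout history (do not assume it is 1).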
Kill the offending cron job if rollback is not immediate.
Risk: Cron job may be recreated by Kubernetes; temporary fix.
Rollback: Re-apply the cron job manifest if needed.
Add an index on ledger_entries.status to prevent full table scan.
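Risk: A plain CREATE INDEX blocks writes to ledger_entries for the duration of the build; use CREATE INDEX CONCURRENTLY and rehearse on a staging copy first (an index cannot be built on a read-only replica).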
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
Customer impact
Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of checkout attempts are failing with HTTP 500 errors. Affected users see an error page or timeout. No ETA yet.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 13:50: Deploy payment-svc v2.41 with new cron job
- 14:02: p99 latency spikes to 4.8s, error rate rises
- 14:03: CS reports failed checkouts
- 14:04: Pager alert
- 14:06: Initial mis-diagnosis (DB CPU fine)
- [FILL IN] Rollback initiated
- [FILL IN] Incident resolved
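The durations requested under Impact can be derived from the timeline once the remaining entries are filled in. A small sketch, assuming all timestamps fall on the same UTC day and use the HH:MM form shown above (the helper name is hypothetical):

```python
from datetime import datetime


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes between two same-day UTC timestamps in HH:MM form."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the timeline above.
deploy, onset, page = "13:50", "14:02", "14:04"
print("deploy to onset:", minutes_between(deploy, onset), "min")
print("onset to page:", minutes_between(onset, page), "min")
```

The same helper can compute the outage duration from 14:02 to the resolution time when that entry is known.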
Impact
- 12% error rate for ~[FILL IN] minutes
- Failed checkouts for [FILL IN] users
- Revenue impact: [FILL IN]
Root Cause
New cron job in v2.41 runs a full table scan on ledger_entries every 30s, exhausting Postgres connection pool.
Detection
Pager alert and customer reports.
Response
[FILL IN]
What Went Well
[FILL IN]
What Went Poorly
- Initial mis-diagnosis focused on DB CPU
- No pre-deploy review of cron job impact
- Missing index on status column
Action Items
- Add index on ledger_entries.status
- Add pre-deploy check for new cron jobs
- Improve monitoring for connection pool exhaustion
Follow-ups
- P0: Add index on ledger_entries.status column (Database team)
- P1: Review and restrict cron job deployments to require approval (Platform team)
- P1: Add alert on active connections approaching max_connections (Observability team)
- P2: Update runbook for payment-svc with connection pool exhaustion steps (Service owner)
- P1: Conduct postmortem and document timeline (On-call SRE)
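The connection alert follow-up could be prototyped as a simple ratio check before it is wired into the monitoring stack. This is a sketch only: the function name and the 80% / 95% thresholds are assumptions chosen for illustration, not values from an existing alert rule.

```python
def connection_alert_level(active: int, max_connections: int,
                           warn_ratio: float = 0.80,
                           crit_ratio: float = 0.95) -> str:
    """Classify connection usage as 'ok', 'warning', or 'critical'."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage = active / max_connections
    if usage >= crit_ratio:
        return "critical"
    if usage >= warn_ratio:
        return "warning"
    return "ok"


# 500 / 500 is the reading from this incident; 410 is a hypothetical earlier sample.
print(connection_alert_level(500, 500))
print(connection_alert_level(410, 500))
```

A warning tier below the hard limit matters here because Postgres refuses new clients only once max_connections is reached, by which point checkouts are already failing.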
Similar past incidents
lexical match (pg_trgm)
- 78%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 76%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 74%: [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 47%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 42%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts