[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 5/25/2026, 10:20:04 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment service p99 latency spiked from 120ms to 4.8s and error rate rose to 12% due to a new cron job in v2.41 exhausting the Postgres connection pool (max 500) with a slow full-table-scan query on ledger_entries. Customers are experiencing failed checkouts.
Severity reasoning: User-facing outage with error rate >1% for >5 min (12% for ~10 min) and revenue path broken (failed checkouts).
deepseek-chat·prompt v2·output: en·11153ms
Root cause hypotheses
- high: New cron job in payment-svc v2.41 runs a slow query (SELECT * FROM ledger_entries WHERE status='pending') every 30s, consuming connections and blocking other queries.
Evidence: Deploy at 13:50 UTC added cron job; slow query log shows this query running every 30s with ~2.4s execution; active_connections at max (500).
- medium: Connection pool leak in payment-svc due to a bug in v2.41, causing connections not to be released.
Evidence: Application logs show 'too many clients already' but no direct evidence of leak; connection count is exactly max_connections.
- low: External traffic spike causing legitimate connections to exceed max_connections.
Evidence: No traffic spike reported; error rate correlates with deploy time, not traffic pattern.
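A quick occupancy estimate can help weigh the first two hypotheses against each other. The sketch below applies Little's law (average connections held = start rate × hold time) to the figures from the slow query log. The function name and the `runners` parameter (how many processes fire the query on the same cadence) are assumptions for illustration; the incident data does not say how many instances run the job.

```python
def avg_connections_held(
    exec_seconds: float, interval_seconds: float, runners: int = 1
) -> float:
    """Average number of connections busy with a periodic query.

    Little's law: L = lambda * W, where lambda is how often the query
    starts (runners / interval) and W is how long each run holds a
    connection (exec_seconds).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if runners < 0:
        raise ValueError("runners must not be negative")
    return runners * exec_seconds / interval_seconds


# Figures from the slow query log: ~2.4s per execution, one run every 30s.
held = avg_connections_held(exec_seconds=2.4, interval_seconds=30.0)
pool_share = held / 500  # max_connections from the Postgres metrics

print(f"average connections held: {held:.2f}")
print(f"share of the 500-connection pool: {pool_share:.4%}")
```

If the estimate stays far below the pool size for any plausible `runners`, the scheduled query is unlikely to fill the pool by itself. In that case it is worth looking for an amplifier, such as overlapping runs, connections that are not returned, or request retries piling up behind the slow query.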
Investigation checklist
- Check the cron job definition in the latest deployment.
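kubectl get cronjob -n prod -l app=payment-svc -o yaml | grep -A5 'schedule\|command'
Expected: A cron job with schedule */30 * * * * and a command that runs the settlement batch. Note that */30 * * * * fires every 30 minutes, while the slow query log shows the query every 30s, so confirm what actually drives the 30s cadence.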
- Identify the exact query and how much execution time it consumes (requires the pg_stat_statements extension).
SELECT query, calls, total_exec_time, rows, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
Expected: The slow query with high total_exec_time and mean_exec_time > 2s (both columns are reported in milliseconds, so > 2000).
- Check if the query has an index on status.
\d+ ledger_entries
Expected: No index on status column.
- Check current active connections and their states.
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
Expected: Many connections in 'active' state running the slow query.
- Check whether a cron job pod is still running.
kubectl get pods -n prod -l app=payment-svc --field-selector status.phase=Running | grep -iE 'cron|settlement'
Expected: A pod with name containing 'cron' or 'settlement'.
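The state breakdown from pg_stat_activity can help separate the slow-query hypothesis from the leak hypothesis: connections busy with the batch query show up as 'active', while leaked connections tend to sit in 'idle' or 'idle in transaction'. A minimal Python sketch of that triage follows. The function name and the sample counts are made up for illustration and are not incident data.

```python
from typing import Dict, Iterable, Optional, Tuple


def summarize_states(
    rows: Iterable[Tuple[Optional[str], int]], max_connections: int
) -> Dict[str, float]:
    """Summarize (state, count) rows from pg_stat_activity."""
    counts: Dict[str, int] = {}
    for state, n in rows:
        if state is None:
            # A NULL state usually marks a background process, not a client.
            continue
        counts[state] = counts.get(state, 0) + n

    total = sum(counts.values())
    # Covers 'idle', 'idle in transaction' and 'idle in transaction (aborted)'.
    idle = sum(n for s, n in counts.items() if s.startswith("idle"))
    return {
        "total": total,
        "utilization": total / max_connections,
        "active_share": counts.get("active", 0) / total if total else 0.0,
        "idle_share": idle / total if total else 0.0,
    }


# Hypothetical sample, shaped like the output of the GROUP BY state query.
sample = [("active", 410), ("idle in transaction", 60), ("idle", 30), (None, 6)]
summary = summarize_states(sample, max_connections=500)
print(summary)
if summary["utilization"] >= 0.8:
    print("pool above 80% of max_connections")
```

A high active share points toward slow queries holding connections. A high idle share at full utilization points toward connections that were opened and never returned.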
Mitigation plan
Suspend the cron job immediately: kubectl patch cronjob -n prod <cronjob-name> -p '{"spec":{"suspend":true}}'. Suspending only prevents future runs; it does not stop a run that has already started.
Risk: Nightly settlement batch will not run; financial reconciliation may be delayed. No data loss.
Rollback: Resume the cron job: kubectl patch cronjob -n prod <cronjob-name> -p '{"spec":{"suspend":false}}'
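Increase max_connections temporarily to 1000 to give payment queries headroom. Changing max_connections requires a Postgres restart, so treat this as a last resort after suspending the job and killing the slow queries.
Risk: The restart drops every open connection and briefly interrupts all traffic. Each additional connection also consumes memory, so monitor CPU and memory. Safe only if the DB has headroom.
Rollback: Set max_connections back to 500 with ALTER SYSTEM SET max_connections = 500; then restart Postgres. SELECT pg_reload_conf(); does not apply this setting.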
Kill the slow queries to free connections immediately.
Risk: The in-flight settlement batch run is interrupted and must be re-run. No data loss is expected, since the query is a read-only SELECT.
Rollback: No rollback needed; queries can be re-run.
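The plan does not spell out how to kill the slow queries. One possible approach, sketched in Python below, is to terminate matching backends through pg_terminate_backend. The helper name, the ILIKE pattern, and the 2-second threshold are assumptions for illustration. The `%s` placeholders assume a psycopg-style DB-API driver, and the role used needs permission to signal the target backends.

```python
from typing import Any, List, Tuple

# Terminates active backends whose query text matches the pattern and that
# have been running longer than the threshold. The session running this
# statement is excluded via pg_backend_pid().
TERMINATE_SQL = """
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND query ILIKE %s
  AND now() - query_start > make_interval(secs => %s)
"""


def terminate_slow_queries(
    conn: Any,
    pattern: str = "%FROM ledger_entries WHERE status%",
    min_seconds: float = 2.0,
) -> List[Tuple[int, bool]]:
    """Return (pid, signal_sent) for every backend that matched.

    `conn` is any open DB-API connection to postgres-primary.
    """
    cur = conn.cursor()
    try:
        cur.execute(TERMINATE_SQL, (pattern, min_seconds))
        return [(pid, ok) for pid, ok in cur.fetchall()]
    finally:
        cur.close()
```

With the pool at 500 / 500, a regular login may itself be refused. Postgres keeps a few slots back for superusers (superuser_reserved_connections), so the call may need to be made from such a session. Suspending the cron job first avoids the terminated queries simply being started again.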
Customer impact
Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of requests are failing with HTTP 500; affected users see an error message and cannot complete the purchase. Exposure: any user attempting checkout during the incident window can be affected, although most requests still succeed.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 13:50 - Deploy payment-svc v2.41 with new cron job
- 14:02 - p99 latency spikes to 4.8s, error rate rises
- 14:03 - CS reports failed checkouts
- 14:04 - Pager alert
- 14:06 - Initial mis-diagnosis: DB CPU fine
- [FILL IN] - Mitigation actions taken
Impact
- p99 latency 4.8s (up from 120ms)
- 12% error rate
- Failed checkouts for customers
Root Cause
New cron job in v2.41 runs a full-table-scan query on ledger_entries every 30s, exhausting the Postgres connection pool (max 500) and blocking legitimate payment queries.
Detection
Pager alert for PaymentSvcErrorRateHigh at 14:04 UTC.
Response
- Suspended cron job
- Increased max_connections
- Killed slow queries
What Went Well
- Quick detection via monitoring
- Team collaboration in Slack
What Went Poorly
- Initial mis-diagnosis (DB CPU fine)
- No pre-deploy review of cron job query performance
- Missing index on status column
Action Items
- [FILL IN]
Follow-ups
- P0: Add index on ledger_entries.status to prevent full table scans. Owner: Database team
- P0: Review and optimize the settlement batch query to use index or reduce frequency. Owner: Payment service team
- P1: Add pre-deploy performance regression tests for new queries (e.g., EXPLAIN ANALYZE in CI). Owner: Platform team
- P1: Set up alerting on connection pool usage (e.g., >80% max_connections). Owner: Observability team
- P2: Document runbook for Postgres connection exhaustion incident. Owner: SRE team
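The index and CI follow-ups can be combined into a plan regression check: run the query through the planner and fail the build if it does not use the expected index. The sketch below shows the idea in Python with an in-memory SQLite database as a stand-in for Postgres, so it runs without infrastructure. The table columns, the index name, and the helper names are hypothetical.

```python
import sqlite3
from typing import List

QUERY = "SELECT * FROM ledger_entries WHERE status = 'pending'"
INDEX_NAME = "idx_ledger_entries_status"  # hypothetical index name


def query_plan(conn: sqlite3.Connection, sql: str) -> List[str]:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def uses_index(plan: List[str], index_name: str) -> bool:
    """True if any plan step mentions the given index."""
    return any(index_name in step for step in plan)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger_entries ("
    "id INTEGER PRIMARY KEY, status TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
)
# Synthetic rows: one in fifty is 'pending', the rest are 'settled'.
conn.executemany(
    "INSERT INTO ledger_entries (status, amount_cents) VALUES (?, ?)",
    [("pending" if i % 50 == 0 else "settled", i) for i in range(5000)],
)

plan_before = query_plan(conn, QUERY)  # no index yet: expect a table scan
conn.execute(f"CREATE INDEX {INDEX_NAME} ON ledger_entries (status)")
plan_after = query_plan(conn, QUERY)   # planner can now search the index

print("before:", plan_before)
print("after: ", plan_after)
```

Against Postgres, the same check would read the output of EXPLAIN (FORMAT JSON) and look for a "Seq Scan" node on ledger_entries. On a 12M-row table, CREATE INDEX CONCURRENTLY builds the index without blocking writes. Since only pending rows are queried, a partial index (WHERE status = 'pending') may be a smaller alternative.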
Similar past incidents
lexical match (pg_trgm)
- 84%: [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 77%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 48%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 39%: [datadog] checkout p99 8s
checkout p99 8s