[Scenario] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 5/25/2026, 8:43:25 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC: payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03: CS reports failed checkout volume rising
- 14:04: Pager: PaymentSvcErrorRateHigh
- 14:06: "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
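Payment service p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12% after payment-svc v2.41 shipped a new settlement batch job. The job's query, a ~2.4s full table scan of ledger_entries, appears in the slow query log every 30s, and Postgres is at its connection limit (500/500 max_connections), so new connections are rejected. Customers are experiencing failed checkouts. Open question: the changelog's cron expression (*/30 * * * *) means every 30 minutes, not every 30 seconds, and does not match a "nightly" job, so the real trigger interval and the number of instances running the job still need to be confirmed.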
Severity reasoning: User-facing outage with >1% error rate (12%) for >5 minutes and revenue path broken (failed checkouts).
deepseek-chat·prompt v2·output: en·12893ms
Root cause hypotheses
- high: New settlement job in payment-svc v2.41 runs SELECT * FROM ledger_entries WHERE status='pending', a ~2.4s full table scan that holds a DB connection for its whole duration. A single query every 30s occupies far less than one connection on average, so this explains 500/500 only if many instances run the job concurrently or runs pile up behind locks or retries.
Evidence: Deploy at 13:50 UTC added the job; slow query log shows this query running every 30s with ~2.4s execution; active_connections at max (500); app logs show 'too many clients already'.
- medium: Connection pool leak in payment-svc v2.41 due to a bug in the new code, causing connections not to be released.
Evidence: Active connections at max; no other DB changes; but CPU is low, suggesting connections are idle or waiting.
- low: External traffic spike causing increased DB connections.
Evidence: No traffic spike reported; error rate correlates with deploy time.
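A quick sanity check on the top hypothesis, using only the numbers from the raw incident context (500 max_connections, one query per 30s, ~2.4s per run). This is a rough sketch, not a model of the real workload: the function names are made up for illustration, the cron expansion handles only the `*` and `*/N` minute forms, and the concurrency estimate uses Little's law (average concurrency = arrival rate × duration) under the assumption that runs do not block each other.

```python
def minute_field_fires(field: str) -> list[int]:
    """Expand a cron minute field of the form '*' or '*/N' into the minutes it fires on."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(0, 60, step))
    raise ValueError(f"unsupported minute field: {field!r}")


def avg_busy_connections(interval_ms: int, duration_ms: int, runners: int = 1) -> float:
    """Little's law: average connections held = runners * duration / interval."""
    return runners * duration_ms / interval_ms


def runners_to_saturate(max_connections: int, interval_ms: int, duration_ms: int) -> int:
    """Smallest number of independent runners whose average load reaches max_connections."""
    needed = max_connections * interval_ms
    return (needed + duration_ms - 1) // duration_ms  # integer ceiling division


if __name__ == "__main__":
    # The changelog's schedule: minute field '*/30' fires twice per hour.
    print("minutes fired per hour:", minute_field_fires("*/30"))
    # One runner issuing a 2.4s query every 30s.
    print("avg connections, one runner:", avg_busy_connections(30_000, 2_400))
    # Runners needed for this query alone to average 500 busy connections.
    print("runners to saturate:", runners_to_saturate(500, 30_000, 2_400))
```

If the job really runs once per 30s from a single place, it cannot fill 500 connections by itself, which is why the instance count, retry behavior, and connection release path are worth checking before closing on this hypothesis.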
Investigation checklist
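- Check query statistics for the specific query and its frequency.
kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "SELECT query, calls, total_exec_time, rows, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"
Expected: The ledger_entries query should appear with a high mean_exec_time (around 2400 ms). pg_stat_statements normalizes literals, so it shows up as status = $1 rather than status='pending'. This requires the pg_stat_statements extension; on PostgreSQL 12 and earlier the columns are named total_time and mean_time.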
- Verify the cron job configuration in the new deployment.
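kubectl get cronjob -n prod -l app=payment-svc -o yaml | grep -A5 'schedule\|command'
Expected: A CronJob with schedule '*/30 * * * *' (every 30 minutes). If nothing is returned, the batch is probably scheduled inside the payment-svc process rather than as a Kubernetes CronJob; check the v2.41 diff for the scheduler configuration and the real interval.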
- Check current active connections and waiting queries on Postgres.
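kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "SELECT state, count(*) FROM pg_stat_activity WHERE backend_type = 'client backend' GROUP BY state;" -c "SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';"
Expected: Client connections totalling close to 500. Mostly 'active' sessions running the ledger_entries query supports the batch-job hypothesis; mostly 'idle' or 'idle in transaction' sessions points to a connection leak. The second result (sessions waiting on locks) should be >0. The two statements use separate -c options because a single -c containing several statements may print only the last result on older psql versions.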
- Check if the query is using an index on status column.
kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "\d+ ledger_entries" | grep -i 'status\|index'
Expected: No index on status column; full table scan expected.
- Check payment-svc logs for connection errors.
kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE 'too many clients|connection refused|FATAL' | tail -20
Expected: Repeated 'FATAL: sorry, too many clients already' and 'connection refused'.
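The connection check in this list is what separates the high hypothesis (queries piling up) from the medium one (leaked connections). A minimal sketch of that reading, assuming the per-connection `state` values have already been fetched from pg_stat_activity; the function name and the 50% majority threshold are illustrative assumptions, not part of any tool.

```python
from collections import Counter

# Hypothetical threshold for illustration; tune it for the real system.
MAJORITY = 0.5


def summarize_backends(states, max_connections):
    """Summarize connection states and suggest which hypothesis they favor.

    states: one pg_stat_activity.state value per client connection.
    """
    counts = Counter(s if s is not None else "unknown" for s in states)
    total = sum(counts.values())
    active = counts.get("active", 0)
    # Covers 'idle', 'idle in transaction' and 'idle in transaction (aborted)'.
    idle = sum(n for s, n in counts.items() if s.startswith("idle"))
    if total and active / total >= MAJORITY:
        hint = "query pile-up"
    elif total and idle / total >= MAJORITY:
        hint = "possible connection leak"
    else:
        hint = "inconclusive"
    return {
        "total": total,
        "utilization": total / max_connections,
        "counts": dict(counts),
        "hint": hint,
    }


if __name__ == "__main__":
    # Synthetic example data, not taken from the incident.
    sample = ["active"] * 400 + ["idle"] * 100
    print(summarize_backends(sample, max_connections=500))
```

Low CPU at 35% with a full connection table is compatible with either outcome, so the state breakdown is more informative than the CPU graph that drove the 14:06 mis-diagnosis.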
Mitigation plan
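Immediately stop the batch job: suspend the CronJob (spec.suspend: true) or delete the CronJob resource if it is a Kubernetes CronJob; if it is scheduled inside payment-svc, disable it there or roll back to the previous release.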
Risk: Nightly settlement batch will not run; financial reconciliation may be delayed. No data loss.
Rollback: Re-apply the original CronJob manifest from v2.41 or re-enable the cron job.
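Raise max_connections on Postgres from 500 to 1000 only as a last resort. The setting takes effect only after a Postgres restart, so it does not relieve pressure without downtime.
Risk: The restart drops every existing connection, and each extra connection consumes server memory; monitor memory usage. Safer alternative: first stop the job and kill long-running queries.
Rollback: Set max_connections back to 500 and restart Postgres again (a second downtime window).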
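Kill the long-running queries that are holding connections (pg_cancel_backend cancels the running query; pg_terminate_backend closes the whole session).
Risk: Cancelled or terminated transactions are rolled back, so in-flight work is lost, but there is no data corruption risk.
Rollback: No rollback needed; the queries will come back unless the batch job has been stopped first.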
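For the query-killing step, it helps to review the exact targets before running anything. The sketch below only builds the SQL text from rows already read out of pg_stat_activity; it does not connect to the database. The row layout, the function name, and the 1-second cutoff are assumptions for illustration. `pg_cancel_backend(pid)` is the standard Postgres function for cancelling a backend's current query.

```python
def cancel_statements(rows, pattern="ledger_entries", min_runtime_s=1.0):
    """Build pg_cancel_backend() calls for active queries matching `pattern`.

    rows: iterable of (pid, runtime_seconds, state, query) tuples.
    Only active sessions running longer than min_runtime_s are selected,
    so ordinary short checkout queries are left alone.
    """
    return [
        f"SELECT pg_cancel_backend({int(pid)});"
        for pid, runtime_s, state, query in rows
        if state == "active" and pattern in query and runtime_s >= min_runtime_s
    ]


if __name__ == "__main__":
    # Synthetic example rows, not taken from the incident.
    sample = [
        (101, 2.3, "active", "SELECT * FROM ledger_entries WHERE status='pending'"),
        (102, 0.01, "active", "SELECT 1"),
        (103, 40.0, "idle", "SELECT * FROM ledger_entries WHERE status='pending'"),
    ]
    for stmt in cancel_statements(sample):
        print(stmt)
```

Cancelling rather than terminating is the gentler first step: the session survives and only the current statement is aborted.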
Customer impact
Customers are experiencing failed checkouts with HTTP 500 errors. Approximately 12% of checkout attempts are failing. Affected users see an error page and cannot complete purchases. No ETA yet.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 13:50: payment-svc v2.41 deployed, adding a cron job for nightly settlement.
- 14:02: p99 latency spikes to 4.8s, error rate rises to 12%.
- 14:03: CS reports failed checkouts.
- 14:04: Pager alert triggered.
- 14:06: Initial mis-diagnosis: DB CPU looks fine.
- [FILL IN]: Investigation identifies slow query and connection exhaustion.
- [FILL IN]: Mitigation actions taken.
Impact
- 12% error rate on payment-svc for ~[FILL IN] minutes.
- Failed checkouts for customers.
- No data loss.
Root Cause
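A new settlement job in payment-svc v2.41 runs a full table scan query (SELECT * FROM ledger_entries WHERE status='pending'; no index on status, ~2.4s per run) that the slow query log shows every 30 seconds. Postgres reached max_connections=500 and rejected new connections. [FILL IN]: confirm how the job's queries account for all 500 connections (number of instances running it, retries, or unreleased connections).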
Detection
Pager alert for high error rate; customer reports.
Response
- [FILL IN: confirm] Suspended cron job.
- [FILL IN: confirm] Killed long-running queries.
- [FILL IN: confirm] Increased max_connections temporarily (requires a Postgres restart).
What Went Well
- Quick identification of the slow query.
- Effective rollback of cron job.
What Went Poorly
- Initial mis-diagnosis (focused on CPU).
- No pre-deployment testing of the cron job's query performance.
- Missing index on status column.
Action Items
- Add index on ledger_entries.status.
- Add query performance checks in CI/CD.
- Implement connection pool monitoring alerts.
- Review cron job scheduling and resource usage.
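As a starting point for the connection monitoring item, the alert condition can be expressed as a ratio of current connections to max_connections. This is a minimal sketch; the function name and the 80% and 95% thresholds are assumptions chosen for illustration, not values from the incident.

```python
def connection_alert(current: int, max_connections: int,
                     warn_ratio: float = 0.80, crit_ratio: float = 0.95) -> str:
    """Classify connection usage as 'ok', 'warning' or 'critical'."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    ratio = current / max_connections
    if ratio >= crit_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # 500/500 is the state recorded during the incident.
    print(connection_alert(500, 500))
    print(connection_alert(410, 500))
    print(connection_alert(100, 500))
```

An alert on this ratio would have fired before the error-rate pager at 14:04 and would have pointed at connections rather than CPU.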
Follow-ups
- P0: Add an index on ledger_entries.status to prevent full table scans. A partial index restricted to status='pending' keeps it small, and CREATE INDEX CONCURRENTLY avoids blocking writes during the build. (Database team)
- P1: Implement query performance regression tests in the CI/CD pipeline for new deployments. (Platform team)
- P1: Set up alerting on Postgres active connections approaching max_connections. (SRE team)
- P2: Review cron job scheduling to avoid peak traffic times and add concurrency limits. (Service owner)
- P2: Add a runbook for connection pool exhaustion incidents. (SRE team)
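One way to read "add concurrency limits" is a guard that skips a run while the previous one is still in progress. The sketch below shows that idea inside a single process with a non-blocking lock; the class name is hypothetical. It does not coordinate across replicas; for that, a shared lock such as a Postgres advisory lock or a CronJob concurrency policy would be needed, which is outside this sketch.

```python
import threading


class SingleFlight:
    """Run a job only if no other run is in progress in this process."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def run(self, job) -> bool:
        """Execute job() and return True, or return False if a run is already active."""
        if not self._lock.acquire(blocking=False):
            return False  # previous run still holds the lock; skip this tick
        try:
            job()
            return True
        finally:
            self._lock.release()


if __name__ == "__main__":
    guard = SingleFlight()
    results = []

    def settlement_batch():
        # A tick that arrives while the batch is running is skipped.
        results.append(guard.run(lambda: None))

    print("outer run executed:", guard.run(settlement_batch))
    print("overlapping run executed:", results[0])
```

Skipping overlapping runs bounds the job at one connection per process no matter how slow the query becomes.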
Similar past incidents
lexical match (pg_trgm)
- 84%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 73%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 44%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 40%: [datadog] checkout p99 8s