[Eval][v3][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:49:49 PM
Raw incident context
Time: 14:02 UTC.
payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment service p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12%. The likely trigger is a new cron job in v2.41 that runs a full table scan on ledger_entries every 30s, leaving Postgres at its connection limit (500 / 500 max_connections). Customers are experiencing failed checkouts.
Severity reasoning: User-facing outage. The error rate is 12% against a threshold of > 1% for > 5 min, it has been elevated since 14:02 UTC and is ongoing, and the revenue path is broken (failed checkouts). Matches SEV1 criteria.
deepseek-chat · prompt v3 · output: en · 11063 ms · 1990↑ / 1631↓ tok · $0.00233
Root cause hypotheses
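- high: New cron job in payment-svc v2.41 runs a full table scan on ledger_entries every 30s, consuming connections and blocking other queries.
Evidence: Deploy at 13:50 UTC added the settlement batch with cron '*/30 * * * *'. Slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index on status, ~2.4s, ~12M rows) running every 30s. Active connections at 500/500. Note: in standard five-field cron, '*/30 * * * *' fires every 30 minutes, not every 30 seconds, so the job's actual schedule should be confirmed.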
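- medium: Connection pool leak in payment-svc v2.41 causes connections not to be released after queries.
Evidence: Active connections are at the maximum, but there is no direct evidence of a leak. Could be a secondary factor: one ~2.4s query every 30s holds a connection only about 8% of the time, so the scan by itself does not obviously account for 500 connections.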
- low: Postgres primary is under-provisioned for current load, leading to connection queueing.
Evidence: CPU is at 35%, which is not high. max_connections=500 is a configured hard limit, not a performance issue.
- low: Network issue between payment-svc and Postgres causing connection timeouts.
Evidence: No network errors in logs; 'connection refused' appears alongside 'too many clients already', which points to Postgres rejecting new connections at the connection limit rather than a network fault.
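To weigh the first two hypotheses against each other, a rough occupancy estimate helps. The sketch below applies Little's law (average connections held = arrival rate × hold time) to the figures in the incident context. The function name is illustrative, and it assumes one connection per run and a single job instance, neither of which the source confirms.

```python
def avg_connections_held(interval_s: float, duration_s: float) -> float:
    """Average number of connections held by a job that starts one query
    every `interval_s` seconds, each holding a connection for `duration_s`
    seconds. Little's law: L = arrival_rate * hold_time."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return duration_s / interval_s


# Figures from the incident context: ~2.4s per execution.
every_30s = avg_connections_held(30, 2.4)          # interval seen in the slow query log
every_30min = avg_connections_held(30 * 60, 2.4)   # interval implied by '*/30 * * * *'

print(f"every 30s:   {every_30s:.3f} connections on average")
print(f"every 30min: {every_30min:.5f} connections on average")
```

Under these assumptions the job holds well under one connection on average at either interval, so reaching 500 would require something more: many concurrent instances, connections that are not released, or request pile-up behind the slowdown.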
Investigation checklist
- Check whether the new cron job is running and how often it fires.
kubectl logs -n prod -l app=payment-svc --since=30m | grep -i 'settlement\|cron\|batch' | tail -20
Expected: Entries showing execution of settlement batch every 30s.
- Confirm the slow query and its impact on connections.
SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' AND query LIKE '%ledger_entries%' ORDER BY query_start LIMIT 20;
Expected: Multiple rows with state 'active' and old query_start timestamps.
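- Check whether the query is missing an index, and confirm its plan.
\d+ ledger_entries
Expected: No index on 'status' column.
EXPLAIN SELECT * FROM ledger_entries WHERE status='pending';
Expected: A sequential scan (Seq Scan) on ledger_entries.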
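- Verify connection exhaustion. Idle and idle-in-transaction sessions also count toward max_connections, so count every state rather than excluding idle ones.
SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;
Expected: Counts summing to roughly 500; the breakdown shows whether connections are active, idle, or idle in transaction.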
- If the evidence above points to v2.41, roll back the deploy and check whether the issue resolves.
kubectl rollout undo deployment/payment-svc -n prod
Expected: Deployment rolled back to the previous version; connection count and error rate begin to fall.
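The per-state connection counts can help separate the "slow query pile-up" hypothesis from the "connection leak" hypothesis. The following is a minimal sketch, assuming rows shaped like the output of `SELECT state, count(*) FROM pg_stat_activity GROUP BY state`. The function names, the 0.9 near-limit threshold, and the sample rows are all hypothetical and do not come from the incident data.

```python
from collections import Counter


def summarize_states(rows):
    """Collapse (state, count) rows into a Counter keyed by state.
    Rows with a NULL state (background processes) are counted as 'unknown'."""
    counts = Counter()
    for state, n in rows:
        counts[state if state is not None else "unknown"] += n
    return counts


def likely_cause(counts, max_connections=500, near_limit=0.9):
    """Rough triage heuristic; thresholds are illustrative assumptions."""
    total = sum(counts.values())
    if total < near_limit * max_connections:
        return "not exhausted"
    idle_like = counts["idle"] + counts["idle in transaction"]
    if idle_like > counts["active"]:
        return "possible leak: most connections are idle"
    return "pile-up: most connections are running queries"


# Hypothetical sample rows, for illustration only.
sample = [("active", 60), ("idle in transaction", 410), ("idle", 25), (None, 5)]
print(likely_cause(summarize_states(sample)))
```

A result dominated by idle or idle-in-transaction sessions would strengthen the leak hypothesis, while one dominated by active sessions would fit the pile-up explanation.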
Mitigation plan
Immediately roll back payment-svc to v2.40 to remove the cron job.
Risk: Brief increase in errors during rollback; no data loss.
Rollback: Re-apply v2.41 if needed after fix.
If rollback is not possible, kill the offending cron job process and disable the cron schedule.
Risk: May need to identify the specific process; could miss if multiple instances.
Rollback: Re-enable cron after fix.
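Temporarily increase max_connections in Postgres to 1000 to relieve pressure.
Risk: Changing max_connections requires a Postgres restart, which interrupts all traffic on the primary. A higher limit may also cause memory pressure on the DB; monitor CPU and memory. Treat as a last resort.
Rollback: Set back to 500 after mitigation (also requires a restart).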
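Add an index on ledger_entries.status to speed up the query.
Risk: A plain CREATE INDEX blocks writes to the table for the duration of the build, which may not be brief on ~12M rows. CREATE INDEX CONCURRENTLY avoids blocking writes but takes longer to build.
Rollback: Drop index if needed.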
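For the index step, a non-blocking build is usually preferable on a table that serves live payment traffic. The following is a minimal sketch, assuming a psycopg2-style DB-API connection is passed in; the index name `idx_ledger_entries_status` and the function name are hypothetical. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, which is why the sketch switches the connection to autocommit first.

```python
# Hypothetical index name; the table and column come from the incident context.
INDEX_SQL = (
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_ledger_entries_status "
    "ON ledger_entries (status)"
)


def create_status_index(conn):
    """Build the index without blocking writes.

    `conn` is assumed to behave like a psycopg2 connection: it exposes an
    `autocommit` attribute and a `cursor()` usable as a context manager.
    """
    # CONCURRENTLY is rejected inside a transaction block, so run in autocommit.
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(INDEX_SQL)
```

If a concurrent build fails partway, Postgres leaves an invalid index behind. It should be dropped before retrying, because IF NOT EXISTS would otherwise skip the rebuild.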
Customer impact
Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of payment-svc requests are failing, mostly with HTTP 500 errors. The issue started around 14:02 UTC and is ongoing.
Postmortem draft
Summary
Payment service p99 latency spiked and error rate increased due to a new cron job introduced in v2.41 that performed a full table scan on ledger_entries, exhausting the Postgres connection pool.
Timeline (UTC)
- 13:50 - payment-svc v2.41 deployed with new cron job
- 14:02 - Latency and error rate spike detected
- 14:03 - Customer reports of failed checkouts
- 14:04 - Pager alert triggered
- [FILL IN] - Rollback initiated
- [FILL IN] - Service restored
Impact
- p99 latency: 4.8s (up from 120ms)
- Error rate: 12% (up from 0.1%)
- Affected users: Users attempting checkout during the incident; roughly 12% of requests failed
Root Cause
A new cron job in payment-svc v2.41 ran a query without an index on ledger_entries.status, causing a full table scan every 30 seconds. This consumed all available database connections, blocking legitimate traffic.
Detection
Alert: PaymentSvcErrorRateHigh triggered at 14:04 UTC. Customer reports (14:03 UTC) preceded the alert.
Response
- On-call engineer investigated and identified the slow query
- Rollback of v2.41 performed
- Index added to prevent recurrence
What Went Well
- Quick identification of the slow query
- Rollback was straightforward
What Went Poorly
- Initial misdiagnosis that DB was healthy
- No pre-deployment performance testing for new cron jobs
Action Items
- Add index on ledger_entries.status
- Add runbook for connection pool exhaustion
- Implement pre-deployment query performance checks
Follow-ups
- P0: Add index on ledger_entries.status to prevent full table scans. (database-administrator)
- P1: Implement pre-deployment performance testing for new cron jobs/queries. (platform-team)
- P1: Add monitoring and alerting for database connection pool usage. (observability-team)
- P2: Update runbook for connection pool exhaustion and slow query handling. (on-call SRE)
- P2: Review deployment process to include changelog review for performance impact. (platform-team)
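The connection-usage alerting follow-up could start from a simple ratio check against max_connections. This is a sketch only: the function name and the 80% / 95% thresholds are assumptions chosen for illustration, not values from the incident or from any existing alert rule.

```python
def pool_usage_level(active: int, max_connections: int,
                     warn: float = 0.80, crit: float = 0.95) -> str:
    """Classify connection usage as 'ok', 'warning', or 'critical'.
    Thresholds are illustrative defaults, not tuned values."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    ratio = active / max_connections
    if ratio >= crit:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


# The incident state (500 / 500) falls in the top band.
print(pool_usage_level(500, 500))
```

An alert at the warning level would fire before Postgres starts rejecting clients, rather than relying on the error-rate alert alone.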
Similar past incidents
lexical match (pg_trgm)
- 89%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 89%: [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 77%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 48%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 47%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts