[Eval][v3][en] Payment service connection pool exhaustion after batch job deploy
service: payment-svc
created: 6/2/2026, 11:36:05 PM
Raw incident context
Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes. Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused" from payment-svc → postgres-primary.
Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending' (no index on status; full table scan over ~12M rows, ~2.4s per execution)
Deploy history:
- 13:50 UTC - payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.
On-call notes (Slack):
- 14:03 - CS reports failed checkout volume rising
- 14:04 - Pager: PaymentSvcErrorRateHigh
- 14:06 - "DB looks healthy, CPU is fine" (initial mis-diagnosis)
Summary
Payment service p99 latency spiked from 120ms to 4.8s and the error rate rose to 12% after a new cron job in v2.41 began running a slow full-table-scan query on ledger_entries, with PostgreSQL connections exhausted (500/500 max_connections). Customers are experiencing failed checkouts.
Severity reasoning: User-facing outage: error rate > 1% for >5 min (12% for ~12 min), revenue path broken (failed checkouts). Matches SEV1 criteria.
deepseek-chat·prompt v3·output: en·10953ms·1990↑ / 1688↓ tok·$0.00239
Root cause hypotheses
- high: New cron job in payment-svc v2.41 runs a slow query every 30s, exhausting DB connections.
Evidence: Deploy at 13:50 UTC added cron '*/30 * * * *' for settlement batch. Slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index, ~2.4s) running every 30s. active_connections at 500/500. Note: in standard five-field cron syntax, '*/30 * * * *' fires every 30 minutes, so the 30s cadence seen in the slow query log still needs to be reconciled with the schedule (for example, a seconds-based scheduler or multiple instances).
- medium: Connection pool leak in payment-svc v2.41 due to bug in new code.
Evidence: Application logs show 'too many clients already' and 'connection refused'. No other changes in 7 days. New code shipped at 13:50.
- low: Sudden traffic spike causing legitimate connections to exceed max_connections.
Evidence: No evidence of traffic spike in context. Error rate and latency correlate with deploy time, not traffic.
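A back-of-envelope check can help rank the first two hypotheses. By Little's law, the average number of connections a periodic query keeps busy is its execution time divided by its interval, multiplied by the number of instances running it. The sketch below plugs in the figures from the incident context (2.4s per execution, one run every 30s, 500 max_connections). The instance count is not stated in the context and is a hypothetical parameter, and the estimate assumes each execution returns its connection when it finishes.

```python
def avg_connections_held(exec_seconds: float, interval_seconds: float,
                         instances: int = 1) -> float:
    """Average connections busy with a periodic query.

    Little's law: concurrency = arrival rate * time in system.
    Assumes every execution releases its connection when it finishes.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return instances * exec_seconds / interval_seconds


def instances_to_saturate(exec_seconds: float, interval_seconds: float,
                          max_connections: int) -> float:
    """Instances that would have to run the query to fill max_connections."""
    if exec_seconds <= 0:
        raise ValueError("exec_seconds must be positive")
    return max_connections * interval_seconds / exec_seconds


if __name__ == "__main__":
    # Figures from the incident context; a single instance is an assumption.
    print(avg_connections_held(2.4, 30.0))
    print(instances_to_saturate(2.4, 30.0, 500))
```

If the per-instance estimate is far below 500, a well-behaved query on this schedule would not fill the pool by itself, which makes it worth checking whether the batch job also fails to release connections (the second hypothesis) or runs with much higher concurrency than the schedule suggests.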
Investigation checklist
- Check active connections and slow queries on PostgreSQL primary.
SELECT count(*) FROM pg_stat_activity WHERE state != 'idle';
SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;
Expected: Count near 500; slow query with full table scan on ledger_entries.
- Verify the new cron job is running and its query.
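kubectl logs -n prod -l app=payment-svc --since=30m --tail=-1 | grep -E 'settlement|cron|ledger_entries' | tail -20
Expected: Logs showing cron execution and the slow query. (--tail=-1 is needed because kubectl logs returns only the last 10 lines per pod when a label selector is used.)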
- Check if an index exists on ledger_entries.status.
\d+ ledger_entries
Expected: No index on status column.
- Check error rate and latency in monitoring.
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(payment_svc_http_requests_total{status=~"5.."}[5m])) / sum(rate(payment_svc_http_requests_total[5m]))' | jq .
Expected: Error ratio ~0.12 (12%).
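The last check can also be scripted so the result is compared against a threshold rather than read by eye. The sketch below parses the JSON body returned by the Prometheus instant-query endpoint and extracts the first sample. It assumes the query is a ratio expression that returns a single series (5xx rate divided by total rate); the function names and the 1% threshold (taken from the severity reasoning) are illustrative.

```python
import json


def first_sample_value(body: str) -> float:
    """Return the value of the first series in an instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no series")
    # Each instant-vector sample is [unix_timestamp, "value-as-string"].
    _timestamp, value = result[0]["value"]
    return float(value)


def error_ratio_exceeds(body: str, threshold: float = 0.01) -> bool:
    """True when the error ratio in the response is above the threshold."""
    return first_sample_value(body) > threshold


if __name__ == "__main__":
    # Hand-written sample body in the instant-query response shape.
    sample = json.dumps({
        "status": "success",
        "data": {"resultType": "vector",
                 "result": [{"metric": {}, "value": [1780000000, "0.12"]}]},
    })
    print(first_sample_value(sample), error_ratio_exceeds(sample))
```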
Mitigation plan
Immediately disable the cron job in payment-svc v2.41 by scaling down the cron deployment or removing the cron entry via ConfigMap.
Risk: Settlement batch will not run until re-enabled; no data loss.
Rollback: Re-apply the original ConfigMap or scale up the cron deployment.
Kill the slow queries currently running to free up connections.
Risk: May cause incomplete settlement processing; safe as cron is disabled.
Rollback: N/A (queries are read-only).
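Increase max_connections temporarily (e.g., to 800) to restore service while investigating. Changing max_connections requires a PostgreSQL restart, so treat this as a last resort if disabling the cron job and killing the slow queries do not free enough connections.
Risk: The restart briefly drops all existing connections; a higher limit also increases DB memory pressure, so monitor CPU/memory.
Rollback: Set max_connections back to 500 (requires another restart).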
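Add an index on ledger_entries.status to prevent future slow queries.
Risk: A plain CREATE INDEX blocks writes to the table for the duration of the build; prefer CREATE INDEX CONCURRENTLY, which does not block writes but takes longer, and schedule it during low traffic.
Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;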
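The "kill the slow queries" step could be scripted roughly as follows. This is a sketch, not a vetted runbook: it assumes a DB-API cursor whose driver uses %s placeholders (as psycopg does), a role permitted to call pg_terminate_backend, and a LIKE pattern chosen by the operator. The function name and parameters are hypothetical.

```python
# Terminates active backends whose query text matches a pattern and that
# have been running longer than a minimum number of seconds. The current
# session is excluded so the script does not terminate itself.
TERMINATE_SQL = """
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query LIKE %s
  AND now() - query_start > make_interval(secs => %s)
  AND pid <> pg_backend_pid()
"""


def terminate_slow_queries(cursor, query_pattern: str, min_runtime_seconds: float):
    """Run TERMINATE_SQL and return the pids that were signalled successfully."""
    cursor.execute(TERMINATE_SQL, (query_pattern, min_runtime_seconds))
    return [pid for pid, terminated in cursor.fetchall() if terminated]


# Example call (requires a live connection, so it is not executed here):
#   with conn.cursor() as cur:
#       pids = terminate_slow_queries(cur, "%FROM ledger_entries%", 2)
```

Disabling the cron job first matters here: otherwise the next scheduled run would simply reopen the connections that were just freed.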
Customer impact
Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of requests are failing with HTTP 500 errors. The issue started around 14:02 UTC and is ongoing.
Postmortem draft
Summary
Payment service (payment-svc) experienced a severe degradation from 14:02 to [FILL IN] UTC, with p99 latency rising from 120ms to 4.8s and error rate reaching 12%. Root cause was a new cron job in v2.41 that executed a slow full-table-scan query every 30 seconds, exhausting the PostgreSQL connection pool.
Timeline (UTC)
- 13:50 - payment-svc v2.41 deployed with new settlement cron.
- 14:02 - Latency and error rate spike detected.
- 14:03 - Customer reports of failed checkouts.
- 14:04 - Pager alert triggered.
- 14:06 - Initial mis-diagnosis: 'DB looks healthy'.
- [FILL IN] - Cron job disabled, connections freed.
- [FILL IN] - Service restored.
Impact
- Error rate: 12% (HTTP 500).
- Latency: p99 4.8s.
- Affected users: Customers attempting checkout during the incident; roughly 12% of requests failed.
- Revenue impact: [FILL IN].
Root Cause
A new cron job added in v2.41 ran SELECT * FROM ledger_entries WHERE status='pending' every 30 seconds. The query performed a full table scan over ~12M rows (no index on status), taking ~2.4s each. This consumed all 500 available database connections, causing new requests to fail with 'too many clients already'.
Detection
Alert PaymentSvcErrorRateHigh fired at 14:04. At 14:06 the on-call engineer initially judged the database healthy because CPU was normal (35%), overlooking that connections were saturated at 500/500.
Response
- [FILL IN] steps taken.
What Went Well
- Alerting worked.
- Deploy history quickly pointed to the v2.41 release.
What Went Poorly
- Initial mis-diagnosis delayed mitigation.
- No index on status column.
- No pre-deploy testing for cron performance.
Action Items
- Add index on ledger_entries.status.
- Add connection pool monitoring alert.
- Add slow query alerting.
- Review cron job performance before deploy.
Follow-ups
- P0: Add index on ledger_entries.status to prevent full table scans. Owner: database-administrator
- P1: Implement connection pool monitoring and alert when usage > 80%. Owner: platform-team
- P1: Add slow query detection and alerting (e.g., pg_stat_statements). Owner: database-administrator
- P2: Add pre-deploy performance testing for new cron jobs in staging. Owner: service-owner
- P2: Review and improve incident response runbook for DB connection exhaustion. Owner: on-call SRE
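The connection-usage alert in the follow-ups comes down to a simple ratio check. A minimal sketch, assuming the active connection count and max_connections are already being collected; the function names are illustrative and the 80% default is the threshold proposed above:

```python
def connection_usage(active: int, max_connections: int) -> float:
    """Fraction of max_connections currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def should_alert(active: int, max_connections: int, threshold: float = 0.80) -> bool:
    """True when connection usage is strictly above the threshold."""
    return connection_usage(active, max_connections) > threshold


if __name__ == "__main__":
    # State during the incident: 500 of 500 connections in use.
    print(should_alert(500, 500))
```

An alert like this would have flagged the saturation that the CPU-based health check missed at 14:06.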
Similar past incidents
lexical match (pg_trgm)
- 95%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 79%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 79%: [Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 50%: [Eval][v1][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 46%: [Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts