
[Eval][v3][en] Payment service connection pool exhaustion after batch job deploy

service: payment-svc
created: 6/2/2026, 11:49:35 PM

Raw incident context

Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes.
Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused"
from payment-svc → postgres-primary.

Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending'
  (no index on status; full table scan over ~12M rows, ~2.4s per execution)

Deploy history:
- 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.

On-call notes (Slack):
- 14:03 — CS reports failed checkout volume rising
- 14:04 — Pager: PaymentSvcErrorRateHigh
- 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)

Summary

SEV1

Payment service p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12% after postgres-primary hit its connection limit (500/500 max_connections). The leading suspect is the settlement batch added in payment-svc v2.41, whose unindexed full table scan on ledger_entries (~12M rows, ~2.4s per run) appears in the slow query log every 30 seconds.

Severity reasoning: User-facing outage: error rate > 1% for >5 min (12% for ~10 min), revenue path broken (failed checkouts). Matches SEV1 criteria.

deepseek-chat · prompt v3 · output: en · 10024ms · 1990↑ / 1566↓ tok · $0.00226

Root cause hypotheses

  • high: New cron job in payment-svc v2.41 runs a full table scan on ledger_entries every 30s, consuming connections and blocking other queries.

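    Evidence: The 13:50 UTC deploy added a settlement batch with cron '*/30 * * * *'. The slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index on status, ~2.4s per run) every 30s, and active_connections is at 500/500. Caveat: in standard five-field cron syntax, '*/30 * * * *' fires every 30 minutes, so the 30s interval seen in the slow query log still needs to be reconciled with how the job is actually triggered.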

  • medium: Connection pool leak in payment-svc v2.41 causes connections not to be released after queries.

    Evidence: Application logs show 'too many clients already' and 'connection refused'. active_connections at max. No evidence of slow queries other than the cron job.

  • low: External traffic spike overwhelms payment-svc, causing connection exhaustion.

    Evidence: No evidence of traffic increase in context. Error rate and latency correlate with deploy time, not external factors.
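
The schedule quoted in the changelog can be checked mechanically. The following is a minimal sketch, assuming standard five-field cron syntax where the first field is minutes; the helper names expand_minute_field and runs_per_hour are hypothetical and written for illustration only. If payment-svc uses an in-process scheduler or a six-field syntax with a seconds column, the result would differ.

```python
def expand_minute_field(field: str) -> list[int]:
    """Expand a cron minute field ('*', '*/30', '0,15', '10-20/5') into minutes of the hour."""
    minutes: set[int] = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = 0, 59
        elif "-" in base:
            low, high = base.split("-")
            start, end = int(low), int(high)
        else:
            # A bare number fires once in the hour.
            start = end = int(base)
        minutes.update(range(start, end + 1, step))
    return sorted(minutes)


def runs_per_hour(expression: str) -> int:
    """Count firings per hour. ASSUMPTION: five-field cron with the hour field set to '*'."""
    minute_field = expression.split()[0]
    return len(expand_minute_field(minute_field))


schedule = "*/30 * * * *"  # schedule quoted in the v2.41 changelog
print(expand_minute_field(schedule.split()[0]))
print(runs_per_hour(schedule))
```

Under that assumption the job fires at minutes 0 and 30 of each hour, which matches neither "nightly" in the changelog nor the 30s cadence in the slow query log, so the actual trigger is worth confirming before settling on this hypothesis.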

Investigation checklist

  1. Check active connections and identify which queries are holding them open.
    SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;

    Expected: Multiple connections running the slow query "SELECT * FROM ledger_entries WHERE status='pending'" with query_start timestamps well in the past.

  2. Verify the cron job configuration in the latest deployment.
    kubectl get cronjob -n prod -l app=payment-svc -o yaml | grep -A5 'schedule'

    Expected: Schedule '*/30 * * * *' and command referencing settlement batch.

  3. Check whether ledger_entries has an index on the status column.
    SELECT tablename, indexname, indexdef FROM pg_indexes WHERE tablename='ledger_entries';

    Expected: No index on status column; indexdef does not include status.

  4. Monitor connection count and query performance after mitigation.
    watch -n 5 "psql -c \"SELECT count(*) FROM pg_stat_activity WHERE state = 'active';\""

    Expected: Active connections drop below 500 after killing cron queries or rolling back.
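
To tell the slow-query hypothesis apart from the connection-leak hypothesis, it can help to group sessions by state as well as by query text. The following is a minimal sketch, assuming the rows from pg_stat_activity have already been fetched into dictionaries with state and query keys; summarize_sessions and the sample rows are hypothetical and not taken from the incident.

```python
from collections import Counter


def summarize_sessions(rows):
    """Return (sessions per state, active sessions per query text)."""
    by_state = Counter(row["state"] for row in rows)
    by_query = Counter(row["query"] for row in rows if row["state"] == "active")
    return by_state, by_query


# Hypothetical sample rows shaped like pg_stat_activity output.
SCAN = "SELECT * FROM ledger_entries WHERE status='pending'"
sample_rows = (
    [{"state": "active", "query": SCAN}] * 3
    + [{"state": "idle in transaction", "query": "SELECT 1"}] * 2
    + [{"state": "idle", "query": "COMMIT"}]
)

by_state, by_query = summarize_sessions(sample_rows)
print(by_state.most_common())
print(by_query.most_common(1))
```

Many sessions in the active state running the ledger_entries scan would support the high-confidence hypothesis, while a large number of idle or idle in transaction sessions would point toward connections not being released.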

Mitigation plan

  • Immediately kill the slow queries to free connections: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE '%ledger_entries%' AND state='active' AND pid <> pg_backend_pid();

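    Risk: pg_terminate_backend closes the whole session and rolls back any open transaction, so no partial writes persist; this is low risk for read-only SELECTs. The LIKE '%ledger_entries%' filter also matches legitimate payment queries on that table, so review the matching rows before terminating them.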

    Rollback: Re-run the cron job manually after indexing.

  • Roll back payment-svc to previous version (v2.40) to remove the cron job.

    Risk: Rollback may cause brief downtime during deployment. Safe as v2.40 was stable.

    Rollback: Re-deploy v2.41 after fixing the issue.

  • Add an index on ledger_entries.status to prevent full table scans.

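    Risk: A plain CREATE INDEX blocks writes to ledger_entries for the duration of the build, which is unlikely to be brief on ~12M rows. CREATE INDEX CONCURRENTLY avoids blocking writes but takes longer, cannot run inside a transaction block, and leaves an invalid index behind if it fails.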

    Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;

Customer impact

Customers are experiencing failed checkouts and errors when processing payments. Approximately 12% of payment attempts are failing with HTTP 500 errors. The issue started around 14:02 UTC and is ongoing.

Postmortem draft

Summary

[FILL IN: 2-3 sentence summary]

Timeline (UTC)

  • 13:50 - payment-svc v2.41 deployed with new cron job
  • 14:02 - p99 latency spikes to 4.8s, error rate rises
  • 14:03 - CS reports failed checkouts
  • 14:04 - Pager alert triggered
  • 14:06 - Initial mis-diagnosis: DB CPU fine
  • [FILL IN: mitigation steps and resolution time]

Impact

  • p99 latency: 120ms → 4.8s
  • Error rate: 0.1% → 12%
  • Affected users: [FILL IN: estimated count]

Root Cause

New cron job in payment-svc v2.41 runs a full table scan on ledger_entries (12M rows) every 30 seconds, consuming all 500 PostgreSQL connections and blocking legitimate queries.
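
As a sanity check on this root cause, Little's law (average in-flight work = arrival rate × hold time) gives a rough estimate of how many connections the scan itself occupies. The sketch below uses only figures from the incident context (30s interval, ~2.4s per run, 500 max_connections, 13:50 deploy, 14:02 onset). The assumptions that each run uses one connection and that the pool was empty at deploy time are illustrative and not taken from the source.

```python
def average_connections_held(interval_s: float, duration_s: float,
                             connections_per_run: int = 1) -> float:
    """Little's law: average in-flight connections = arrival rate * hold time."""
    return connections_per_run * duration_s / interval_s


# Figures from the incident context.
QUERY_INTERVAL_S = 30
QUERY_DURATION_S = 2.4
MAX_CONNECTIONS = 500

# 13:50 UTC deploy to 14:02 UTC latency spike, in seconds.
elapsed_s = ((14 * 60 + 2) - (13 * 60 + 50)) * 60
runs_before_onset = elapsed_s // QUERY_INTERVAL_S

# ASSUMPTION: one connection per run, released when the query finishes.
steady_state = average_connections_held(QUERY_INTERVAL_S, QUERY_DURATION_S)

# ASSUMPTION: pool empty at deploy time, so this is an upper bound.
held_per_run_needed = MAX_CONNECTIONS / runs_before_onset

print(f"average connections held by the scan: {steady_state:.2f}")
print(f"runs between deploy and onset: {runs_before_onset}")
print(f"connections each run would need to keep open: {held_per_run_needed:.1f}")
```

If each run released its connection promptly, the scan alone would hold well under one connection on average. Reaching 500/500 within twelve minutes therefore suggests that connections are being retained after each run, that each run opens many connections, or that baseline usage was already close to the limit; the baseline is not recorded in the incident context.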

Detection

Alert: PaymentSvcErrorRateHigh at 14:04 UTC. Detected via pager.

Response

  • Killed slow queries to free connections
  • Rolled back to v2.40
  • Added index on status column

What Went Well

  • Quick identification of slow query
  • Rollback was straightforward

What Went Poorly

  • Initial mis-diagnosis (DB CPU fine)
  • No pre-deploy review of cron job performance
  • Missing index on status column

Action Items

  • Add index on ledger_entries.status
  • Add connection pool monitoring alert
  • Add slow query detection alert
  • Review cron jobs for performance impact before deploy

Follow-ups

  • P0: Add index on ledger_entries.status column to prevent full table scans. Owner: database-team
  • P1: Set up alert on PostgreSQL active_connections approaching max_connections. Owner: on-call SRE
  • P1: Add slow query detection and alerting (e.g., pg_stat_statements). Owner: platform-team
  • P2: Implement pre-deploy performance review for new cron jobs (load test). Owner: service-owner
  • P2: Update runbook for payment-svc connection exhaustion with specific commands. Owner: on-call SRE
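
The P1 alert on active_connections could be evaluated along the following lines. This is a minimal sketch: the function name connection_alert_level and the 80% and 95% thresholds are hypothetical placeholders, not values from the incident or from any existing alerting rule, and would need tuning against normal payment-svc load.

```python
def connection_alert_level(active: int, max_connections: int,
                           warn_ratio: float = 0.80,
                           page_ratio: float = 0.95) -> str:
    """Classify connection saturation as 'ok', 'warn', or 'page'."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    ratio = active / max_connections
    if ratio >= page_ratio:
        return "page"
    if ratio >= warn_ratio:
        return "warn"
    return "ok"


# The 500 / 500 reading from this incident would have paged.
print(connection_alert_level(500, 500))
print(connection_alert_level(450, 500))
print(connection_alert_level(100, 500))
```

Alerting on the ratio rather than on an absolute count keeps the rule valid if max_connections is changed later.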