[Eval][v3][en] Payment service connection pool exhaustion after batch job deploy

service: payment-svc · created: 6/2/2026, 11:50:04 PM

Raw incident context

Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes.
Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused"
from payment-svc → postgres-primary.

Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending'
  (no index on status; full table scan over ~12M rows, ~2.4s per execution)

Deploy history:
- 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.

On-call notes (Slack):
- 14:03 — CS reports failed checkout volume rising
- 14:04 — Pager: PaymentSvcErrorRateHigh
- 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)

Summary

SEV1

Payment service p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12% after PostgreSQL connections were exhausted (500/500 max_connections). The likely trigger is the settlement batch added in v2.41: the slow query log shows a new, unindexed full table scan on ledger_entries recurring every 30s.

Severity reasoning: User-facing outage: error rate > 1% for >5 min (12% for ~10 min), revenue path broken (failed checkouts). Matches SEV1 criteria.

deepseek-chat · prompt v3 · output: en · 10682ms · 1990↑ / 1615↓ tok · $0.00231

Root cause hypotheses

  • [high] New cron job in payment-svc v2.41 runs a full table scan on ledger_entries every 30s, holding connections open and exhausting the pool.

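    Evidence: Deploy at 13:50 UTC added the settlement batch with cron '*/30 * * * *'. Slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index, ~2.4s, ~12M rows) every 30s. Active connections at 500/500. Caveat: '*/30 * * * *' is a 30-minute schedule, so the 30s cadence in the slow query log still has to be reconciled with the cron definition.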

  • [medium] Connection pool leak in payment-svc v2.41 code changes.

    Evidence: No direct evidence of leak; connection count is exactly max_connections, suggesting pool exhaustion rather than leak. But code change could have introduced a leak.

  • [low] External traffic spike causing more concurrent requests.

    Evidence: No traffic spike reported; latency and errors correlate with deploy time, not traffic.
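
A rough occupancy estimate can help weigh the first two hypotheses against each other. The sketch below assumes each batch run holds one connection only while its query executes (an assumption, not something the incident context confirms) and plugs in the figures from the slow query log. The function names are illustrative.

```python
def mean_connections_held(query_seconds: float, interval_seconds: float,
                          runners: int = 1) -> float:
    """Average number of connections busy with the scan.

    Assumes each runner starts one query every `interval_seconds` and holds
    its connection only for `query_seconds` (hypothetical model).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return runners * query_seconds / interval_seconds


def runners_to_fill_pool(pool_size: int, query_seconds: float,
                         interval_seconds: float) -> float:
    """How many independent runners would be needed to occupy the whole pool."""
    return pool_size / mean_connections_held(query_seconds, interval_seconds)


if __name__ == "__main__":
    # Figures from the incident context: ~2.4s per execution, every 30s, 500 max.
    print(mean_connections_held(2.4, 30.0))
    print(runners_to_fill_pool(500, 2.4, 30.0))
```

With 2.4s per execution every 30s, a single runner holds about 0.08 connections on average, so under this model the scan alone would fill a 500-connection limit only with thousands of overlapping runners. If that is implausible for this deployment, how the batch opens and releases connections deserves a closer look.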

Investigation checklist

  1. Check active connections and waiting queries on PostgreSQL.
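    SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;
    SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';

    Expected: Total across all states near 500, which confirms connection exhaustion (idle and idle-in-transaction sessions count toward max_connections too); a non-zero second count shows sessions blocked on locks.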

  2. Identify the query consuming resources and its source.
    SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;

    Expected: Multiple entries of SELECT * FROM ledger_entries WHERE status='pending' whose query_start is already seconds in the past.

  3. Check payment-svc logs for connection errors.
    kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE 'too many clients|connection refused' | tail -20

    Expected: Repeated 'FATAL: sorry, too many clients already' and 'connection refused'.

  4. Verify the new cron job in the deployment.
    kubectl get cronjob -n prod -l app=payment-svc -o yaml | grep -A5 schedule

    Expected: Schedule: */30 * * * * for settlement-batch.

  5. Check if the query can be optimized with an index.
    SELECT tablename, indexname, indexdef FROM pg_indexes WHERE tablename = 'ledger_entries';

    Expected: No index on status column.
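
For step 4, it may help to expand the minute field of the schedule, since a standard five-field cron expression has minute granularity. The sketch below is a simplified expander, not a full cron implementation: the name expand_minute_field is hypothetical, and it only handles `*`, single values, ranges, comma lists, and `/step` suffixes.

```python
from typing import List


def expand_minute_field(field: str) -> List[int]:
    """Expand the minute field of a 5-field cron expression into minute values."""
    minutes = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step <= 0:
            raise ValueError("step must be positive")
        if base == "*":
            start, end = 0, 59
        elif "-" in base:
            low, high = base.split("-")
            start, end = int(low), int(high)
        else:
            start = end = int(base)
        if not (0 <= start <= end <= 59):
            raise ValueError("minute values must be within 0-59")
        minutes.update(range(start, end + 1, step))
    return sorted(minutes)


if __name__ == "__main__":
    schedule = "*/30 * * * *"          # schedule quoted in the v2.41 changelog
    minute_field = schedule.split()[0]
    print(expand_minute_field(minute_field))
```

For the changelog's schedule this yields minutes 0 and 30, i.e. two runs per hour, which is worth comparing against the 30s cadence seen in the slow query log.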

Mitigation plan

  • Immediately kill the slow queries to free up connections.

    Risk: May cause incomplete settlement batch; no data loss.

    Rollback: Re-run the batch manually after mitigation.

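  • Temporarily increase max_connections to 1000 to relieve pressure.

    Risk: max_connections cannot be changed with a configuration reload; it takes effect only after a PostgreSQL restart, which drops every existing connection. A higher limit also increases DB memory usage; monitor CPU and memory.

    Rollback: ALTER SYSTEM SET max_connections = 500; then restart PostgreSQL (pg_reload_conf() does not apply this setting).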

  • Roll back payment-svc to v2.40 to remove the cron job.

    Risk: The rollback may cause brief downtime, but v2.40 was stable before the 13:50 UTC deploy.

    Rollback: Re-deploy v2.41 after fixing the issue.

  • Add an index on ledger_entries.status to speed up the query.

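    Risk: A plain CREATE INDEX blocks writes to ledger_entries for the duration of the build; use CREATE INDEX CONCURRENTLY, which takes longer and cannot run inside a transaction block.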

    Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
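
The first mitigation step could look like the following sketch. It assumes a DB-API connection to postgres-primary (for example, one opened with psycopg2) under a role that is allowed to signal other backends. The function name, LIKE pattern, and age threshold are illustrative. It uses pg_terminate_backend rather than pg_cancel_backend because the goal is to free connections, not only to stop the running query.

```python
# Terminates backends that are actively running the slow ledger scan.
# Parameters use the pyformat style accepted by psycopg2 (assumed driver).
TERMINATE_SLOW_SCANS_SQL = """
SELECT pid, pg_terminate_backend(pid) AS terminated
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND state = 'active'
  AND query LIKE %(pattern)s
  AND now() - query_start > make_interval(secs => %(min_age_s)s)
"""


def terminate_slow_scans(conn, pattern="SELECT * FROM ledger_entries%",
                         min_age_s=2.0):
    """Terminate matching backends and return the pids that were signalled.

    `conn` is any DB-API connection whose cursors work as context managers.
    """
    with conn.cursor() as cur:
        cur.execute(TERMINATE_SLOW_SCANS_SQL,
                    {"pattern": pattern, "min_age_s": min_age_s})
        return [pid for pid, terminated in cur.fetchall() if terminated]
```

Running the same WHERE clause as a plain SELECT first shows which sessions would be terminated. Terminated batch runs will not have completed, so the settlement batch needs a manual re-run afterwards, as noted in the rollback for this step.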

Customer impact

Customers are experiencing failed checkouts and errors when trying to complete payments. Approximately 12% of requests are failing with HTTP 500 errors. The issue started at about 14:02 UTC and is ongoing.

Postmortem draft

Summary

Payment service experienced a SEV1 outage from 14:02 UTC to [FILL IN] due to a new cron job in v2.41 that exhausted PostgreSQL connections.

Timeline (UTC)

  • 13:50 - Deploy payment-svc v2.41 with settlement batch cron
  • 14:02 - p99 latency spikes to 4.8s, error rate climbs
  • 14:03 - CS reports failed checkouts
  • 14:04 - Pager alert
  • 14:06 - Initial mis-diagnosis: 'DB looks healthy'
  • [FILL IN] - Mitigation applied

Impact

  • p99 latency: 120ms → 4.8s
  • Error rate: 0.1% → 12%
  • Affected users: customers attempting checkout (about 12% of requests failing at peak)

Root Cause

The new settlement batch in v2.41 ran a full table scan on ledger_entries (~12M rows, ~2.4s per execution) every 30s, holding connections and exhausting max_connections (500).

Detection

Pager alert PaymentSvcErrorRateHigh fired at 14:04 UTC, one minute after CS reported rising failed checkout volume in Slack (14:03 UTC).

Response

  • Killed slow queries
  • Increased max_connections
  • Rolled back to v2.40
  • Added index on status

What Went Well

  • Quick identification of slow query
  • Rollback was straightforward

What Went Poorly

  • Initial mis-diagnosis delayed action
  • No index on status column
  • No load testing for new cron

Action Items

  • Add index on ledger_entries.status
  • Add connection pool monitoring alert
  • Add query performance regression tests in CI
  • Review cron job implementation for efficiency
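
The connection pool monitoring alert could start from a simple utilization check like the sketch below. The function name and the 80% default threshold are assumptions for illustration; the inputs would come from a connection count over pg_stat_activity and the max_connections setting.

```python
def connection_alert(active_connections: int, max_connections: int,
                     threshold: float = 0.80) -> bool:
    """Return True when connection utilization exceeds the threshold.

    The threshold is a fraction of max_connections (0.80 is an assumed default).
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    return active_connections / max_connections > threshold


if __name__ == "__main__":
    # State during the incident: 500 of 500 connections in use.
    print(connection_alert(500, 500))
```

An alert of this kind would have fired before the hard limit was reached, which is the point at which "too many clients already" errors begin.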

Follow-ups

  • [P0] Add index on ledger_entries.status column (owner: database-administrator)
  • [P1] Set up alert on PostgreSQL active connections > 80% of max_connections (owner: on-call SRE)
  • [P1] Add query performance regression tests in CI for new cron jobs (owner: platform-team)
  • [P2] Review cron job implementation to use batch processing with limits (owner: payments-platform)
  • [P2] Update runbook for connection pool exhaustion with specific commands (owner: on-call SRE)
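
One possible shape for the "batch processing with limits" follow-up is keyset pagination: fetch pending rows in bounded chunks ordered by primary key instead of one unbounded SELECT *. The sketch below is a minimal illustration that runs against an in-memory SQLite table so it is self-contained. The columns id and amount_cents are hypothetical (the incident context only names ledger_entries and status), and it assumes positive, increasing integer ids.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_pending_batches(conn: sqlite3.Connection,
                         batch_size: int = 1000) -> Iterator[List[Tuple[int, int]]]:
    """Yield pending ledger rows in id order, at most batch_size rows per query."""
    last_id = 0  # assumes ids start above 0
    while True:
        rows = conn.execute(
            "SELECT id, amount_cents FROM ledger_entries "
            "WHERE status = 'pending' AND id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last row already seen


def demo() -> List[int]:
    """Build a tiny table and return the size of each batch that was read."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ledger_entries "
        "(id INTEGER PRIMARY KEY, status TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ledger_entries (id, status, amount_cents) VALUES (?, ?, ?)",
        [(i, "pending" if i % 2 else "settled", 100 * i) for i in range(1, 11)],
    )
    sizes = [len(batch) for batch in iter_pending_batches(conn, batch_size=2)]
    conn.close()
    return sizes


if __name__ == "__main__":
    print(demo())
```

Each query is short and bounded, so a connection can be returned to the pool between batches. On PostgreSQL the same pattern benefits from an index that covers the status filter and the id ordering.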