
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy

service: payment-svc · created: 6/2/2026, 11:35:28 PM

Raw incident context

Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes.
Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused"
from payment-svc → postgres-primary.

Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending'
  (no index on status; full table scan over ~12M rows, ~2.4s per execution)

Deploy history:
- 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.

On-call notes (Slack):
- 14:03 — CS reports failed checkout volume rising
- 14:04 — Pager: PaymentSvcErrorRateHigh
- 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)

Summary

SEV1

Payment service p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12% after payment-svc v2.41 introduced a settlement batch job. The job runs an unindexed full table scan on ledger_entries, and Postgres is at its connection limit (500 / 500 max_connections). Customers are experiencing failed checkouts.

Severity reasoning: User-facing outage with >1% error rate (12%) for >5 minutes, revenue path broken (failed checkouts).

deepseek-chat·prompt v2·output: en·10567ms·1739↑ / 1480↓ tok·$0.00210

Root cause hypotheses

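  • high: New settlement batch job in payment-svc v2.41 runs a slow query every 30s, holding connections that checkout requests need.

    Evidence: Deploy at 13:50 UTC added cron '*/30 * * * *' for the settlement batch; slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index, ~2.4s) running every 30s; active_connections is at the max of 500. Caveat: read as a standard five-field cron expression, '*/30 * * * *' fires every 30 minutes, not every 30 seconds, so the 30s cadence in the slow query log still needs to be reconciled with the configured schedule.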

  • medium: Connection pool leak in payment-svc due to unclosed connections from the new code.

    Evidence: Application logs show 'too many clients already' but no explicit leak evidence; connection count is exactly max_connections.

  • low: External traffic spike causing increased connection demand.

    Evidence: No traffic spike observed; error rate correlates with deploy time, not traffic patterns.
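
The cadence and load behind the first hypothesis can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it expands the minute field of the changelog's cron expression and estimates how many connections the batch query would hold on average, using the 30s interval and ~2.4s duration from the slow query log. The function names and the `instances` parameter are hypothetical, and the estimate assumes each run holds exactly one connection for the duration of the query.

```python
def expand_minute_field(field):
    """Expand a cron minute field of the form '*' or '*/N' into matching minutes."""
    if field == "*":
        return list(range(60))
    if field.startswith("*/"):
        return list(range(0, 60, int(field[2:])))
    raise ValueError(f"unsupported minute field: {field!r}")


def avg_busy_connections(interval_s, query_s, instances=1):
    """Average connections held by a periodic query (rate times duration)."""
    return instances * query_s / interval_s


# Minute field of '*/30 * * * *' from the v2.41 changelog.
minutes = expand_minute_field("*/30")
cron_interval_s = (minutes[1] - minutes[0]) * 60

print("cron fires at minutes:", minutes)
print("cron interval (s):", cron_interval_s)

# Figures from the slow query log: one run every 30s, ~2.4s per run.
print("avg busy connections:", round(avg_busy_connections(30, 2.4), 2))
```

If the estimate comes out far below 500, the batch query is probably not holding most of the connections by itself. That would be a reason to check how many instances run the job and to keep the leak hypothesis open.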

Investigation checklist

  1. Check if the slow query is from the new cron job by examining the application logs for the cron execution.
    kubectl logs -n prod -l app=payment-svc --since=30m | grep -E 'settlement|cron|ledger_entries' | head -20

    Expected: Lines showing cron job execution and the slow query being logged.

  2. Identify the source of connections consuming the pool.
    SELECT pid, application_name, client_addr, query, state, query_start FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;

    Expected: Multiple connections running the slow SELECT * FROM ledger_entries query, all from payment-svc pods.

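  3. Terminate the slow batch queries to free connections (this acts on the database; it is not a read-only check).
    SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE query LIKE '%FROM ledger_entries WHERE status%' AND state = 'active' AND pid <> pg_backend_pid();

    Expected: The matching backends are terminated and their connections are freed. The pid <> pg_backend_pid() filter keeps the statement from terminating its own session, whose query text also matches the pattern.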

  4. Verify the deploy manifest for the cron job configuration.
    kubectl get deployment payment-svc -n prod -o yaml | grep -A 10 'cron'

    Expected: Cron schedule and command for the settlement batch.
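
The result of step 2 is easier to read when grouped. A minimal sketch, assuming the pg_stat_activity rows have already been fetched as dictionaries; the helper names are hypothetical and the sample rows are made up for illustration, not taken from this incident.

```python
from collections import Counter


def summarize_activity(rows):
    """Count sessions per (application_name, state) pair."""
    return Counter((row["application_name"], row["state"]) for row in rows)


def count_matching(rows, fragment):
    """Count sessions whose current query text contains the given fragment."""
    return sum(1 for row in rows if fragment in row["query"])


# Hypothetical sample rows shaped like the step 2 result set.
sample_rows = [
    {"application_name": "payment-svc", "state": "active",
     "query": "SELECT * FROM ledger_entries WHERE status='pending'"},
    {"application_name": "payment-svc", "state": "active",
     "query": "SELECT * FROM ledger_entries WHERE status='pending'"},
    {"application_name": "payment-svc", "state": "idle in transaction",
     "query": "SELECT 1"},
]

summary = summarize_activity(sample_rows)
for (app, state), count in summary.most_common():
    print(app, state, count)

print("batch query sessions:",
      count_matching(sample_rows, "FROM ledger_entries WHERE status"))
```

Many `active` sessions running the ledger_entries query would support the cron job hypothesis, while a large share of `idle in transaction` sessions would point more toward the connection leak hypothesis.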

Mitigation plan

  • Immediately terminate the slow queries to free connections and restore service.

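    Risk: Postgres rolls back any open transaction of a terminated session, so the affected settlement batch run will fail and may need to be re-run. The slow queries will also return on the next scheduled run unless the job is disabled.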

    Rollback: If service does not recover, restart payment-svc pods to reset connection pools.

  • Roll back payment-svc to the previous version (v2.40) to remove the cron job.

    Risk: Rollback may cause brief downtime during deployment; ensure no data loss from the batch.

    Rollback: Re-deploy v2.41 if rollback causes issues, but first disable the cron via feature flag.

  • Add an index on ledger_entries.status to prevent full table scans.

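    Risk: A plain CREATE INDEX blocks writes to ledger_entries for the whole build, which on ~12M rows may not be brief. CREATE INDEX CONCURRENTLY avoids blocking writes at the cost of a slower build.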

    Rollback: Drop the index if it causes performance degradation.
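
One way the index mitigation could be applied without blocking writes is sketched below. It assumes the psycopg2 driver is available and that the index is a plain single-column index; the index name `idx_ledger_entries_status`, the function name, and the DSN argument are hypothetical. Autocommit is enabled because CREATE INDEX CONCURRENTLY cannot run inside a transaction block.

```python
# Hypothetical index name; table and column come from the incident context.
INDEX_SQL = (
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_ledger_entries_status "
    "ON ledger_entries (status)"
)


def create_status_index(dsn):
    """Build the index without blocking writes to ledger_entries."""
    import psycopg2  # third-party driver, imported lazily (assumed installed)

    conn = psycopg2.connect(dsn)
    try:
        # CONCURRENTLY is rejected inside a transaction block, so every
        # statement must be committed on its own.
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(INDEX_SQL)
    finally:
        conn.close()
```

If a concurrent build fails partway, Postgres leaves an invalid index behind. It has to be dropped before retrying, because IF NOT EXISTS would otherwise skip the rebuild.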

Customer impact

Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of requests are failing with HTTP 500. Affected users see 'payment failed' messages. Estimated impact: thousands of users over the last 15 minutes (rough estimate, not yet confirmed against request volume).

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 13:50: Deploy payment-svc v2.41 with new cron job
  • 14:02: p99 latency spikes to 4.8s, error rate rises
  • 14:03: CS reports failed checkouts
  • 14:04: Pager alert
  • 14:06: Initial mis-diagnosis (DB CPU fine)
  • [FILL IN] Mitigation actions

Impact

  • 12% error rate on payment-svc for ~15 minutes
  • Failed checkouts for customers

Root Cause

New cron job in v2.41 runs a full table scan query on ledger_entries every 30s (no index on status), and Postgres reached its connection limit (500 / 500 max_connections).

Detection

Pager alert and customer reports.

Response

[FILL IN]

What Went Well

[FILL IN]

What Went Poorly

  • Initial mis-diagnosis (CPU fine ≠ DB healthy)
  • No pre-deploy review of new cron job
  • Missing index on ledger_entries.status

Action Items

  • Add index on ledger_entries.status
  • Add connection pool monitoring alert
  • Add slow query alert for queries >1s
  • Review deploy process for cron jobs
  • Add feature flag for new batch jobs
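
The feature flag action item could look roughly like the sketch below. It assumes the flag is read from an environment variable and defaults to off; the variable name `SETTLEMENT_BATCH_ENABLED` and both function names are hypothetical, since the report does not say how payment-svc manages flags.

```python
import os

FLAG_NAME = "SETTLEMENT_BATCH_ENABLED"  # hypothetical flag name


def settlement_batch_enabled(env=None):
    """Return True only when the flag is explicitly switched on."""
    env = os.environ if env is None else env
    return env.get(FLAG_NAME, "false").strip().lower() in {"1", "true", "yes"}


def run_settlement_batch(job, env=None):
    """Run the batch callable only if the flag is on; otherwise skip it."""
    if not settlement_batch_enabled(env):
        return "skipped"
    return job()


# Example: the flag is absent, so the job is skipped without a rollback.
print(run_settlement_batch(lambda: "ran", env={}))
```

Defaulting to off means a new batch job has to be enabled deliberately after deploy, and it can be disabled again without shipping a new version.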

Follow-ups

  • P0 · Add index on ledger_entries.status column · Database team
  • P1 · Set up alert on Postgres active_connections > 80% max · Platform team
  • P1 · Add slow query monitoring and alert for queries >1s · Observability team
  • P2 · Review deploy process to require approval for new cron jobs · Service owner
  • P1 · Implement feature flag for settlement batch to disable without rollback · Payment-svc team
  • P2 · Update runbook for connection pool exhaustion · On-call SRE
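
The two alert follow-ups reduce to simple threshold checks. The sketch below shows the logic only, using the 80% and 1s thresholds from the list above; the function names are hypothetical, and a real alert would live in the monitoring system rather than in application code.

```python
def connection_alert(active, max_connections, threshold=0.80):
    """True when active connections exceed the given fraction of max_connections."""
    return active / max_connections > threshold


def slow_queries(durations_s, limit_s=1.0):
    """Return the query durations (in seconds) that exceed the limit."""
    return [d for d in durations_s if d > limit_s]


# Values from this incident: 500 / 500 connections, ~2.4s batch query.
print(connection_alert(500, 500))
print(slow_queries([0.12, 2.4]))
```

With these thresholds, both conditions hold for the values recorded in this incident.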