[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy

service: payment-svc
created: 6/2/2026, 11:48:25 PM

Raw incident context

Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes.
Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused"
from payment-svc → postgres-primary.

Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending'
  (no index on status; full table scan over ~12M rows, ~2.4s per execution)

Deploy history:
- 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.

On-call notes (Slack):
- 14:03 — CS reports failed checkout volume rising
- 14:04 — Pager: PaymentSvcErrorRateHigh
- 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)

Summary

SEV1

Payment service p99 latency spiked from 120ms to 4.8s and error rate rose to 12% due to a new cron job in v2.41 that runs a full table scan on `ledger_entries` without an index, exhausting the Postgres connection pool. Customers are experiencing failed checkouts.

Severity reasoning: User-facing outage with error rate >1% for >5 minutes (12% for ~10 minutes) and revenue path broken (failed checkouts).

deepseek-chat·prompt v2·output: en·10429ms·1739↑ / 1585↓ tok·$0.00221

Root cause hypotheses

  • high: New cron job in payment-svc v2.41 runs SELECT * FROM ledger_entries WHERE status='pending' every 30s, causing a full table scan and long-running queries that exhaust DB connections.

    Evidence: Deploy at 13:50 UTC added cron job; slow query log shows this query running every 30s with ~2.4s execution; active_connections at max (500) with 87 waiting queries.

  • medium: Connection pool leak in payment-svc caused by a bug in v2.41, so connections are not returned to the pool.

    Evidence: Active connections are at max, but logs show no sign of leaked connections (no 'connection not closed' warnings).

  • low: External traffic spike increasing load on payment-svc and the DB.

    Evidence: No traffic spike observed in metrics; latency and error rate correlate with deploy time, not traffic.
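
A rough sanity check may help rank these hypotheses. By Little's law, the average number of connections a recurring query holds is its execution rate times its duration. The sketch below uses only the figures from the incident context (one execution every 30s, ~2.4s each). The function name is hypothetical, and it assumes a single scheduler instance by default and that each connection is released as soon as its query finishes; neither is confirmed by the source.

```shell
# Little's law: average connections held = instances * duration / interval.
# avg_connections INTERVAL_S DURATION_S [INSTANCES]   (hypothetical helper)
avg_connections() {
  awk -v interval="$1" -v duration="$2" -v n="${3:-1}" \
    'BEGIN { printf "%.2f\n", n * duration / interval }'
}

avg_connections 30 2.4      # single scheduler instance, prints 0.08
```

Under these assumptions one scheduler would hold about 0.08 connections on average, far below the 500-connection limit. It is therefore worth checking whether the job runs on every payment-svc instance, whether executions slow down and overlap under load, or whether connections stay open after the query completes (which would overlap with the medium hypothesis).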

Investigation checklist

  1. Check if the cron job is the source of the slow query by examining its SQL and execution plan.
    kubectl exec -n prod deploy/payment-svc -- cat /app/cron/settlement.sh | grep -i 'SELECT.*ledger_entries'

    Expected: The script contains the SELECT query with WHERE status='pending'.

  2. Confirm the slow query is consuming connections by checking pg_stat_activity for the query.
    psql -h postgres-primary -U postgres -c "SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE query LIKE '%ledger_entries%' AND state != 'idle' AND pid <> pg_backend_pid() ORDER BY query_start;"

    Expected: Multiple rows with the same query, old query_start timestamps, and state 'active'.

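  3. Confirm the cron job's actual schedule and whether it is currently running. The changelog's cron expression */30 * * * * means every 30 minutes, yet the slow query log shows the query every 30s, so one of the two needs explaining.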
    kubectl get cronjob -n prod -l app=payment-svc -o jsonpath='{.items[*].spec.schedule}'

    Expected: Should show '*/30 * * * *' or similar.

  4. Verify the DB connection pool settings in payment-svc config.
    kubectl exec -n prod deploy/payment-svc -- cat /app/config.yaml | grep -A5 'connection_pool'

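    Expected: Max pool size and timeout settings. Note that 500 is the Postgres max_connections limit, not necessarily the pool size; the per-instance pool size multiplied by the number of payment-svc instances should stay below it.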

  5. Check if there are any other long-running queries or locks.
    psql -h postgres-primary -U postgres -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC LIMIT 10;"

    Expected: The slow query should be at the top with duration >2s.
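
To tell a pile-up of active slow queries apart from connections that are open but not doing work, a per-state breakdown of pg_stat_activity may help. This is a minimal sketch: the helper name is hypothetical, and the host and user are the ones already used in the checklist commands.

```shell
# Hypothetical helper: prints a query that groups client connections by state.
conn_breakdown_sql() {
  cat <<'SQL'
SELECT state,
       count(*) AS connections,
       max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY connections DESC;
SQL
}

# Usage (not executed here):
#   psql -h postgres-primary -U postgres -c "$(conn_breakdown_sql)"
```

Mostly 'active' rows would point toward slow queries holding connections. Many 'idle in transaction' rows with a large longest_in_state would point toward connections not being released.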

Mitigation plan

  • Immediately disable the cron job by suspending it. If it is a Kubernetes CronJob, set spec.suspend to true; CronJobs have no replica count to scale, and suspending does not stop a run that has already started.

    Risk: Nightly settlement batch will not run; may cause delayed reconciliation but no data loss.

    Rollback: Re-enable the cron job by setting spec.suspend back to false.

  • Kill the long-running queries to free up connections.

    Risk: May cause incomplete settlement processing; safe as queries are SELECT only.

    Rollback: No rollback needed; queries can be re-run later.

  • Add an index on ledger_entries.status to prevent full table scans.

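    Risk: A plain CREATE INDEX blocks writes to the table for the whole build (~12M rows); use CREATE INDEX CONCURRENTLY to avoid that, at the cost of a slower build that cannot run inside a transaction block.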

    Rollback: Drop the index if it causes issues.
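
The three mitigations above could be sketched as a small dry-run script. Everything not in the incident context is an assumption: the CronJob name payment-svc-settlement, the index name idx_ledger_entries_status, and the premise that the batch is a Kubernetes CronJob rather than an in-process scheduler. With DRY_RUN=1 (the default) the functions only print the commands they would run.

```shell
# Sketch of the three mitigations. DRY_RUN=1 (default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"
NAMESPACE="${NAMESPACE:-prod}"
CRONJOB_NAME="${CRONJOB_NAME:-payment-svc-settlement}"   # hypothetical name
PSQL="psql -h postgres-primary -U postgres"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf 'DRY RUN: %s\n' "$*"
  else
    "$@"
  fi
}

# 1. Suspend the CronJob so no new runs are scheduled.
suspend_cron() {
  run kubectl patch cronjob "$CRONJOB_NAME" -n "$NAMESPACE" \
    -p '{"spec":{"suspend":true}}'
}

# 2. Cancel in-flight settlement queries. pg_cancel_backend stops the
#    running query only; the session itself stays connected.
cancel_slow_queries() {
  run $PSQL -c "SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query LIKE '%FROM ledger_entries WHERE status=%' AND pid <> pg_backend_pid();"
}

# 3. Build the index without blocking writes (assumed index name).
create_index() {
  run $PSQL -c "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_ledger_entries_status ON ledger_entries (status);"
}
```

Calling suspend_cron, cancel_slow_queries, and create_index in that order prints the planned commands for review; setting DRY_RUN=0 beforehand executes them. To undo step 1, the same patch with "suspend":false applies.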

Customer impact

Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of checkout attempts are failing with HTTP 500 errors. Affected users see an error message and cannot complete purchases. No ETA for full recovery yet.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 13:50: Deploy payment-svc v2.41 with new cron job
  • 14:02: p99 latency spikes to 4.8s, error rate rises
  • 14:03: CS reports failed checkouts
  • 14:04: Pager alert
  • 14:06: Initial mis-diagnosis: DB CPU fine
  • [FILL IN]: Root cause identified
  • [FILL IN]: Mitigation applied

Impact

  • 12% error rate on payment-svc for ~10 minutes
  • Failed checkouts for customers
  • No data loss

Root Cause

New cron job in v2.41 runs a full table scan on ledger_entries without an index, exhausting DB connection pool.

Detection

Pager alert for high error rate; customer reports.

Response

  • Disabled cron job
  • Killed long-running queries
  • Added index

What Went Well

  • Quick deploy rollback capability
  • DB metrics available

What Went Poorly

  • Initial mis-diagnosis (DB CPU fine)
  • No pre-deploy review of cron job SQL
  • Missing index on status column

Action Items

  • [FILL IN]

Follow-ups

  • P0: Add index on ledger_entries.status to prevent full table scans. (Database team)
  • P1: Review and test all cron jobs in staging before production deploy. (Service owner)
  • P1: Add monitoring for long-running queries and connection pool saturation. (Platform team)
  • P2: Implement pre-deploy SQL review for any new queries in deployments. (Service owner)
  • P2: Update runbook for payment-svc to include steps for connection pool exhaustion. (On-call SRE)
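
As a starting point for the connection saturation monitoring follow-up, a check could compare the number of client connections against max_connections. This is a minimal sketch, not the platform team's tooling: the function names and the 80% default threshold are assumptions.

```shell
# saturation_pct USED MAX -> integer percentage of connection slots in use.
saturation_pct() {
  echo $(( $1 * 100 / $2 ))
}

# check_saturation USED MAX [THRESHOLD_PCT]
# Returns non-zero at or above the threshold (80 is an assumed default).
check_saturation() {
  pct=$(saturation_pct "$1" "$2")
  if [ "$pct" -ge "${3:-80}" ]; then
    echo "ALERT: connections at ${pct}% of max_connections ($1/$2)"
    return 1
  fi
  echo "OK: connections at ${pct}% of max_connections ($1/$2)"
}

# Usage (not executed here):
#   used=$(psql -h postgres-primary -U postgres -Atc \
#     "SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend'")
#   max=$(psql -h postgres-primary -U postgres -Atc "SHOW max_connections")
#   check_saturation "$used" "$max"
```

With the values from this incident (500 of 500), the check would report 100% and return non-zero. Alerting below full saturation would have flagged the problem before clients started seeing "too many clients already".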