
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy

service: payment-svc · created: 5/25/2026, 10:20:04 PM

Raw incident context

Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes.
Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused"
from payment-svc → postgres-primary.

Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending'
  (no index on status; full table scan over ~12M rows, ~2.4s per execution)

Deploy history:
- 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.

On-call notes (Slack):
- 14:03 — CS reports failed checkout volume rising
- 14:04 — Pager: PaymentSvcErrorRateHigh
- 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)

Summary

SEV1

Payment service p99 latency spiked from 120ms to 4.8s and error rate rose to 12% due to a new cron job in v2.41 exhausting the Postgres connection pool (max 500) with a slow full-table-scan query on ledger_entries. Customers are experiencing failed checkouts.

Severity reasoning: User-facing outage with error rate >1% for >5 min (12% for ~10 min) and revenue path broken (failed checkouts).

deepseek-chat·prompt v2·output: en·11153ms

Root cause hypotheses

  • high: New cron job in payment-svc v2.41 runs a slow query (SELECT * FROM ledger_entries WHERE status='pending') every 30s, consuming connections and blocking other queries.

    Evidence: Deploy at 13:50 UTC added cron job; slow query log shows this query running every 30s with ~2.4s execution; active_connections at max (500).

  • medium: Connection pool leak in payment-svc due to a bug in v2.41, causing connections not to be released.

    Evidence: Application logs show 'too many clients already' but no direct evidence of leak; connection count is exactly max_connections.

  • low: External traffic spike causing legitimate connections to exceed max_connections.

    Evidence: No traffic spike reported; error rate correlates with deploy time, not traffic pattern.
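
A quick sanity check on the first hypothesis may help rank it against the second. The sketch below applies Little's law (mean occupancy = arrival rate × hold time) to the figures in the incident context. It assumes, and the incident data does not confirm this, that each execution of the settlement query takes one connection and releases it when the query finishes; the function names are illustrative only.

```python
# Back-of-the-envelope check using Little's law: mean occupancy = arrival rate x hold time.
# Assumption (not stated in the incident data): each execution of the settlement
# query takes one connection and releases it as soon as the query finishes.


def mean_connections_held(interval_s: float, duration_s: float) -> float:
    """Average number of connections occupied by one query issued every interval_s seconds."""
    return duration_s / interval_s


def parallel_runs_to_fill(pool_size: int, interval_s: float, duration_s: float) -> float:
    """How many copies of the query per interval would be needed to occupy the whole pool."""
    return pool_size / mean_connections_held(interval_s, duration_s)


# Figures from the incident context: one execution every 30 s, ~2.4 s each, 500 connections.
held = mean_connections_held(interval_s=30, duration_s=2.4)
needed = parallel_runs_to_fill(pool_size=500, interval_s=30, duration_s=2.4)

print(f"mean connections held by the query: {held:.2f}")
print(f"parallel executions per interval needed to fill the pool: {needed:.0f}")
```

Under that assumption, a single execution every 30s occupies well under one connection on average, so the query alone would not account for 500 / 500. Something else would have to multiply its effect: the batch opening many connections at once, connections not being released (the second hypothesis), or payment requests queuing behind slower responses. The per-state connection counts in the investigation checklist should help tell these apart.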

Investigation checklist

  1. Check the cron job definition in the latest deployment.
    kubectl get cronjob -n prod -l app=payment-svc -o yaml | grep -A5 'schedule\|command'

    Expected: A cron job with schedule */30 * * * * and a command that runs the slow query.

  2. Identify the exact query and its execution plan.
    SELECT query, calls, total_exec_time, rows, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;
    EXPLAIN SELECT * FROM ledger_entries WHERE status='pending';

    Expected: The slow query with high total_exec_time and mean_exec_time > 2s, and a plan showing a sequential scan on ledger_entries.

  3. Check whether ledger_entries has an index on status.
    \d+ ledger_entries

    Expected: No index on status column.

  4. Check current active connections and their states.
    SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

    Expected: Many connections in 'active' state running the slow query.

  5. Check whether the cron job is still running and how much CPU and memory its pod uses.
    kubectl get pods -n prod -l app=payment-svc --field-selector status.phase=Running | grep -iE 'cron|settlement'
    kubectl top pods -n prod -l app=payment-svc

    Expected: A pod with name containing 'cron' or 'settlement'.
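
The schedule expected in step 1 is worth reading carefully, because the incident context describes the job three ways: "nightly" in the changelog, `*/30 * * * *` in the cron expression, and "every 30s" in the slow query log. The sketch below expands the minute and hour fields of that expression with a simplified parser (a subset of cron syntax, written for illustration and not taken from any cron library) and finds the first run after the 13:50 UTC deploy. The calendar date is a placeholder, since the incident context gives only times.

```python
from datetime import datetime, timedelta


def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field. Supports '*', '*/n', 'a', 'a-b', 'a-b/n' and comma lists."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            end = hi if step else start
        values.update(range(start, end + 1, step_n))
    return sorted(values)


def next_fire(after: datetime, minutes: list[int]) -> datetime:
    """First minute strictly after `after` whose minute value is in `minutes`.

    Only valid when every other cron field is '*', as in the schedule below.
    """
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while t.minute not in minutes:
        t += timedelta(minutes=1)
    return t


minute_field, hour_field, *_ = "*/30 * * * *".split()
minutes = expand_field(minute_field, 0, 59)
hours = expand_field(hour_field, 0, 23)
runs_per_day = len(minutes) * len(hours)

# Placeholder date; only the 13:50 UTC deploy time comes from the incident context.
deploy = datetime(2026, 1, 1, 13, 50)
first_run = next_fire(deploy, minutes)

print(minutes, runs_per_day, first_run.strftime("%H:%M"))
```

Read this way, the expression fires at minute 0 and minute 30 of every hour, so the job would be neither nightly nor every 30 seconds. The first run after the deploy would land at 14:00 UTC, two minutes before the latency spike in the incident context. The 30s cadence in the slow query log would then need a separate explanation, which is worth confirming against the job definition.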

Mitigation plan

  • Suspend the cron job immediately (this stops future runs only; a run already in progress continues): kubectl patch cronjob -n prod <cronjob-name> -p '{"spec":{"suspend":true}}'

    Risk: Nightly settlement batch will not run; financial reconciliation may be delayed. No data loss.

    Rollback: Resume the cron job: kubectl patch cronjob -n prod <cronjob-name> -p '{"spec":{"suspend":false}}'

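  • Increase max_connections temporarily to 1000 to give payment-svc headroom. Changing max_connections only takes effect after a Postgres restart, so treat this as a last resort once the cron job is suspended and the slow queries are killed.

    Risk: The restart drops every open connection to postgres-primary and briefly extends the outage. More connections also increase DB memory pressure; monitor CPU and memory. Safe only if the DB has headroom.

    Rollback: Set max_connections back to 500 and restart Postgres again: ALTER SYSTEM SET max_connections = 500; (SELECT pg_reload_conf(); does not apply this parameter.)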

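  • Kill the slow queries to free connections immediately (adjust the pattern to the query text found in the checklist): SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query LIKE 'SELECT * FROM ledger_entries%' AND pid <> pg_backend_pid();

    Risk: The terminated batch run fails and must be re-run. No data loss, since the query is a read-only SELECT.

    Rollback: No rollback needed; queries can be re-run.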

Customer impact

Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of requests are failing with HTTP 500; affected users see an error message and cannot complete purchases. Estimated impact: any user attempting checkout during the incident window may hit a failure or a slow response.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 13:50 - Deploy payment-svc v2.41 with new cron job
  • 14:02 - p99 latency spikes to 4.8s, error rate rises
  • 14:03 - CS reports failed checkouts
  • 14:04 - Pager alert
  • 14:06 - Initial mis-diagnosis: DB CPU fine
  • [FILL IN] - Mitigation actions taken

Impact

  • p99 latency 4.8s (up from 120ms)
  • 12% error rate
  • Failed checkouts for customers

Root Cause

New cron job in v2.41 runs a full-table-scan query on ledger_entries every 30s, exhausting the Postgres connection pool (max 500) and blocking legitimate payment queries.

Detection

Pager alert for PaymentSvcErrorRateHigh at 14:04 UTC.

Response

  • Suspended cron job
  • Increased max_connections
  • Killed slow queries

What Went Well

  • Quick detection via monitoring
  • Team collaboration in Slack

What Went Poorly

  • Initial mis-diagnosis (DB CPU fine)
  • No pre-deploy review of cron job query performance
  • Missing index on status column

Action Items

  • [FILL IN]

Follow-ups

  • P0 · Add index on ledger_entries.status to prevent full table scans. · Database team
  • P0 · Review and optimize the settlement batch query to use index or reduce frequency. · Payment service team
  • P1 · Add pre-deploy performance regression tests for new queries (e.g., EXPLAIN ANALYZE in CI). · Platform team
  • P1 · Set up alerting on connection pool usage (e.g., >80% max_connections). · Observability team
  • P2 · Document runbook for Postgres connection exhaustion incident. · SRE team
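
The connection pool alert in the follow-ups could be prototyped roughly as below. The 80% threshold comes from the follow-up item. The requirement of three consecutive samples, the class name, and the sample values are assumptions added for illustration, intended to keep a single brief spike from paging anyone.

```python
from collections import deque


class PoolUsageAlert:
    """Fire when pool utilization stays above a threshold for N consecutive samples."""

    def __init__(self, max_connections: int, threshold: float = 0.80, consecutive: int = 3):
        self.max_connections = max_connections
        self.threshold = threshold
        # Keeps only the most recent `consecutive` threshold checks.
        self.recent = deque(maxlen=consecutive)

    def observe(self, active_connections: int) -> bool:
        """Record one sample and return True if the alert should fire."""
        utilization = active_connections / self.max_connections
        self.recent.append(utilization > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


alert = PoolUsageAlert(max_connections=500)

# Hypothetical scrape values, not taken from the incident.
samples = [310, 405, 460, 500]
fired = [alert.observe(s) for s in samples]
print(fired)
```

With these sample values the alert fires on the fourth sample, the third in a row above 80%. In practice the same rule would more likely live in the monitoring system's own alert definitions than in application code.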