
[Eval][v3][en] Payment service connection pool exhaustion after batch job deploy

service: payment-svc
created: 6/2/2026, 11:36:05 PM

Raw incident context

Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes.
Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused"
from payment-svc → postgres-primary.

Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending'
  (no index on status; full table scan over ~12M rows, ~2.4s per execution)

Deploy history:
- 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.

On-call notes (Slack):
- 14:03 — CS reports failed checkout volume rising
- 14:04 — Pager: PaymentSvcErrorRateHigh
- 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)

Summary

SEV1

Payment service p99 latency spiked from 120ms to 4.8s and the error rate rose from 0.1% to 12% after a new cron job in v2.41 saturated PostgreSQL connections (500/500 max_connections). The job runs a slow full-table-scan query on ledger_entries. Customers are experiencing failed checkouts.

Severity reasoning: User-facing outage: error rate > 1% for >5 min (12% for ~12 min), revenue path broken (failed checkouts). Matches SEV1 criteria.

deepseek-chat·prompt v3·output: en·10953ms·1990↑ / 1688↓ tok·$0.00239

Root cause hypotheses

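  • high: New cron job in payment-svc v2.41 runs a slow query every 30s, exhausting DB connections.

    Evidence: The 13:50 UTC deploy added a settlement batch with cron '*/30 * * * *'. The slow query log shows SELECT * FROM ledger_entries WHERE status='pending' (no index, ~2.4s) running every 30s, and active_connections is at 500/500. Note that '*/30 * * * *' in standard cron syntax means every 30 minutes, so the 30s cadence seen in the slow query log still needs to be reconciled with the configured schedule.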

  • medium: Connection pool leak in payment-svc v2.41 due to a bug in the new code.

    Evidence: Application logs show 'too many clients already' and 'connection refused'. No other changes in 7 days. New code shipped at 13:50.

  • low: Sudden traffic spike causing legitimate connections to exceed max_connections.

    Evidence: No evidence of traffic spike in context. Error rate and latency correlate with deploy time, not traffic.
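To help weigh the first two hypotheses, a back-of-envelope estimate with Little's law (average concurrency = arrival rate × time each query holds a connection) shows how many connections the observed query cadence would occupy on its own. The sketch below is only arithmetic on the figures in the incident context (one query every 30s, ~2.4s each, 500 max connections). The function name and the `instances` parameter are illustrative assumptions, not anything taken from payment-svc.

```python
def avg_connections_held(interval_s: float, duration_s: float, instances: int = 1) -> float:
    """Little's law: average connections in use = arrival rate * hold time."""
    arrival_rate = instances / interval_s  # queries started per second
    return arrival_rate * duration_s


# Figures from the incident context; a single scheduler instance is an assumption.
single = avg_connections_held(interval_s=30, duration_s=2.4)
print(f"average connections held by one instance: {single:.2f}")

# Hold time each run would need for one instance to occupy all 500 slots by itself.
max_connections = 500
required_hold_s = max_connections * 30 / 1
print(f"hold time needed to fill {max_connections} slots: {required_hold_s:.0f}s")
```

Under these assumptions the query cadence alone occupies far fewer than 500 connections, so something must be multiplying the load: for example many replicas each running the job, connections not being returned after each run, or request traffic piling up behind the scan. This does not rule out the first hypothesis, but it suggests keeping the leak hypothesis open until steps 1 and 2 of the checklist have been run.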

Investigation checklist

  1. Check active connections and slow queries on PostgreSQL primary.
    SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC; SELECT pid, query_start, state, wait_event_type, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start LIMIT 20;

    Expected: Counts across all states (idle connections also count toward max_connections) summing to roughly 500; the ledger_entries SELECT among the longest-running queries.

  2. Verify the new cron job is running and its query.
    kubectl logs -n prod -l app=payment-svc --since=30m | grep -E 'settlement|cron|ledger_entries' | tail -20

    Expected: Logs showing cron execution and the slow query.

  3. Check if an index exists on ledger_entries.status.
    \d+ ledger_entries

    Expected: No index on status column.

  4. Check error rate and latency in monitoring.
    curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(payment_svc_http_requests_total{status=~"5.."}[5m])) / sum(rate(payment_svc_http_requests_total[5m]))' | jq .

    Expected: Error ratio of about 0.12 (12%).
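Step 4 expects a percentage, which means dividing the 5xx request rate by the total request rate. As a sketch of that arithmetic, the snippet below sums the sample values from two Prometheus instant-query responses and returns the ratio. The response shape follows the Prometheus HTTP API (each result carries `value` as a `[timestamp, "number as string"]` pair); the sample payloads and their numbers are made up for illustration.

```python
import json


def sum_instant_vector(response_json: str) -> float:
    """Sum the sample values of a Prometheus instant-query (vector) response."""
    body = json.loads(response_json)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    # Each result carries value = [unix_timestamp, "<number as string>"].
    return sum(float(series["value"][1]) for series in body["data"]["result"])


def error_ratio(errors_json: str, total_json: str) -> float:
    """Fraction of requests that returned 5xx; 0.0 when there is no traffic."""
    total = sum_instant_vector(total_json)
    return sum_instant_vector(errors_json) / total if total else 0.0


# Illustrative payloads only: 6 req/s of 5xx out of 50 req/s overall.
errors = json.dumps({"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"status": "500"}, "value": [1700000000, "6"]}]}})
total = json.dumps({"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"status": "200"}, "value": [1700000000, "44"]},
    {"metric": {"status": "500"}, "value": [1700000000, "6"]}]}})
print(f"{error_ratio(errors, total):.0%}")
```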

Mitigation plan

  • Immediately disable the cron job in payment-svc v2.41 by scaling down the cron deployment or removing the cron entry via ConfigMap.

    Risk: Settlement batch will not run until re-enabled; no data loss.

    Rollback: Re-apply the original ConfigMap or scale up the cron deployment.

  • Kill the slow queries currently running to free up connections.

    Risk: May cause incomplete settlement processing; safe as cron is disabled.

    Rollback: N/A (queries are read-only).

  • Increase max_connections temporarily (e.g., to 800) to restore service while investigating.

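    Risk: Changing max_connections requires a PostgreSQL restart, which drops all existing connections. A higher limit also increases DB memory pressure; monitor CPU/memory.

    Rollback: Set max_connections back to 500 (requires another restart).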

  • Add an index on ledger_entries.status to prevent future slow queries.

    Risk: A plain CREATE INDEX blocks writes to the table for the duration of the build; use CREATE INDEX CONCURRENTLY or schedule during low traffic.

    Rollback: DROP INDEX IF EXISTS idx_ledger_entries_status;
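For the query-killing step, the selection logic can be sketched as a filter over rows from `pg_stat_activity`: pick active backends whose query text matches the settlement scan and that have been running longer than a threshold, then terminate each with `pg_terminate_backend(pid)` (or `pg_cancel_backend(pid)` to cancel only the running query). The snippet models rows as plain dicts so it runs without a database. The sample rows, the `min_age_s` threshold, and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def pids_to_terminate(rows, needle, now, min_age_s=2.0):
    """Return pids of active backends running a query containing `needle`
    for at least `min_age_s` seconds. Rows mimic pg_stat_activity columns."""
    cutoff = now - timedelta(seconds=min_age_s)
    return [
        r["pid"]
        for r in rows
        if r["state"] == "active"
        and needle in r["query"]
        and r["query_start"] <= cutoff
    ]


now = datetime(2030, 1, 1, 0, 0, 0, tzinfo=timezone.utc)  # arbitrary sample time
scan = "SELECT * FROM ledger_entries WHERE status='pending'"
rows = [  # hypothetical rows, not real incident data
    {"pid": 101, "state": "active", "query": scan, "query_start": now - timedelta(seconds=5)},
    {"pid": 102, "state": "active", "query": "SELECT 1", "query_start": now - timedelta(seconds=10)},
    {"pid": 103, "state": "idle", "query": scan, "query_start": now - timedelta(seconds=60)},
]

# Print the statements rather than executing them, so they can be reviewed first.
for pid in pids_to_terminate(rows, "FROM ledger_entries WHERE status", now):
    print(f"SELECT pg_terminate_backend({pid});")
```

Filtering on `state == "active"` matters because `pg_stat_activity.query` still shows the last statement for idle sessions, and those are not the ones holding the scan.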

Customer impact

Customers are experiencing failed checkouts and payment processing errors. Approximately 12% of requests are failing with HTTP 500 errors. The issue started around 14:02 UTC and is ongoing.

Postmortem draft

Summary

Payment service (payment-svc) experienced a severe degradation from 14:02 to [FILL IN] UTC, with p99 latency rising from 120ms to 4.8s and error rate reaching 12%. Root cause was a new cron job in v2.41 that executed a slow full-table-scan query every 30 seconds, exhausting the PostgreSQL connection pool.

Timeline (UTC)

  • 13:50 - payment-svc v2.41 deployed with new settlement cron.
  • 14:02 - Latency and error rate spike detected.
  • 14:03 - Customer reports of failed checkouts.
  • 14:04 - Pager alert triggered.
  • 14:06 - Initial mis-diagnosis: 'DB looks healthy'.
  • [FILL IN] - Cron job disabled, connections freed.
  • [FILL IN] - Service restored.

Impact

  • Error rate: 12% (HTTP 500).
  • Latency: p99 4.8s.
  • Affected users: Customers attempting checkout during the incident; roughly 12% of requests failed.
  • Revenue impact: [FILL IN].

Root Cause

A new cron job added in v2.41 ran SELECT * FROM ledger_entries WHERE status='pending' every 30 seconds. The query performed a full table scan over ~12M rows (no index on status), taking ~2.4s each. This consumed all 500 available database connections, causing new requests to fail with 'too many clients already'.

Detection

Alert PaymentSvcErrorRateHigh fired at 14:04. At 14:06 the on-call engineer initially judged the database healthy because CPU was normal, overlooking that connections were saturated at 500/500.

Response

  • [FILL IN] steps taken.

What Went Well

  • Alerting worked.
  • Deploy history quickly identified.

What Went Poorly

  • Initial mis-diagnosis delayed mitigation.
  • No index on status column.
  • No pre-deploy testing for cron performance.

Action Items

  • Add index on ledger_entries.status.
  • Add connection pool monitoring alert.
  • Add slow query alerting.
  • Review cron job performance before deploy.
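One cheap pre-deploy review for a scheduled job is to expand its cron expression and compare the run count with the stated intent. The sketch below handles only the subset of standard five-field cron syntax needed here (`*`, `*/n`, single numbers, comma lists) and looks only at the minute and hour fields. It is not a full cron parser, and the function names are illustrative.

```python
def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field supporting '*', '*/n', 'a' and 'a,b,c'."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(lo, hi + 1))
        elif part.startswith("*/"):
            values.update(range(lo, hi + 1, int(part[2:])))
        else:
            values.add(int(part))
    return sorted(values)


def runs_per_day(expr: str) -> int:
    """Daily firings implied by the minute and hour fields.
    Assumes the day-of-month, month and day-of-week fields are all '*'."""
    minute, hour, *_rest = expr.split()
    return len(expand_field(minute, 0, 59)) * len(expand_field(hour, 0, 23))


expr = "*/30 * * * *"  # schedule quoted in the v2.41 changelog
print(expand_field(expr.split()[0], 0, 59))
print(runs_per_day(expr))
```

For `*/30 * * * *` this yields minutes 0 and 30 of every hour, which is 48 runs per day. That matches neither the "nightly" wording in the changelog nor the 30-second cadence in the slow query log, a gap worth resolving in the postmortem.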

Follow-ups

  • P0: Add index on ledger_entries.status to prevent full table scans. (database-administrator)
  • P1: Implement connection pool monitoring and alert when usage > 80%. (platform-team)
  • P1: Add slow query detection and alerting (e.g., pg_stat_statements). (database-administrator)
  • P2: Add pre-deploy performance testing for new cron jobs in staging. (service-owner)
  • P2: Review and improve incident response runbook for DB connection exhaustion. (on-call SRE)
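
The P1 connection-usage alert boils down to a ratio check against max_connections. A minimal sketch, with the 80% threshold taken from the follow-up list and everything else (function name, return shape) assumed for illustration:

```python
def pool_usage(active: int, max_connections: int, threshold: float = 0.8):
    """Return (usage_ratio, should_alert) for a connection count."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    usage = active / max_connections
    return usage, usage > threshold


print(pool_usage(500, 500))  # incident-time figures from the Postgres metrics
print(pool_usage(400, 500))  # exactly at the threshold: no alert with a strict '>'
```

Whether the comparison should be strict or inclusive, and how long the condition must hold before paging, are choices for the platform team; the sketch only fixes the arithmetic.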