[Scenario] Payment service connection pool exhaustion after batch job deploy

service: payment-svc
created: 5/25/2026, 8:43:25 PM

Raw incident context

Time: 14:02 UTC. payment-svc p99 latency jumped from 120ms to 4.8s within ~3 minutes.
Error rate climbed from 0.1% to 12% (mostly HTTP 500).
Application logs (last 5min): repeated "FATAL: sorry, too many clients already" and "connection refused"
from payment-svc → postgres-primary.

Postgres metrics:
- CPU: 35% (normal)
- active_connections: 500 / 500 (max_connections)
- waiting_queries: 87
- slow_query_log shows a new query running every 30s: SELECT * FROM ledger_entries WHERE status='pending'
  (no index on status; full table scan over ~12M rows, ~2.4s per execution)

Deploy history:
- 13:50 UTC — payment-svc v2.41 shipped. Changelog mentions "added nightly settlement batch (cron: */30 * * * *)"
- No infra/DB changes in past 7 days.

On-call notes (Slack):
- 14:03 — CS reports failed checkout volume rising
- 14:04 — Pager: PaymentSvcErrorRateHigh
- 14:06 — "DB looks healthy, CPU is fine" (initial mis-diagnosis)

Summary

SEV1

Payment service p99 latency spiked from 120ms to 4.8s and error rate rose to 12% due to a new cron job in v2.41 that runs a full table scan query every 30s, exhausting the Postgres connection pool (max_connections=500). Customers are experiencing failed checkouts.

Severity reasoning: User-facing outage with >1% error rate (12%) for >5 minutes and revenue path broken (failed checkouts).


Root cause hypotheses

high: New cron job in payment-svc v2.41 runs SELECT * FROM ledger_entries WHERE status='pending' every 30s, causing full table scans and long-running queries that exhaust DB connections.
Evidence: Deploy at 13:50 UTC added cron job; slow query log shows this query running every 30s with ~2.4s execution; active_connections at max (500); app logs show 'too many clients already'.
medium: Connection pool leak in payment-svc v2.41 due to a bug in the new code, causing connections not to be released.
Evidence: Active connections at max; no other DB changes; but CPU is low, suggesting connections are idle or waiting.
low: External traffic spike causing increased DB connections.
Evidence: No traffic spike reported; error rate correlates with deploy time.
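
One way to separate the first two hypotheses is to look at what the 500 connections are actually doing. The sketch below is not part of the original triage output; it assumes the same pod, namespace, and superuser as the checklist commands, and the function name is hypothetical.

```shell
# Assumption: same pod, namespace, and DB user as the checklist commands.
PG_POD="postgres-primary-0"
PG_NS="prod"

# Group current connections by state and query prefix.
# Many 'active' rows running the ledger_entries scan point at the cron job;
# many 'idle' or 'idle in transaction' rows point at a connection leak.
connection_breakdown() {
  kubectl exec -n "$PG_NS" "$PG_POD" -- psql -U postgres -c "
    SELECT state, left(query, 60) AS query_prefix, count(*) AS connections
    FROM pg_stat_activity
    WHERE pid <> pg_backend_pid()
    GROUP BY state, left(query, 60)
    ORDER BY connections DESC
    LIMIT 10;"
}
```

Call `connection_breakdown` from a shell with cluster access. If most rows are idle rather than active, the leak hypothesis probably deserves more weight than its current "medium" rating.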

Investigation checklist

Check the slow query log for the specific query and its frequency.
```
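# On PostgreSQL 13+, the columns are total_exec_time and mean_exec_time instead of total_time and mean_time.
kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "SELECT query, calls, total_time, rows, mean_time FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"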
```
Expected: The query SELECT * FROM ledger_entries WHERE status='pending' should appear with high total_time and mean_time. pg_stat_statements normalizes literals, so it will likely show as status=$1.
Verify the cron job configuration in the new deployment.
```
kubectl get cronjob -n prod -l app=payment-svc -o yaml | grep -A5 'schedule\|command'
```
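Expected: A cron job with schedule '*/30 * * * *' and a command that references 'ledger_entries' and status='pending'. Note that '*/30 * * * *' fires every 30 minutes, while the slow query log shows the query every 30s, so confirm which schedule or in-process loop actually issues the query.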

Check current active connections and waiting queries on Postgres.
```
kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "SELECT count(*) AS total, count(*) FILTER (WHERE state <> 'idle') AS non_idle, count(*) FILTER (WHERE wait_event_type = 'Lock') AS waiting_on_locks FROM pg_stat_activity;"
```
Expected: Total connections near 500; waiting queries >0.

Check if the query is using an index on status column.
```
kubectl exec -n prod postgres-primary-0 -- psql -U postgres -c "\d+ ledger_entries" | grep -i 'status\|index'
```
Expected: No index on status column; full table scan expected.
Check payment-svc logs for connection errors.
```
kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE 'too many clients|connection refused|FATAL' | tail -20
```
Expected: Repeated 'FATAL: sorry, too many clients already' and 'connection refused'.
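
The missing-index check above can be backed up by asking the planner directly. This is a minimal sketch under the same assumptions as the checklist (pod `postgres-primary-0`, namespace `prod`, default database); the function name is hypothetical. It uses plain EXPLAIN rather than EXPLAIN ANALYZE so that it does not run another full scan during the incident.

```shell
# Assumption: same pod, namespace, and DB user as the checklist commands.
PG_POD="postgres-primary-0"
PG_NS="prod"

# Show the planner's chosen plan without executing the query.
explain_pending_scan() {
  kubectl exec -n "$PG_NS" "$PG_POD" -- psql -U postgres -c \
    "EXPLAIN SELECT * FROM ledger_entries WHERE status='pending';"
}
```

A `Seq Scan` (or `Parallel Seq Scan`) node on ledger_entries in the output would be consistent with the full table scan described in the slow query log.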

Mitigation plan

Immediately stop the cron job by suspending it or deleting the CronJob resource.
Risk: Nightly settlement batch will not run; financial reconciliation may be delayed. No data loss.
Rollback: Re-apply the original CronJob manifest from v2.41 or re-enable the cron job.
Increase max_connections on Postgres temporarily to 1000 to relieve pressure.
Risk: max_connections only takes effect after a Postgres restart, so applying this change itself causes downtime. It may also cause memory pressure on the DB; monitor memory usage. Safer alternative: first kill long-running queries.
Rollback: Set max_connections back to 500 and restart Postgres (requires downtime).
Kill the long-running queries that are blocking connections.
Risk: Terminated queries have their open transactions rolled back, so in-flight work is lost; no data corruption risk.
Rollback: No rollback needed; queries will restart if cron job is still active.
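
The suspend and kill steps above could look roughly like the following. This is a sketch, not a verified runbook: it assumes the batch runs as a Kubernetes CronJob in the `prod` namespace (as the checklist does), and the CronJob name `payment-svc-settlement` and both function names are hypothetical.

```shell
# Assumptions: namespace and pod match the checklist; the CronJob name is hypothetical.
PG_POD="postgres-primary-0"
PG_NS="prod"
CRONJOB_NAME="payment-svc-settlement"

# Stop new runs from being scheduled; already running Jobs are not affected.
suspend_settlement_cronjob() {
  kubectl patch cronjob "$CRONJOB_NAME" -n "$PG_NS" -p '{"spec":{"suspend":true}}'
}

# Terminate backends currently running the ledger_entries scan.
# Their transactions are rolled back by Postgres.
terminate_pending_scans() {
  kubectl exec -n "$PG_NS" "$PG_POD" -- psql -U postgres -c "
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE pid <> pg_backend_pid()
      AND state = 'active'
      AND query LIKE 'SELECT * FROM ledger_entries WHERE status=%';"
}
```

Running `suspend_settlement_cronjob` before `terminate_pending_scans` avoids killed queries being immediately replaced by a new run. To re-enable the job later, the same patch with `"suspend":false` should work.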

Customer impact

Customers are experiencing failed checkouts with HTTP 500 errors. Approximately 12% of checkout attempts are failing. Affected users see an error page and cannot complete purchases. No ETA yet.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

13:50: payment-svc v2.41 deployed, adding a cron job for nightly settlement.
14:02: p99 latency spikes to 4.8s, error rate rises to 12%.
14:03: CS reports failed checkouts.
14:04: Pager alert triggered.
14:06: Initial mis-diagnosis: DB CPU looks fine.
[FILL IN]: Investigation identifies slow query and connection exhaustion.
[FILL IN]: Mitigation actions taken.

Impact

12% error rate on payment-svc for ~[FILL IN] minutes.
Failed checkouts for customers.
No data loss.

Root Cause

A new cron job in payment-svc v2.41 runs a full table scan query (SELECT * FROM ledger_entries WHERE status='pending') every 30 seconds, causing long-running queries that exhaust the Postgres connection pool (max_connections=500).

Detection

Pager alert for high error rate; customer reports.

Response

Suspended cron job.
Killed long-running queries.
Increased max_connections temporarily.

What Went Well

Quick identification of the slow query.
Effective rollback of cron job.

What Went Poorly

Initial mis-diagnosis (focused on CPU).
No pre-deployment testing of the cron job's query performance.
Missing index on status column.

Action Items

Add index on ledger_entries.status.
Add query performance checks in CI/CD.
Implement connection pool monitoring alerts.
Review cron job scheduling and resource usage.
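
As an illustration of the connection monitoring item, the check below compares current backends against max_connections. It is only a sketch: the pod and namespace are taken from the checklist, while the 80% default threshold and the function names are assumptions, and a real alert would more likely live in the existing metrics stack.

```shell
# Assumption: same pod, namespace, and DB user as the checklist commands.
PG_POD="postgres-primary-0"
PG_NS="prod"

# Print connected backends as a whole-number percentage of max_connections.
connection_usage_pct() {
  kubectl exec -n "$PG_NS" "$PG_POD" -- psql -U postgres -At -c "
    SELECT round(100.0 * sum(numbackends) / current_setting('max_connections')::int)
    FROM pg_stat_database;"
}

# Return non-zero when usage is at or above the threshold (hypothetical default: 80).
check_connection_usage() {
  threshold="${1:-80}"
  pct="$(connection_usage_pct)"
  case "$pct" in
    ''|*[!0-9]*) echo "ERROR: could not read connection usage"; return 2 ;;
  esac
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN: connections at ${pct}% of max_connections"
    return 1
  fi
  echo "OK: connections at ${pct}% of max_connections"
}
```

For example, `check_connection_usage 90` would warn only at 90% or above. During this incident (500 / 500) any threshold up to 100 would have fired.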

Follow-ups

P0: Add index on ledger_entries.status column to prevent full table scans. (Database team)
P1: Implement query performance regression tests in CI/CD pipeline for new deployments. (Platform team)
P1: Set up alerting on Postgres active connections approaching max_connections. (SRE team)
P2: Review cron job scheduling to avoid peak traffic times and add concurrency limits. (Service owner)
P2: Add runbook for connection pool exhaustion incident. (SRE team)
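
The P0 index and the P2 concurrency limit might be applied along these lines. This is a sketch under stated assumptions: the index name `idx_ledger_entries_status` and the CronJob name `payment-svc-settlement` are hypothetical, and the pod and namespace are the ones used in the checklist.

```shell
# Assumptions: pod and namespace match the checklist; index and CronJob names are hypothetical.
PG_POD="postgres-primary-0"
PG_NS="prod"
CRONJOB_NAME="payment-svc-settlement"

# Build the index without blocking writes to ledger_entries.
# CONCURRENTLY cannot run inside a transaction block, so it is sent as a single statement.
create_status_index() {
  kubectl exec -n "$PG_NS" "$PG_POD" -- psql -U postgres -c \
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_ledger_entries_status ON ledger_entries (status);"
}

# Prevent a new run from starting while the previous one is still in progress.
forbid_overlapping_runs() {
  kubectl patch cronjob "$CRONJOB_NAME" -n "$PG_NS" -p '{"spec":{"concurrencyPolicy":"Forbid"}}'
}
```

If only pending rows are ever queried, a partial index (`... ON ledger_entries (status) WHERE status = 'pending'`) may be smaller and cheaper to maintain; which variant fits is a decision for the Database team.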