[datadog] checkout p99 8s

service: checkout
created: 6/3/2026, 12:13:38 AM

Raw incident context

Datadog priority: P2
Env: prod
Tags: service:checkout, env:prod

p99 latency spiked 200ms→8s at 14:02 UTC, no recent deploy, redis_active_conns flatlined at pool max

Summary

SEV1

Checkout p99 latency spiked from 200ms to 8s at 14:02 UTC while redis_active_conns sat flat at the pool maximum, which points to Redis connection pool exhaustion. The leading suspect is the payment-svc batch job introduced in v2.41, which caused a SEV1 on 2026-03-12 by exhausting the Postgres connection pool; whether that job is running now is unconfirmed. There was no recent deploy to checkout-svc. The latency spike is user-facing and is 16x the checkout-svc SLO of 500ms p99.

Severity reasoning: User-facing outage: checkout p99 latency 8s (16x SLO) and error rate likely >1% (based on previous similar incident). Revenue path broken as checkout failures impact payment processing. Matches SEV1 rubric: user-facing outage, error rate >1% for >5min, revenue path broken.

deepseek-chat·prompt v2·output: en·11700ms·3569↑ / 1609↓ tok·$0.00273

Root cause hypotheses

high · payment-svc batch job (v2.41) exhausting the Postgres connection pool, causing payment-svc to hang and block checkout-svc threads
Evidence: The 2026-03-12 incident had the same shape: a batch job consumed all 500 Postgres connections. There was no recent deploy to checkout-svc, and a payment-svc redeploy has not been ruled out. Caveat: the saturated metric in this alert is redis_active_conns, so Postgres saturation is still unconfirmed.
medium · Redis connection pool exhaustion because a slow downstream call (e.g., Stripe) makes checkout-svc hold Redis connections longer
Evidence: redis_active_conns is at pool max. The service catalog lists a 30s Stripe timeout and no circuit breaker on checkout-svc, so slow calls could accumulate held connections.
low · checkout-svc thread pool exhaustion from a traffic spike or a slow external dependency
Evidence: p99 latency spiked to 8s and the thread pool size is 200. Increased traffic or slow responses from payment-svc or inventory-svc could overwhelm it.
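
To make the medium and low hypotheses more concrete, the sketch below applies Little's law (concurrent in-flight calls = arrival rate x hold time) to the catalog figures: a thread pool of 200 and a 30s Stripe timeout. The helper name `saturation_rate` is hypothetical, and the actual rate of stalled calls during this incident is not known, so treat the result as an order-of-magnitude guide only.

```bash
# saturation_rate POOL_SIZE HOLD_SECONDS
# Prints the rate (req/s) of stalled calls at which every slot in the pool
# is occupied, per Little's law: concurrency = rate * hold time.
saturation_rate() {
  awk -v pool="$1" -v hold="$2" 'BEGIN { printf "%.2f\n", pool / hold }'
}

# Service catalog values: thread pool 200, Stripe timeout 30s.
saturation_rate 200 30
```

Under these assumptions, fewer than 7 requests per second hanging until the full timeout would be enough to occupy all 200 threads, which is why the missing circuit breaker matters even at modest traffic.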

Investigation checklist

Check payment-svc error logs for connection refused or batch job activity
```
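# --tail=-1: with a label selector, kubectl logs otherwise returns only the last 10 lines per pod
kubectl logs -n prod -l app=payment-svc --since=15m --tail=-1 | grep -iE "ERROR|FATAL|too many clients|batch" | head -50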
```
Expected: If batch job is culprit, see 'FATAL: sorry, too many clients already' or batch job log entries.

Check Postgres active connections and identify long-running queries

kubectl exec -n prod postgres-primary-0 -- psql -c "SELECT pid, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '1 minute' ORDER BY query_start;"

Expected: If the batch job is responsible, expect recurring queries of the form SELECT * FROM ledger_entries WHERE status='pending'.

Check Redis connection pool usage from checkout-svc
```
kubectl exec -n prod deployment/checkout-svc -- redis-cli -h redis-checkout info clients | grep connected_clients
```
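Expected: If the pool is exhausted, connected_clients sits near the configured pool max. The 80-per-instance figure in the service catalog is payment-svc's connection pool, so confirm checkout-svc's Redis pool limit before comparing.

Check checkout-svc thread pool status via metrics or logs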
```
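# --tail=-1: with a label selector, kubectl logs otherwise returns only the last 10 lines per pod
kubectl logs -n prod -l app=checkout-svc --since=15m --tail=-1 | grep -i "thread pool" | head -10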
```
Expected: If thread pool exhausted, see 'thread pool full' or 'rejected execution'.

Check recent deploys to payment-svc in last 2 hours
```
kubectl rollout history deployment/payment-svc -n prod | tail -5
```
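Expected: If a recent revision (e.g., v2.41) is present, it may have introduced the batch job. Rollout history lists revisions and change-cause only, not timestamps, so confirm the deploy time from deploy records.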
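
For the Redis check in the checklist above, a small helper can turn the raw `info clients` output into a utilisation figure. This is a minimal sketch: `redis_pool_usage` is a hypothetical name, the pool max passed in is a placeholder because checkout-svc's Redis pool limit is not documented here, and `connected_clients` counts every client connected to that Redis server, not only checkout-svc.

```bash
# redis_pool_usage POOL_MAX
# Reads `redis-cli info clients` output on stdin and prints
# "connected/max (pct%)". INFO output uses CRLF line endings, so strip \r.
redis_pool_usage() {
  tr -d '\r' | awk -F: -v max="$1" '
    $1 == "connected_clients" {
      printf "%d/%d (%d%%)\n", $2, max, ($2 * 100) / max
    }'
}

# Captured sample output; both the client count and the max of 80 are placeholders.
printf '# Clients\r\nconnected_clients:80\r\nblocked_clients:0\r\n' | redis_pool_usage 80
```

In practice the output of the `kubectl exec ... redis-cli ... info clients` command from the checklist would be piped into the function instead of the sample string.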

Mitigation plan

Kill long-running queries in Postgres to free connections immediately
Risk: May abort legitimate transactions; safe because payment-svc endpoints are idempotent. Blast radius: only affected queries.
Rollback: No rollback needed; killed queries can be retried.
Roll back payment-svc to previous version if batch job is confirmed
Risk: Rollback takes ~2 minutes; during that time, connections may remain exhausted. Blast radius: payment-svc only.
Rollback: Re-deploy the current version if rollback causes issues.
Increase Redis connection pool size temporarily if pool exhaustion is confirmed
Risk: May increase load on Redis; monitor memory. Blast radius: Redis instance.
Rollback: Revert pool size to original value.
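
The first mitigation step can be sketched as follows, assuming the long-running queries have already been confirmed via pg_stat_activity. `build_terminate_sql` is a hypothetical helper that only prints SQL; the `ledger_entries` pattern comes from the 2026-03-12 postmortem and may not match the current offender. `pg_terminate_backend` is the function named in the payment-svc runbook.

```bash
# build_terminate_sql MIN_AGE PATTERN
# Prints SQL that terminates non-idle backends whose current query started
# more than MIN_AGE ago and whose text matches PATTERN (case-insensitive).
# The session running this statement is excluded via pg_backend_pid().
build_terminate_sql() {
  cat <<EOF
SELECT pid, pg_terminate_backend(pid) AS terminated
FROM pg_stat_activity
WHERE state != 'idle'
  AND pid <> pg_backend_pid()
  AND query_start < now() - interval '$1'
  AND query ILIKE '%$2%';
EOF
}

# Review the generated SQL before running it anywhere.
build_terminate_sql '1 minute' 'ledger_entries'

# To apply it (not executed here), in the runbook order of kill first, then roll back:
#   kubectl exec -n prod postgres-primary-0 -- psql -c "$(build_terminate_sql '1 minute' 'ledger_entries')"
#   kubectl rollout undo deployment/payment-svc -n prod
```

Printing the statement first keeps the destructive step reviewable, and the query-text filter limits the blast radius to the suspected batch job rather than every slow query.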

Customer impact

Customers are seeing checkout delays of up to 8 seconds, and some checkout attempts may fail. All users attempting to complete purchases are affected. The failure rate for this incident has not been measured; the similar 2026-03-12 incident saw 12% of checkouts fail. No ETA yet.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

14:02 — p99 latency spike to 8s, alert triggered
[FILL IN] — On-call paged
[FILL IN] — Diagnosis: [FILL IN]
[FILL IN] — Mitigation applied
[FILL IN] — Latency returned to baseline

Impact

p99 latency 8s for checkout (SLO: 500ms)
Estimated error rate: [FILL IN]%
Affected users: all checkout traffic

Root Cause

[FILL IN]

Detection

Datadog alert on p99 latency spike

Response

[FILL IN]

What Went Well

[FILL IN]

What Went Poorly

[FILL IN]

Action Items

[FILL IN]

Follow-ups

P0 · Add circuit breaker to checkout-svc for downstream calls (Stripe, payment-svc) · storefront team
P0 · Fix payment-svc batch job to run nightly instead of every 30 seconds · payments-platform team
P1 · Add monitoring for Postgres active connections per service and alert on pool >80% · SRE team
P1 · Review and enforce runbook for rollback-first approach on payment-svc incidents · payments-platform team
P2 · Add Redis connection pool monitoring and alerting for checkout-svc · storefront team
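
The P1 Postgres monitoring item could start from a check like the one below. It is a sketch under stated assumptions: `pool_alert` is a hypothetical helper, the input is `application_name|count` rows as `psql -At` would emit for the runbook's GROUP BY query, the sample rows are invented, and 500 is the max_connections value from the payment-svc runbook.

```bash
# pool_alert MAX THRESHOLD_PCT
# Reads "application_name|count" rows on stdin, sums the counts, and prints
# total utilisation against MAX, flagged ALERT when above THRESHOLD_PCT.
pool_alert() {
  awk -F'|' -v max="$1" -v thr="$2" '
    { total += $2 }
    END {
      pct = total * 100 / max
      printf "%d/%d (%.0f%%) %s\n", total, max, pct, (pct > thr ? "ALERT" : "ok")
    }'
}

# Invented sample rows for illustration only.
printf 'payment-svc|410\nrefund-svc|30\nsubscription-svc|20\n' | pool_alert 500 80
```

A production alert would more likely live in the monitoring system itself; the point of the sketch is the threshold logic and the per-service breakdown it depends on.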

Internal docs used by the AI

[1] service · Service catalog (extract) · 14%

# Service catalog (extract)

## payment-svc
- **Team:** payments-platform
- **Tier:** SEV1 (revenue-critical)
- **Upstream:** checkout-svc, retry-orchestrator
- **Downstream:** Stripe Connect (us-east-1), fraud-svc, audit-log
- **DB:** postgres-primary.payments (shared with subscription-svc, refund-svc)
- **Region:** us-east-1 primary, us-west-2 warm replica
- **Notes:** All endpoints idempotent. Safe to retry. Connection pool 80/instance.

## checkout-svc
- **Team:** storefront
- **Tier:** SEV1
- **Upstream:** web-frontend, mobile-api
- **Downstream:** payment-svc, inventory-svc, fraud-svc, Stripe Connect (direct, for some flows)
- **DB:** postgres-storefront (dedicated)
- **Region:** us-east-1, us-west-2 (active-active)
- **Notes:** Stripe timeout is 30s. No circuit breaker as of 2026-Q1 (planned for Q2). Thread pool size 200.

## order-svc
- **Team:** storefront
- **Tier:** SEV2 (order placement requires this but read-only views can degrade)
- **Upstream:** checkout-svc, mobile-api
- **Downstream:** inventory-svc, notification-svc
- **DB:** postgres-orders
- **Region:** us-east-1, us-west-2
- **Notes:** Memory limit 512Mi. Watch for unbounded in-process caches — has bitten us twice.

[2] service · Service catalog (extract) · 14%

## catalog-svc
- **Team:** storefront
- **Tier:** SEV2 (catalog is read-heavy, cached aggressively)
- **Upstream:** web-frontend, mobile-api
- **Downstream:** postgres-catalog, Redis cache cluster `cache-catalog`
- **Region:** us-east-1, us-west-2
- **Notes:** Cache pre-warmed nightly at 02:00 UTC, TTL 7h. **Known issue:** cache stampede when TTL expires at peak; mitigation via singleflight is planned (ticket SRE-2014). Add jitter to TTL as workaround.

## api-gateway
- **Team:** platform
- **Tier:** SEV1
- **Upstream:** internet (via CloudFront)
- **Downstream:** all services
- **Region:** all regions
- **Notes:** nginx upstream timeout 60s. DNS TTL for internal CNAMEs is 30s (was 300s before 2025-Q4 — be aware of cached IPs across pods).

## SLOs
| Service | Availability | Latency p99 |
|---|---|---|
| payment-svc | 99.95% | 300ms |
| checkout-svc | 99.95% | 500ms |
| order-svc | 99.9% | 1s |
| catalog-svc | 99.95% | 200ms (cached) |
| api-gateway | 99.99% | 50ms (passthrough) |

## On-call escalation
1. Service team (PagerDuty)
2. SRE on-call (15 min if no ack)
3. Engineering manager (30 min if no resolution)
4. VP Eng (60 min, SEV1 only)

[3] runbook · Runbook: payment-svc · 13%

## SLO
- Availability: 99.95% (allows ~22min/month downtime)
- p99 latency: < 300ms (excluding Stripe call time)
- Error rate: < 0.1%

## Severity policy (overrides generic SEV rubric)
- Payment failure rate > 0.5% sustained 3min → **SEV1** (revenue impact)
- p99 > 1s for 10min → **SEV2**
- Single pod restart → not paged

## Useful commands
```bash
# Recent error breakdown
kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE "ERROR|FATAL" | awk '{print $NF}' | sort | uniq -c | sort -rn | head

# Active DB connections by app
kubectl exec -n prod postgres-primary-0 -- psql -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

# Force a rollback
kubectl rollout undo deployment/payment-svc -n prod
kubectl rollout status deployment/payment-svc -n prod
```

## Past incidents (most recent)
- 2026-03-12: SEV1, batch job v2.41 exhausted connection pool, 18min impact
- 2025-11-04: SEV2, slow Stripe response cascaded into thread exhaustion (fixed by adding circuit breaker)
- 2025-08-19: SEV1, OOM crashloop after upgrading json parser (in-process cache leak)

[4] runbook · Runbook: payment-svc · 13%

# Runbook: payment-svc

## Owner
Team: payments-platform
Slack: #payments-oncall
PagerDuty: payments-svc-primary

## What it does
`payment-svc` processes checkout transactions. Sits between `checkout-svc` (upstream) and Stripe Connect (downstream). All requests are idempotent — safe to retry.

## Architecture quick facts
- Runs as Kubernetes deployment `payment-svc` in `prod` namespace
- 12 replicas, HPA min=8 max=30, target CPU 70%
- Memory limit 512Mi, request 256Mi
- Connects to `postgres-primary.payments` (max_connections=500 shared with 4 other services)
- Connection pool: pgbouncer in transaction mode, pool_size=80 per app instance

## Common failure modes

### "FATAL: sorry, too many clients already" + p99 spike
- **Almost always** a runaway batch job holding connections during a long query
- Check recent deploys (last 2h) for new cron jobs or batch operations
- Query: `SELECT pid, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '1 minute' ORDER BY query_start;`
- **Mitigation**: kill the long-running query (`pg_terminate_backend(pid)`), THEN roll back the deploy
- **Do NOT** restart payment-svc pods — they'll thrash trying to reconnect to a saturated pool

### OOMKilled pods after deploy
- Memory profile must be flat under steady traffic
- If memory grows monotonically, check for: in-process caches without eviction, request-id keyed maps, retained event listeners
- **Rollback first**, debug after

[5] postmortem · Postmortem: payment-svc DB connection pool exhaustion - 2026-03-12 · 13%

# Postmortem: payment-svc DB connection pool exhaustion — 2026-03-12

**Severity:** SEV1
**Duration:** 18 minutes (14:02 – 14:20 UTC)
**Author:** Yan (on-call) · Reviewed by: payments-platform team

## Summary
A nightly settlement batch job introduced in payment-svc v2.41 began running every 30 seconds (intended: nightly), each invocation holding 8 database connections for ~2.4s while performing a full-table scan on `ledger_entries`. Within 3 minutes, the 500-connection Postgres pool was fully consumed, starving all foreground checkout traffic. 12% of checkouts failed during the window. Estimated revenue impact: $87k.

## Timeline (UTC)
- **13:50** — payment-svc v2.41 deployed (rolling, 100% by 13:54). Changelog mentioned "added nightly settlement batch".
- **14:02:08** — first alert: `PaymentSvcErrorRateHigh` (3% errors).
- **14:03:15** — CS reports failed checkout volume rising.
- **14:04** — On-call paged.
- **14:06** — Initial mis-diagnosis: "DB looks healthy, CPU is fine" — looked at CPU only, missed `active_connections` saturation.
- **14:11** — Correct diagnosis: connection pool exhausted (active_connections = 500/500).
- **14:14** — Identified batch job via `pg_stat_activity` showing recurring `SELECT * FROM ledger_entries WHERE status='pending'`.
- **14:16** — Rolled back to v2.40.
- **14:20** — Error rate returned to baseline.

Similar past incidents

lexical match (pg_trgm)