[datadog] checkout p99 8s
service: checkout
created: 6/3/2026, 12:13:38 AM
Raw incident context
Datadog priority: P2
Env: prod
Tags: service:checkout, env:prod
p99 latency spiked 200ms→8s at 14:02 UTC, no recent deploy, redis_active_conns flatlined at pool max
Summary
Checkout p99 latency spiked from 200ms to 8s at 14:02 UTC while redis_active_conns flatlined at pool max, which points to connection pool exhaustion in checkout-svc. The leading hypothesis is that a payment-svc batch job is holding connections and stalling checkout requests; the batch job introduced in payment-svc v2.41 caused a SEV1 with similar symptoms on 2026-03-12. There was no recent deploy to checkout-svc, and a payment-svc deploy has not yet been confirmed. The spike is user-facing and exceeds the checkout-svc SLO of 500ms p99.
Severity reasoning: User-facing outage: checkout p99 latency is 8s (16x the 500ms SLO), and the error rate is likely >1% (inferred from the 2026-03-12 incident, not yet measured). The revenue path is broken because checkout failures block payment processing. This matches the SEV1 rubric: user-facing outage, error rate >1% for >5min, revenue path broken.
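The 16x figure and the rubric criteria can be checked with a small sketch. The thresholds come from the SEV1 rubric quoted above; the function names are illustrative, and the 12% error rate and 6-minute duration in the example are placeholder inputs (the real error rate has not been measured yet).

```python
def slo_multiple(observed_p99_ms: float, slo_p99_ms: float) -> float:
    """How many times the observed p99 exceeds the SLO target."""
    return observed_p99_ms / slo_p99_ms


def sev1_criteria(error_rate: float, minutes_sustained: float,
                  user_facing: bool, revenue_path_broken: bool) -> dict:
    """Evaluate each rubric criterion separately so gaps in evidence stay visible."""
    return {
        "user_facing_outage": user_facing,
        "error_rate_over_1pct_for_5min": error_rate > 0.01 and minutes_sustained > 5,
        "revenue_path_broken": revenue_path_broken,
    }


# checkout-svc: 8s observed p99 against the 500ms SLO
print(slo_multiple(8000, 500))  # 16.0

# Placeholder inputs: error rate borrowed from the previous incident, duration assumed
print(sev1_criteria(error_rate=0.12, minutes_sustained=6,
                    user_facing=True, revenue_path_broken=True))
```

Reporting each criterion on its own makes it clear which parts of the severity call rest on measurement and which rest on the earlier incident.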
deepseek-chat·prompt v2·output: en·11700ms·3569↑ / 1609↓ tok·$0.00273
Root cause hypotheses
- high: payment-svc batch job (v2.41) exhausting the Postgres connection pool, causing payment-svc to hang and block checkout-svc threads
Evidence: Datadog shows redis_active_conns flatlined at pool max, which is consistent with checkout-svc requests holding Redis connections while they wait on payment-svc. The 2026-03-12 incident had similar symptoms, caused by a batch job consuming all 500 Postgres connections. There was no recent deploy to checkout-svc; whether payment-svc was redeployed is not yet confirmed.
- medium: Redis connection pool exhaustion due to a slow downstream call (e.g., Stripe) causing checkout-svc to hold Redis connections longer
Evidence: redis_active_conns at pool max; Stripe timeout is 30s and no circuit breaker on checkout-svc (as per service catalog). Could cause connection accumulation.
- low: checkout-svc thread pool exhaustion from a traffic spike or slow external dependency
Evidence: p99 latency spike to 8s; thread pool size is 200. Could be overwhelmed by increased traffic or slow responses from payment-svc or inventory-svc.
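The connection-accumulation argument in the medium hypothesis follows from Little's law: the average number of connections in use equals the request rate multiplied by how long each request holds a connection. The sketch below uses that relationship; the 50 requests/second rate, the 20ms normal hold time, and the pool size of 100 are made-up illustrative numbers, and only the 30s Stripe timeout comes from the service catalog.

```python
def connections_in_use(requests_per_sec: float, hold_time_ms: float) -> float:
    """Little's law: average concurrency = arrival rate x average hold time."""
    return requests_per_sec * hold_time_ms / 1000


def saturates(requests_per_sec: float, hold_time_ms: float, pool_max: int) -> bool:
    """True when steady-state demand meets or exceeds the pool size."""
    return connections_in_use(requests_per_sec, hold_time_ms) >= pool_max


# Hypothetical healthy state: 50 req/s, each holding a connection for 20ms
print(connections_in_use(50, 20))        # 1.0

# Same traffic, but each request holds its connection across a 30s Stripe timeout
print(connections_in_use(50, 30_000))    # 1500.0
print(saturates(50, 30_000, pool_max=100))  # True
```

Even with modest traffic, a hold time that stretches to the downstream timeout multiplies demand by orders of magnitude, which is why a fixed-size pool flatlines at its maximum.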
Investigation checklist
- Check payment-svc error logs for connection refused or batch job activity
kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE "ERROR|FATAL|too many clients|batch" | head -50
Expected: If the batch job is the culprit, 'FATAL: sorry, too many clients already' or batch job log entries.
- Check Postgres active connections and identify long-running queries
kubectl exec -n prod postgres-primary-0 -- psql -c "SELECT pid, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '1 minute' ORDER BY query_start;"
Expected: If the batch job is the culprit, recurring "SELECT * FROM ledger_entries WHERE status='pending'" queries.
- Check Redis connection pool usage from checkout-svc
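kubectl exec -n prod deployment/checkout-svc -- redis-cli -h redis-checkout info clients | grep connected_clients
Expected: If the pool is exhausted, connected_clients is near the configured maximum. Note that connected_clients counts every client connected to the Redis server, not only one pod's pool, and that the 80-per-instance figure in the service catalog is payment-svc's Postgres pool, not the checkout-svc Redis pool.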
- Check checkout-svc thread pool status via metrics or logs
kubectl logs -n prod -l app=checkout-svc --since=15m | grep -iE "thread pool|rejected execution" | head -10
Expected: If the thread pool is exhausted, 'thread pool full' or 'rejected execution'.
- Check recent deploys to payment-svc in last 2 hours
kubectl rollout history deployment/payment-svc -n prod | tail -5
Expected: If a recent deploy (e.g., v2.41) is present, it may have introduced the batch job.
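For the Redis check in this list, the raw output is easier to act on as a utilization figure. The following is a minimal sketch that parses the `connected_clients` field from `INFO clients` output (which uses `key:value` lines) and compares it with a limit. The function names, the limit of 100, and the 80% alert threshold in the example are assumptions; the real pool limit for checkout-svc is not stated in this report.

```python
def parse_connected_clients(info_output: str) -> int:
    """Extract connected_clients from Redis INFO output (key:value lines)."""
    for line in info_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "connected_clients":
            return int(value)
    raise ValueError("connected_clients not found in INFO output")


def utilization(connected: int, limit: int) -> float:
    """Fraction of the limit currently in use."""
    return connected / limit


# Sample text in the shape redis-cli prints; the values are invented
sample = "# Clients\r\nconnected_clients:78\r\nblocked_clients:0\r\n"

connected = parse_connected_clients(sample)
limit = 100  # hypothetical limit, replace with the configured value
print(connected, utilization(connected, limit) > 0.8)
```

The parser also accepts the single line that survives the `grep connected_clients` filter, so the command output can be piped in unchanged.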
Mitigation plan
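Kill long-running queries in Postgres (pg_terminate_backend(pid), per the payment-svc runbook) to free connections immediately
Risk: May abort legitimate transactions. payment-svc endpoints are idempotent, so its aborted requests can be retried, but postgres-primary.payments is shared with other services (subscription-svc, refund-svc), so confirm which application owns a query before killing it. Blast radius: the terminated queries only.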
Rollback: No rollback needed; killed queries can be retried.
Roll back payment-svc to previous version if batch job is confirmed
Risk: Rollback takes ~2 minutes; during that time, connections may remain exhausted. Blast radius: payment-svc only.
Rollback: Re-deploy the current version if rollback causes issues.
Increase Redis connection pool size temporarily if pool exhaustion is confirmed
Risk: May increase load on Redis; monitor memory. Blast radius: Redis instance.
Rollback: Revert pool size to original value.
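The first mitigation step can be scoped before it is run. Because postgres-primary.payments is shared, a sketch like the following builds a preview query and a terminate query that share one filter, so what is reviewed is exactly what gets terminated. The helper name, the `payment-svc-batch` application name, and the psycopg-style `%s` placeholders are assumptions for illustration; `pg_stat_activity`, `pg_backend_pid()` and `pg_terminate_backend()` are standard Postgres.

```python
from typing import Tuple

# One shared filter so the preview and the terminate statement target the same rows.
_FILTER = (
    "FROM pg_stat_activity "
    "WHERE application_name = %s "
    "AND state != 'idle' "
    "AND query_start < now() - %s::interval "
    "AND pid <> pg_backend_pid()"  # never terminate our own session
)


def build_queries(application_name: str,
                  min_age: str = "1 minute") -> Tuple[str, str, tuple]:
    """Return (preview_sql, terminate_sql, params) for a parameterised driver."""
    preview = "SELECT pid, query_start, query " + _FILTER + " ORDER BY query_start;"
    terminate = "SELECT pid, pg_terminate_backend(pid) " + _FILTER + ";"
    return preview, terminate, (application_name, min_age)


# "payment-svc-batch" is a hypothetical application_name; check pg_stat_activity first
preview_sql, terminate_sql, params = build_queries("payment-svc-batch")
print(preview_sql)
print(params)
```

Running the preview first and reading the returned queries gives the on-call a chance to stop if another service's work shows up in the result.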
Customer impact
Customers are experiencing checkout delays of 8 seconds or more at p99, and some checkout attempts may fail. All users attempting to complete purchases are affected. Estimated impact: about 12% of checkouts failing, extrapolated from the 2026-03-12 incident rather than measured. No ETA yet.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 14:02 — p99 latency spike to 8s, alert triggered
- [FILL IN] — On-call paged
- [FILL IN] — Diagnosis: [FILL IN]
- [FILL IN] — Mitigation applied
- [FILL IN] — Latency returned to baseline
Impact
- p99 latency 8s for checkout (SLO: 500ms)
- Estimated error rate: [FILL IN]%
- Affected users: all checkout traffic
Root Cause
[FILL IN]
Detection
Datadog alert on p99 latency spike
Response
[FILL IN]
What Went Well
[FILL IN]
What Went Poorly
[FILL IN]
Action Items
- [FILL IN]
Follow-ups
- P0: Add circuit breaker to checkout-svc for downstream calls (Stripe, payment-svc). Owner: storefront team
- P0: Fix payment-svc batch job to run nightly instead of every 30 seconds. Owner: payments-platform team
- P1: Add monitoring for Postgres active connections per service and alert on pool >80%. Owner: SRE team
- P1: Review and enforce runbook for rollback-first approach on payment-svc incidents. Owner: payments-platform team
- P2: Add Redis connection pool monitoring and alerting for checkout-svc. Owner: storefront team
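To make the circuit breaker follow-up concrete, here is a minimal sketch of the pattern: after a run of consecutive failures the breaker opens and fails fast, then allows one trial call once a cool-down has passed. The class name, thresholds, and injected clock are illustrative assumptions, the sketch is not thread-safe, and a production checkout-svc would more likely adopt an established library than this code.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock          # injectable so the behaviour can be tested
        self._failures = 0           # consecutive failures seen while closed
        self._opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                # Open: fail fast instead of tying up a thread and a connection
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call (half-open)
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                # Trial call failed, or threshold reached: (re)open the circuit
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a downstream call as `breaker.call(charge_card, order)` (where `charge_card` is a hypothetical function) means that during a downstream stall, requests fail in microseconds rather than waiting out the 30s Stripe timeout.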
Internal docs used by the AI
- [1] service | Service catalog (extract) | 14%
# Service catalog (extract)

## payment-svc
- **Team:** payments-platform
- **Tier:** SEV1 (revenue-critical)
- **Upstream:** checkout-svc, retry-orchestrator
- **Downstream:** Stripe Connect (us-east-1), fraud-svc, audit-log
- **DB:** postgres-primary.payments (shared with subscription-svc, refund-svc)
- **Region:** us-east-1 primary, us-west-2 warm replica
- **Notes:** All endpoints idempotent. Safe to retry. Connection pool 80/instance.

## checkout-svc
- **Team:** storefront
- **Tier:** SEV1
- **Upstream:** web-frontend, mobile-api
- **Downstream:** payment-svc, inventory-svc, fraud-svc, Stripe Connect (direct, for some flows)
- **DB:** postgres-storefront (dedicated)
- **Region:** us-east-1, us-west-2 (active-active)
- **Notes:** Stripe timeout is 30s. No circuit breaker as of 2026-Q1 (planned for Q2). Thread pool size 200.

## order-svc
- **Team:** storefront
- **Tier:** SEV2 (order placement requires this but read-only views can degrade)
- **Upstream:** checkout-svc, mobile-api
- **Downstream:** inventory-svc, notification-svc
- **DB:** postgres-orders
- **Region:** us-east-1, us-west-2
- **Notes:** Memory limit 512Mi. Watch for unbounded in-process caches; this has bitten us twice.
- [2] service | Service catalog (extract) | 14%
## catalog-svc
- **Team:** storefront
- **Tier:** SEV2 (catalog is read-heavy, cached aggressively)
- **Upstream:** web-frontend, mobile-api
- **Downstream:** postgres-catalog, Redis cache cluster `cache-catalog`
- **Region:** us-east-1, us-west-2
- **Notes:** Cache pre-warmed nightly at 02:00 UTC, TTL 7h. **Known issue:** cache stampede when TTL expires at peak; mitigation via singleflight is planned (ticket SRE-2014). Add jitter to TTL as workaround.

## api-gateway
- **Team:** platform
- **Tier:** SEV1
- **Upstream:** internet (via CloudFront)
- **Downstream:** all services
- **Region:** all regions
- **Notes:** nginx upstream timeout 60s. DNS TTL for internal CNAMEs is 30s (was 300s before 2025-Q4; be aware of cached IPs across pods).

## SLOs
| Service | Availability | Latency p99 |
|---|---|---|
| payment-svc | 99.95% | 300ms |
| checkout-svc | 99.95% | 500ms |
| order-svc | 99.9% | 1s |
| catalog-svc | 99.95% | 200ms (cached) |
| api-gateway | 99.99% | 50ms (passthrough) |

## On-call escalation
1. Service team (PagerDuty)
2. SRE on-call (15 min if no ack)
3. Engineering manager (30 min if no resolution)
4. VP Eng (60 min, SEV1 only)
- [3] runbook | Runbook: payment-svc | 13%
## SLO
- Availability: 99.95% (allows ~22min/month downtime)
- p99 latency: < 300ms (excluding Stripe call time)
- Error rate: < 0.1%

## Severity policy (overrides generic SEV rubric)
- Payment failure rate > 0.5% sustained 3min → **SEV1** (revenue impact)
- p99 > 1s for 10min → **SEV2**
- Single pod restart → not paged

## Useful commands
```bash
# Recent error breakdown
kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE "ERROR|FATAL" | awk '{print $NF}' | sort | uniq -c | sort -rn | head

# Active DB connections by app
kubectl exec -n prod postgres-primary-0 -- psql -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"

# Force a rollback
kubectl rollout undo deployment/payment-svc -n prod
kubectl rollout status deployment/payment-svc -n prod
```

## Past incidents (most recent)
- 2026-03-12: SEV1, batch job v2.41 exhausted connection pool, 18min impact
- 2025-11-04: SEV2, slow Stripe response cascaded into thread exhaustion (fixed by adding circuit breaker)
- 2025-08-19: SEV1, OOM crashloop after upgrading json parser (in-process cache leak)
- [4] runbook | Runbook: payment-svc | 13%
# Runbook: payment-svc

## Owner
Team: payments-platform
Slack: #payments-oncall
PagerDuty: payments-svc-primary

## What it does
`payment-svc` processes checkout transactions. Sits between `checkout-svc` (upstream) and Stripe Connect (downstream). All requests are idempotent and safe to retry.

## Architecture quick facts
- Runs as Kubernetes deployment `payment-svc` in `prod` namespace
- 12 replicas, HPA min=8 max=30, target CPU 70%
- Memory limit 512Mi, request 256Mi
- Connects to `postgres-primary.payments` (max_connections=500 shared with 4 other services)
- Connection pool: pgbouncer in transaction mode, pool_size=80 per app instance

## Common failure modes

### "FATAL: sorry, too many clients already" + p99 spike
- **Almost always** a runaway batch job holding connections during a long query
- Check recent deploys (last 2h) for new cron jobs or batch operations
- Query: `SELECT pid, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '1 minute' ORDER BY query_start;`
- **Mitigation**: kill the long-running query (`pg_terminate_backend(pid)`), THEN roll back the deploy
- **Do NOT** restart payment-svc pods: they'll thrash trying to reconnect to a saturated pool

### OOMKilled pods after deploy
- Memory profile must be flat under steady traffic
- If memory grows monotonically, check for: in-process caches without eviction, request-id keyed maps, retained event listeners
- **Rollback first**, debug after
- [5] postmortem | Postmortem: payment-svc DB connection pool exhaustion (2026-03-12) | 13%
# Postmortem: payment-svc DB connection pool exhaustion (2026-03-12)

**Severity:** SEV1
**Duration:** 18 minutes (14:02 to 14:20 UTC)
**Author:** Yan (on-call) · Reviewed by: payments-platform team

## Summary
A nightly settlement batch job introduced in payment-svc v2.41 began running every 30 seconds (intended: nightly), each invocation holding 8 database connections for ~2.4s while performing a full-table scan on `ledger_entries`. Within 3 minutes, the 500-connection Postgres pool was fully consumed, starving all foreground checkout traffic. 12% of checkouts failed during the window. Estimated revenue impact: $87k.

## Timeline (UTC)
- **13:50**: payment-svc v2.41 deployed (rolling, 100% by 13:54). Changelog mentioned "added nightly settlement batch".
- **14:02:08**: first alert: `PaymentSvcErrorRateHigh` (3% errors).
- **14:03:15**: CS reports failed checkout volume rising.
- **14:04**: On-call paged.
- **14:06**: Initial mis-diagnosis: "DB looks healthy, CPU is fine". Looked at CPU only, missed `active_connections` saturation.
- **14:11**: Correct diagnosis: connection pool exhausted (active_connections = 500/500).
- **14:14**: Identified batch job via `pg_stat_activity` showing recurring `SELECT * FROM ledger_entries WHERE status='pending'`.
- **14:16**: Rolled back to v2.40.
- **14:20**: Error rate returned to baseline.
Similar past incidents
lexical match (pg_trgm)
- 39%
[Scenario] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 38%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 38%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 30%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 29%
[Eval][v2][zh] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts