
[datadog] checkout p99 8s

service: checkout · created: 6/3/2026, 12:13:38 AM

Raw incident context

Datadog priority: P2
Env: prod
Tags: service:checkout, env:prod

p99 latency spiked 200ms→8s at 14:02 UTC, no recent deploy, redis_active_conns flatlined at pool max

Summary

SEV1

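Checkout p99 latency spiked from 200ms to 8s at 14:02 UTC while redis_active_conns flatlined at pool max, indicating Redis connection pool exhaustion in checkout-svc. The leading hypothesis is that a payment-svc batch job is exhausting the shared Postgres connection pool, which would leave checkout-svc requests hanging on payment-svc while they hold Redis connections. There was no recent deploy to checkout-svc, but the batch job introduced in payment-svc v2.41 caused a SEV1 on 2026-03-12 through connection pool exhaustion. The latency spike is user-facing and exceeds the checkout-svc SLO of 500ms p99.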

Severity reasoning: User-facing degradation: checkout p99 latency is 8s (16x the 500ms SLO). The error rate has not been measured yet but is likely >1%, given that the 2026-03-12 incident opened at 3% errors and 12% of checkouts failed. Checkout failures break the revenue path. This matches the SEV1 rubric: user-facing outage, error rate >1% for >5min, revenue path broken.
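
The SLO multiple quoted in the severity reasoning is plain arithmetic on the observed p99 and the SLO target. A minimal sketch follows; the function name `slo_multiple` is illustrative and not part of any tool referenced in this report.

```bash
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): how many times over the SLO
# is the observed p99? Both arguments are in milliseconds; the result
# uses integer division.
slo_multiple() {
  local observed_ms=$1 slo_ms=$2
  echo $(( observed_ms / slo_ms ))
}

# Values from this incident: p99 of 8s against the 500ms checkout-svc SLO.
slo_multiple 8000 500
```

With the incident's numbers this prints 16, the multiple cited above.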

deepseek-chat·prompt v2·output: en·11700ms·3569↑ / 1609↓ tok·$0.00273

Root cause hypotheses

  • high: Payment-svc batch job (v2.41) exhausting Postgres connection pool, causing payment-svc to hang and block checkout-svc threads

    Evidence: Datadog shows redis_active_conns flatlined at pool max, which fits checkout-svc threads holding Redis connections while blocked on payment-svc. In the previous incident (2026-03-12), a batch job consumed all 500 Postgres connections and starved checkout traffic. No recent deploy to checkout-svc; whether payment-svc was redeployed is unconfirmed.

  • medium: Redis connection pool exhaustion due to a slow downstream call (e.g., Stripe) causing checkout-svc to hold Redis connections longer

    Evidence: redis_active_conns at pool max; Stripe timeout is 30s and no circuit breaker on checkout-svc (as per service catalog). Could cause connection accumulation.

  • low: Checkout-svc thread pool exhaustion from a traffic spike or slow external dependency

    Evidence: p99 latency spike to 8s; thread pool size is 200. Could be overwhelmed by increased traffic or slow responses from payment-svc or inventory-svc.

Investigation checklist

  1. Check payment-svc error logs for connection refused or batch job activity
    kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE "ERROR|FATAL|too many clients|batch" | head -50

    Expected: If batch job is culprit, see 'FATAL: sorry, too many clients already' or batch job log entries.

  2. Check Postgres active connections and identify long-running queries
    kubectl exec -n prod postgres-primary-0 -- psql -c "SELECT pid, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '1 minute' ORDER BY query_start;"

    Expected: If batch job, see recurring "SELECT * FROM ledger_entries WHERE status='pending'" queries.

  3. Check Redis connection pool usage from checkout-svc
    kubectl exec -n prod deployment/checkout-svc -- redis-cli -h redis-checkout info clients | grep connected_clients

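    Expected: If pool exhausted, connected_clients is near the configured Redis pool maximum. That maximum is not given in the service catalog; the 80/instance figure there is payment-svc's Postgres pool size.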

  4. Check checkout-svc thread pool status via metrics or logs
    kubectl logs -n prod -l app=checkout-svc --since=15m | grep -i "thread pool" | head -10

    Expected: If thread pool exhausted, see 'thread pool full' or 'rejected execution'.

  5. Check recent deploys to payment-svc in last 2 hours
    kubectl rollout history deployment/payment-svc -n prod | tail -5

    Expected: If a recent deploy (e.g., v2.41) is present, it may have introduced the batch job.
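
The Redis check in step 3 can be turned into a usage percentage. The sketch below assumes the pool maximum is supplied by the operator, since this report does not state it, and the function name `redis_pool_usage` is illustrative. Note that `connected_clients` counts every client connected to the Redis server, not only checkout-svc.

```bash
#!/usr/bin/env bash
# Reads `redis-cli info clients` output on stdin and prints usage against
# a maximum passed as $1. The maximum is an operator-supplied assumption.
redis_pool_usage() {
  local max=$1 connected
  # INFO lines look like "connected_clients:42" and end in CRLF, so strip \r.
  connected=$(tr -d '\r' | awk -F: '$1 == "connected_clients" { print $2 }')
  if [ -z "$connected" ]; then
    echo "connected_clients not found" >&2
    return 2
  fi
  echo "connected_clients=${connected} max=${max} usage=$(( connected * 100 / max ))%"
}

# Example with canned input; in practice, pipe the step 3 command into it.
printf 'connected_clients:95\r\nblocked_clients:0\r\n' | redis_pool_usage 100
```

A usage value that stays near 100% across repeated runs would support the pool exhaustion hypotheses; the canned values here are made up for illustration.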

Mitigation plan

  • Kill long-running queries in Postgres to free connections immediately

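    Risk: May abort legitimate transactions. payment-svc endpoints are idempotent and safe to retry, but postgres-primary.payments is shared with subscription-svc and refund-svc, so terminate only queries identified as belonging to the batch job. Blast radius: only the terminated queries.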

    Rollback: No rollback needed; killed queries can be retried.

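  • Roll back payment-svc to previous version (kubectl rollout undo deployment/payment-svc -n prod) if batch job is confirmed. Per the runbook, kill the long-running queries first and do not restart payment-svc pods.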

    Risk: Rollback takes ~2 minutes; during that time, connections may remain exhausted. Blast radius: payment-svc only.

    Rollback: Re-deploy the current version if rollback causes issues.

  • Increase Redis connection pool size temporarily if pool exhaustion is confirmed

    Risk: May increase load on Redis; monitor memory. Blast radius: Redis instance.

    Rollback: Revert pool size to original value.
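
For the first mitigation, the runbook names `pg_terminate_backend(pid)` and the `pg_stat_activity` filter. The sketch below only prints a candidate command for review rather than running it. The pod name and filter are taken from the runbook; the function name `print_kill_command` and the `LIKE` pattern argument are illustrative assumptions, and the pattern must not contain single quotes.

```bash
#!/usr/bin/env bash
# Prints (does not run) a command that terminates non-idle backends whose
# query started more than one minute ago and whose text matches a pattern.
# pg_backend_pid() excludes the session issuing the command itself.
print_kill_command() {
  local pattern=$1
  local sql="SELECT pid, pg_terminate_backend(pid) FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '1 minute' AND query LIKE '${pattern}' AND pid <> pg_backend_pid();"
  printf 'kubectl exec -n prod postgres-primary-0 -- psql -c "%s"\n' "$sql"
}

# Review the printed command before pasting it into a shell.
print_kill_command '%ledger_entries%'
```

Printing first keeps a human in the loop, which matters because the database is shared with other services.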

Customer impact

Customers are experiencing checkout delays of up to 8 seconds, and some checkout attempts may fail. This affects all users attempting to complete purchases. The failure rate has not been measured yet; for reference, 12% of checkouts failed during the similar 2026-03-12 incident. No ETA yet.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 14:02 — p99 latency spike to 8s, alert triggered
  • [FILL IN] — On-call paged
  • [FILL IN] — Diagnosis: [FILL IN]
  • [FILL IN] — Mitigation applied
  • [FILL IN] — Latency returned to baseline

Impact

  • p99 latency 8s for checkout (SLO: 500ms)
  • Estimated error rate: [FILL IN]%
  • Affected users: all checkout traffic

Root Cause

[FILL IN]

Detection

Datadog alert on p99 latency spike

Response

[FILL IN]

What Went Well

[FILL IN]

What Went Poorly

[FILL IN]

Action Items

  • [FILL IN]

Follow-ups

  • P0: Add circuit breaker to checkout-svc for downstream calls (Stripe, payment-svc) · storefront team
  • P0: Fix payment-svc batch job to run nightly instead of every 30 seconds · payments-platform team
  • P1: Add monitoring for Postgres active connections per service and alert on pool >80% · SRE team
  • P1: Review and enforce runbook for rollback-first approach on payment-svc incidents · payments-platform team
  • P2: Add Redis connection pool monitoring and alerting for checkout-svc · storefront team
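
The per-service Postgres follow-up could start from the runbook's "Active DB connections by app" query. The sketch below assumes the query is run with `psql -At`, which emits unaligned `application_name|count` rows. The 500 limit comes from the runbook and the 80% threshold from the follow-up above; the function name `pool_report` and the sample rows are illustrative.

```bash
#!/usr/bin/env bash
# Reads "application_name|count" rows on stdin, prints each application's
# share of max_connections, then flags total usage above the threshold.
MAX_CONNS=500      # max_connections from the payment-svc runbook
THRESHOLD_PCT=80   # alert threshold from the follow-up item

pool_report() {
  awk -F'|' -v max="$MAX_CONNS" -v thr="$THRESHOLD_PCT" '
    NF == 2 { total += $2; printf "%s %d%%\n", $1, $2 * 100 / max }
    END {
      pct = total * 100 / max
      printf "total %d%% %s\n", pct, (pct > thr ? "ALERT" : "ok")
    }'
}

# Canned rows for illustration; real input would come from psql -At.
printf 'payment-svc|410\nrefund-svc|30\n' | pool_report
```

A production alert would more likely live in the monitoring system than in a script; this only shows the threshold logic.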

Internal docs used by the AI

  • [1] service · Service catalog (extract) · 14%
    # Service catalog (extract)
    
    ## payment-svc
    - **Team:** payments-platform
    - **Tier:** SEV1 (revenue-critical)
    - **Upstream:** checkout-svc, retry-orchestrator
    - **Downstream:** Stripe Connect (us-east-1), fraud-svc, audit-log
    - **DB:** postgres-primary.payments (shared with subscription-svc, refund-svc)
    - **Region:** us-east-1 primary, us-west-2 warm replica
    - **Notes:** All endpoints idempotent. Safe to retry. Connection pool 80/instance.
    
    ## checkout-svc
    - **Team:** storefront
    - **Tier:** SEV1
    - **Upstream:** web-frontend, mobile-api
    - **Downstream:** payment-svc, inventory-svc, fraud-svc, Stripe Connect (direct, for some flows)
    - **DB:** postgres-storefront (dedicated)
    - **Region:** us-east-1, us-west-2 (active-active)
    - **Notes:** Stripe timeout is 30s. No circuit breaker as of 2026-Q1 (planned for Q2). Thread pool size 200.
    
    ## order-svc
    - **Team:** storefront
    - **Tier:** SEV2 (order placement requires this but read-only views can degrade)
    - **Upstream:** checkout-svc, mobile-api
    - **Downstream:** inventory-svc, notification-svc
    - **DB:** postgres-orders
    - **Region:** us-east-1, us-west-2
    - **Notes:** Memory limit 512Mi. Watch for unbounded in-process caches — has bitten us twice.
  • [2] service · Service catalog (extract) · 14%
    ## catalog-svc
    - **Team:** storefront
    - **Tier:** SEV2 (catalog is read-heavy, cached aggressively)
    - **Upstream:** web-frontend, mobile-api
    - **Downstream:** postgres-catalog, Redis cache cluster `cache-catalog`
    - **Region:** us-east-1, us-west-2
    - **Notes:** Cache pre-warmed nightly at 02:00 UTC, TTL 7h. **Known issue:** cache stampede when TTL expires at peak; mitigation via singleflight is planned (ticket SRE-2014). Add jitter to TTL as workaround.
    
    ## api-gateway
    - **Team:** platform
    - **Tier:** SEV1
    - **Upstream:** internet (via CloudFront)
    - **Downstream:** all services
    - **Region:** all regions
    - **Notes:** nginx upstream timeout 60s. DNS TTL for internal CNAMEs is 30s (was 300s before 2025-Q4 — be aware of cached IPs across pods).
    
    ## SLOs
    | Service | Availability | Latency p99 |
    |---|---|---|
    | payment-svc | 99.95% | 300ms |
    | checkout-svc | 99.95% | 500ms |
    | order-svc | 99.9% | 1s |
    | catalog-svc | 99.95% | 200ms (cached) |
    | api-gateway | 99.99% | 50ms (passthrough) |
    
    ## On-call escalation
    1. Service team (PagerDuty)
    2. SRE on-call (15 min if no ack)
    3. Engineering manager (30 min if no resolution)
    4. VP Eng (60 min, SEV1 only)
  • [3] runbook · Runbook: payment-svc · 13%
    ## SLO
    - Availability: 99.95% (allows ~22min/month downtime)
    - p99 latency: < 300ms (excluding Stripe call time)
    - Error rate: < 0.1%
    
    ## Severity policy (overrides generic SEV rubric)
    - Payment failure rate > 0.5% sustained 3min → **SEV1** (revenue impact)
    - p99 > 1s for 10min → **SEV2**
    - Single pod restart → not paged
    
    ## Useful commands
    ```bash
    # Recent error breakdown
    kubectl logs -n prod -l app=payment-svc --since=15m | grep -iE "ERROR|FATAL" | awk '{print $NF}' | sort | uniq -c | sort -rn | head
    
    # Active DB connections by app
    kubectl exec -n prod postgres-primary-0 -- psql -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY 1 ORDER BY 2 DESC;"
    
    # Force a rollback
    kubectl rollout undo deployment/payment-svc -n prod
    kubectl rollout status deployment/payment-svc -n prod
    ```
    
    ## Past incidents (most recent)
    - 2026-03-12: SEV1, batch job v2.41 exhausted connection pool, 18min impact
    - 2025-11-04: SEV2, slow Stripe response cascaded into thread exhaustion (fixed by adding circuit breaker)
    - 2025-08-19: SEV1, OOM crashloop after upgrading json parser (in-process cache leak)
  • [4] runbook · Runbook: payment-svc · 13%
    # Runbook: payment-svc
    
    ## Owner
    Team: payments-platform
    Slack: #payments-oncall
    PagerDuty: payments-svc-primary
    
    ## What it does
    `payment-svc` processes checkout transactions. Sits between `checkout-svc` (upstream) and Stripe Connect (downstream). All requests are idempotent — safe to retry.
    
    ## Architecture quick facts
    - Runs as Kubernetes deployment `payment-svc` in `prod` namespace
    - 12 replicas, HPA min=8 max=30, target CPU 70%
    - Memory limit 512Mi, request 256Mi
    - Connects to `postgres-primary.payments` (max_connections=500 shared with 4 other services)
    - Connection pool: pgbouncer in transaction mode, pool_size=80 per app instance
    
    ## Common failure modes
    
    ### "FATAL: sorry, too many clients already" + p99 spike
    - **Almost always** a runaway batch job holding connections during a long query
    - Check recent deploys (last 2h) for new cron jobs or batch operations
    - Query: `SELECT pid, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '1 minute' ORDER BY query_start;`
    - **Mitigation**: kill the long-running query (`pg_terminate_backend(pid)`), THEN roll back the deploy
    - **Do NOT** restart payment-svc pods — they'll thrash trying to reconnect to a saturated pool
    
    ### OOMKilled pods after deploy
    - Memory profile must be flat under steady traffic
    - If memory grows monotonically, check for: in-process caches without eviction, request-id keyed maps, retained event listeners
    - **Rollback first**, debug after
  • [5] postmortem · Postmortem: payment-svc DB connection pool exhaustion — 2026-03-12 · 13%
    # Postmortem: payment-svc DB connection pool exhaustion — 2026-03-12
    
    **Severity:** SEV1
    **Duration:** 18 minutes (14:02 – 14:20 UTC)
    **Author:** Yan (on-call) · Reviewed by: payments-platform team
    
    ## Summary
    A nightly settlement batch job introduced in payment-svc v2.41 began running every 30 seconds (intended: nightly), each invocation holding 8 database connections for ~2.4s while performing a full-table scan on `ledger_entries`. Within 3 minutes, the 500-connection Postgres pool was fully consumed, starving all foreground checkout traffic. 12% of checkouts failed during the window. Estimated revenue impact: $87k.
    
    ## Timeline (UTC)
    - **13:50** — payment-svc v2.41 deployed (rolling, 100% by 13:54). Changelog mentioned "added nightly settlement batch".
    - **14:02:08** — first alert: `PaymentSvcErrorRateHigh` (3% errors).
    - **14:03:15** — CS reports failed checkout volume rising.
    - **14:04** — On-call paged.
    - **14:06** — Initial mis-diagnosis: "DB looks healthy, CPU is fine" — looked at CPU only, missed `active_connections` saturation.
    - **14:11** — Correct diagnosis: connection pool exhausted (active_connections = 500/500).
    - **14:14** — Identified batch job via `pg_stat_activity` showing recurring `SELECT * FROM ledger_entries WHERE status='pending'`.
    - **14:16** — Rolled back to v2.40.
    - **14:20** — Error rate returned to baseline.