
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc · created: 5/25/2026, 10:22:02 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

checkout-svc is failing ~78% of checkout attempts with HTTP 504 because of a Stripe Connect API latency spike (p99 27s vs. an 800ms baseline). All 200 worker threads are blocked waiting on Stripe responses, which exhausts the thread pool and pushes requests into the 30s gateway timeout. Stripe has acknowledged elevated latency on its status page. An estimated $180k GMV was lost in 5 minutes.

Severity reasoning: User-facing outage: checkout success rate dropped from 99.5% to 22%, well past the >1% error rate for >5 min threshold. Revenue path broken: an estimated $180k GMV lost so far. Broad unavailability: checkout is effectively down for most attempts.

deepseek-chat · prompt v2 · output: en · 14767ms

Root cause hypotheses

  • high: Stripe Connect API latency spike causing thread pool exhaustion

    Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; thread pool utilization 100%; Stripe status page confirms elevated latency in us-east-1 since 18:35 UTC.

  • high: No circuit breaker on Stripe client causing cascading failures

    Evidence: Config review shows no circuit breaker or bulkhead for Stripe calls; all threads block waiting on Stripe, leading to queue buildup (4200 requests).

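  • medium: Stripe call timeout equals the inbound timeout, leaving no headroom; retries (if any) could amplify load

    Evidence: Inbound gateway timeout and Stripe call timeout are both 30s, so a single slow Stripe call consumes the entire request budget. No client-side retry policy is visible in the provided config, but if retries exist, they could amplify load.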

  • low: Resource exhaustion on checkout-svc pods (CPU/memory)

    Evidence: Thread pool exhaustion is the primary symptom, but CPU/memory metrics not provided; could be secondary factor.

  • low: Network issue between checkout-svc and Stripe

    Evidence: Stripe's status page attributes the latency to their side; a network path fault is unlikely given that the other downstream services are normal.
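The first hypothesis can be sanity-checked with Little's law. The sketch below is a rough estimate, assuming each checkout holds one worker thread for the duration of its Stripe call and using the reported p99 values as the hold time (a pessimistic stand-in, since the mean is not given). The function name is illustrative.

```python
# Rough capacity estimate for a fixed-size worker pool (Little's law: L = lambda * W).
# Assumption: each request occupies one thread for the full Stripe call, and the
# reported p99 latencies are used as the per-request hold time.

WORKER_THREADS = 200  # from the incident context


def max_throughput(threads: int, seconds_per_request: float) -> float:
    """Upper bound on requests per second the pool can complete."""
    return threads / seconds_per_request


baseline_rps = max_throughput(WORKER_THREADS, 0.8)   # Stripe p99 at baseline
degraded_rps = max_throughput(WORKER_THREADS, 27.0)  # Stripe p99 during the incident

print(f"baseline ceiling: {baseline_rps:.0f} req/s")  # 250
print(f"degraded ceiling: {degraded_rps:.1f} req/s")  # 7.4
```

Under these assumptions the pool's ceiling falls from about 250 req/s to about 7 req/s, so any inbound rate above that fills the queue regardless of CPU or memory headroom.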

Investigation checklist

  1. Check Stripe status page for updates
    curl -s https://status.stripe.com/api/v2/status.json | jq '.status.description'

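    Expected: Assuming the status page exposes a Statuspage-style JSON endpoint, this field returns the overall status description (degraded while the incident is open, operational once resolved). The incident text itself ('Investigating elevated latency for Connect endpoints in us-east-1') is on the status page.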

  2. Verify thread pool exhaustion in checkout-svc logs
    kubectl logs -n prod -l app=checkout-svc --since=20m --tail=-1 | grep -i 'thread pool' | tail -20

    Expected: Lines indicating all threads busy, queue full, or rejected tasks.

  3. Check APM for Stripe dependency latency breakdown
    curl -s 'http://apm-dashboard:8080/api/v1/services/checkout-svc/dependencies/stripe?time=18:30-18:50' | jq '.p99_latency_ms'

    Expected: p99 latency around 27000ms (27s) vs baseline 800ms.

  4. Inspect Stripe client configuration for circuit breaker and timeout
    kubectl exec -n prod deploy/checkout-svc -- cat /app/config/stripe.yaml | grep -E 'timeout|circuit|bulkhead|retry'

    Expected: No circuit breaker or bulkhead settings; timeout likely 30s.

  5. Check inbound queue depth and request rate
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:8080/metrics | grep -E 'queue_depth|request_rate'

    Expected: queue_depth near 4200, request_rate stable or increasing.

  6. Verify other downstream services are healthy
    kubectl logs -n prod -l app=auth-svc --since=10m --tail=-1 | grep -ci 'error'; kubectl logs -n prod -l app=fraud-svc --since=10m --tail=-1 | grep -ci 'error'

    Expected: Low error counts (0-5) indicating auth and fraud are normal.

  7. Check for Stripe API key or rate limit issues
    kubectl logs -n prod -l app=checkout-svc --since=20m --tail=-1 | grep -iE 'rate limit|429|too many requests|401|unauthorized' | head -10

    Expected: No rate limit or authentication errors; the Stripe issue is latency, not throttling.

Mitigation plan

  • Enable circuit breaker on Stripe client to fail fast and shed load

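    Risk: While the circuit is open, Stripe-backed checkouts fail immediately instead of after a ~30s wait, so users still see failures, but worker threads are freed and the service stays responsive. Better than a total outage and safer than disabling Stripe entirely.

    Rollback: Disable the circuit breaker through its config (or remove the config) and restart pods.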

  • Increase thread pool size temporarily to handle queued requests

    Risk: Additional threads will also block on Stripe while its latency stays near 27s, so the gain is limited. A larger pool also increases CPU/memory pressure on pods and could cause OOM. Monitor closely.

    Rollback: Revert thread pool size to original value and restart pods.

  • Implement bulkhead isolation for Stripe calls to prevent thread pool exhaustion

    Risk: Requires code change and deployment; not immediate. For now, use circuit breaker as faster mitigation.

    Rollback: Remove bulkhead config and restart.

  • Reduce Stripe call timeout to 5s to fail fast and avoid queue buildup

    Risk: Requests that Stripe would have completed in more than 5s will fail; 5s is still roughly 6x the 800ms baseline p99, so it is reasonable for payment intents. Safer than 30s.

    Rollback: Revert timeout to 30s and restart pods.
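The circuit breaker in the first mitigation can be sketched as follows. This is a minimal illustration, not the service's actual client: the class name, the thresholds, and the injectable clock are assumptions, and it is not thread-safe as written (a real version would guard state with a lock and limit half-open probes to one at a time).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open after N consecutive failures,
    half-open after a cool-down, closed again on the first success."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            # Fail fast: no worker thread is parked waiting on the dependency.
            raise CircuitOpenError("dependency circuit is open")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed half-open probe, or too many failures while closed, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap the Stripe request, for example `breaker.call(lambda: create_payment_intent(order))` where `create_payment_intent` is a placeholder, and map `CircuitOpenError` to an immediate "payment temporarily unavailable" response. Pairing the breaker with a short call timeout matters, because a timeout is what turns a slow call into a countable failure.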

Customer impact

Approximately 78% of checkout attempts are failing with HTTP 504 errors. Users see a timeout or error page after waiting up to 30 seconds. An estimated 3000 checkouts failed in 5 minutes, for an estimated $180k in lost GMV so far. No data loss is indicated in the provided context; whether any timed-out requests were nevertheless processed by Stripe is not confirmed and should be reconciled. ETA depends on Stripe recovery.

Postmortem draft

Summary

[FILL IN: 2-3 sentence summary of incident]

Timeline (UTC)

  • 18:35 - Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1"
  • 18:38 - Pager triggered for checkout-svc high error rate
  • 18:40 - checkout-svc observed returning HTTP 504 for ~78% of checkout attempts
  • 18:41 - Confirmed Stripe is the culprit via APM and status page
  • 18:43 - Debate: wait vs disable Stripe path
  • [FILL IN: mitigation actions and resolution time]

Impact

  • Checkout success rate dropped from 99.5% to 22%
  • p99 latency 28s (30s timeout)
  • ~3000 failed checkouts, $180k lost GMV
  • Duration: [FILL IN: start to end time]

Root Cause

Stripe Connect API latency spike (p99 27s vs 800ms baseline) caused thread pool exhaustion in checkout-svc due to lack of circuit breaker and bulkhead isolation.
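The missing bulkhead can be illustrated with a small sketch. It assumes Stripe calls are moved to a dedicated, bounded executor and that excess calls are rejected rather than queued; the class name and the pool size are hypothetical and would need tuning against real traffic.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when the dependency's dedicated capacity is exhausted."""


class Bulkhead:
    """Hypothetical bulkhead: at most `max_concurrent` calls in flight."""

    def __init__(self, max_concurrent: int) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent,
                                        thread_name_prefix="stripe")
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def submit(self, fn: Callable[[], T]) -> "Future[T]":
        # Reject immediately instead of queueing behind slow calls.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            future = self._pool.submit(fn)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot once the call finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

With, say, `Bulkhead(max_concurrent=50)`, a Stripe slowdown can hold at most 50 threads while the remaining capacity keeps serving requests that do not need Stripe. Callers should still bound their wait, for example `future.result(timeout=5)`.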

Detection

  • Pager triggered by error rate alert (threshold >1% for 5 min)
  • Stripe status page provided external confirmation

Response

  • [FILL IN: actions taken, e.g., enabled circuit breaker, increased thread pool]
  • [FILL IN: communication with Stripe support]

What Went Well

  • Quick identification of Stripe as root cause via APM and status page
  • Team communication clear

What Went Poorly

  • No circuit breaker or bulkhead on Stripe client
  • Timeout too long (30s) causing queue buildup
  • No automated failover to alternative payment provider
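The timeout problem above comes from the outbound timeout matching the inbound one. One possible approach, sketched here under assumptions, is to derive the Stripe timeout from the time remaining on the inbound request. The 5s cap comes from the mitigation plan; the 0.5s response margin and the function name are hypothetical.

```python
import time

INBOUND_TIMEOUT_S = 30.0    # gateway timeout from the incident context
STRIPE_TIMEOUT_CAP_S = 5.0  # proposed cap from the mitigation plan
RESPONSE_MARGIN_S = 0.5     # hypothetical headroom to build and send an error response


def outbound_timeout(request_started_at: float, now: float) -> float:
    """Timeout to use for the Stripe call, given when the inbound request began.

    Raises TimeoutError when the inbound budget is already spent, so the
    caller can skip the Stripe call entirely.
    """
    remaining = INBOUND_TIMEOUT_S - (now - request_started_at) - RESPONSE_MARGIN_S
    if remaining <= 0:
        raise TimeoutError("inbound deadline exhausted; skipping Stripe call")
    return min(STRIPE_TIMEOUT_CAP_S, remaining)


# Usage: a request that has just started gets the full 5s cap.
started = time.monotonic()
print(outbound_timeout(started, started))  # 5.0
```

A request that has already spent 27s in the queue would get only 2.5s, and one past 29.5s would skip Stripe altogether instead of tying up a thread for a response the gateway will never deliver.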

Action Items

  • [FILL IN: implement circuit breaker and bulkhead for Stripe]
  • [FILL IN: reduce Stripe timeout to 5s]
  • [FILL IN: add alert on thread pool utilization >80%]
  • [FILL IN: consider fallback payment provider]
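The saturation alert in the action items can be sketched as a simple threshold check. The 80% utilization threshold is taken from the list above and the queue depth threshold of 1000 from the follow-ups; the data structure and function name are illustrative, and a real rule would live in the monitoring system rather than in application code.

```python
from dataclasses import dataclass
from typing import List

UTILIZATION_THRESHOLD = 0.80  # from the action items
QUEUE_DEPTH_THRESHOLD = 1000  # from the follow-ups


@dataclass
class PoolSample:
    busy_threads: int
    total_threads: int
    queue_depth: int


def saturation_alerts(sample: PoolSample) -> List[str]:
    """Return a message for each threshold the sample exceeds."""
    alerts = []
    utilization = sample.busy_threads / sample.total_threads
    if utilization > UTILIZATION_THRESHOLD:
        alerts.append(
            f"thread pool utilization {utilization:.0%} > {UTILIZATION_THRESHOLD:.0%}")
    if sample.queue_depth > QUEUE_DEPTH_THRESHOLD:
        alerts.append(
            f"queue depth {sample.queue_depth} > {QUEUE_DEPTH_THRESHOLD}")
    return alerts


# Values reported during this incident: both thresholds are exceeded.
incident = PoolSample(busy_threads=200, total_threads=200, queue_depth=4200)
print(saturation_alerts(incident))
```

Both conditions are leading indicators here: the pool and the queue saturate before requests start timing out at the gateway.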

Follow-ups

  • P0: Implement circuit breaker on Stripe client with configurable threshold and half-open recovery (owner: service owner)
  • P0: Add bulkhead isolation for Stripe calls to separate thread pool (owner: service owner)
  • P1: Reduce Stripe call timeout from 30s to 5s (owner: service owner)
  • P1: Add alert on thread pool utilization >80% and queue depth >1000 (owner: on-call SRE)
  • P2: Evaluate fallback payment provider for critical path (owner: platform team)
  • P1: Update incident runbook for Stripe dependency failures with circuit breaker steps (owner: on-call SRE)
  • P2: Review Stripe API retry policy and ensure exponential backoff with jitter (owner: service owner)
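For the retry policy follow-up, exponential backoff with jitter can be sketched as below. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The base delay, the cap, and the function name are assumptions for illustration, not values from the current client.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 0.2, cap_s: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay in seconds before retry number `attempt` (0-based).

    The ceiling doubles with each attempt and is capped at `cap_s`; the
    random draw spreads retries out so clients do not retry in lockstep.
    """
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Usage: delays for the first four retries (values vary from run to run).
for attempt in range(4):
    print(f"retry {attempt}: sleep {backoff_delay(attempt):.2f}s")
```

Retries should also be bounded in number and confined to the remaining request budget, and payment-creating requests should carry an idempotency key so that a retried call cannot charge twice.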