
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc · created: 6/2/2026, 11:55:32 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

Checkout success rate collapsed to 22% due to Stripe Connect API latency spike (p99 27s) exhausting the shared worker thread pool. All 200 threads blocked waiting on Stripe responses, causing 78% of requests to hit the 30s gateway timeout. Estimated $180k GMV lost in 5 minutes.

Severity reasoning: User-facing outage with direct revenue impact (estimated $180k GMV lost in 5 minutes). Roughly 78% of checkout attempts are failing, which meets SEV1 criteria.
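The thread pool exhaustion described in the summary can be checked with a back-of-envelope calculation. The sketch below applies Little's law to the numbers from the incident context; it assumes every request holds one worker thread for the full Stripe call and uses p99 as a stand-in for the typical hold time, so both figures are rough ceilings rather than measured throughput.

```python
def max_throughput(threads: int, seconds_per_call: float) -> float:
    """Upper bound on requests/second when every request holds one thread
    for seconds_per_call (Little's law: concurrency = rate * latency)."""
    return threads / seconds_per_call


WORKER_THREADS = 200    # from the incident context
NORMAL_P99_S = 0.8      # Stripe p99 before the incident
DEGRADED_P99_S = 27.0   # Stripe p99 during the incident

normal_ceiling = max_throughput(WORKER_THREADS, NORMAL_P99_S)
degraded_ceiling = max_throughput(WORKER_THREADS, DEGRADED_P99_S)

print(f"normal ceiling:   {normal_ceiling:.1f} req/s")
print(f"degraded ceiling: {degraded_ceiling:.1f} req/s")
print(f"capacity lost:    {1 - degraded_ceiling / normal_ceiling:.0%}")
```

Under these assumptions the pool's ceiling falls from about 250 req/s to about 7.4 req/s, which is consistent with the queue filling up and most requests hitting the 30s gateway timeout.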

deepseek-chat·prompt v1·output: en·9737ms·1370↑ / 1504↓ tok·$0.00202

Root cause hypotheses

  • high: Stripe Connect API latency spike in us-east-1 causing downstream timeouts

    Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC

  • high: No circuit breaker on Stripe client, allowing all threads to block

    Evidence: Config shows no circuit breaker on the Stripe client, and the Stripe call timeout (30s) is no shorter than the inbound timeout; thread pool at 100% utilization with all 200 worker threads blocked

  • high: No bulkhead isolation, so Stripe latency exhausts the shared thread pool

    Evidence: Stripe calls share the main worker thread pool, so slow Stripe calls hold all 200 threads; work that depends only on the healthy dependencies (auth-svc, fraud-svc) is starved as well
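The bulkhead hypothesis above is easier to reason about with a concrete shape in mind. The following is a minimal sketch of a semaphore-based bulkhead, not checkout-svc's actual code: the class names and the limit are hypothetical. It caps how many callers may be inside the Stripe client at once and rejects the rest immediately instead of letting them occupy worker threads.

```python
import threading


class BulkheadFull(RuntimeError):
    """Raised when the dependency's concurrency limit is already in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency (hypothetical helper)."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast rather than queue: a waiting caller would still pin a thread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("dependency concurrency limit reached")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Illustrative limit only: at most 50 of the 200 worker threads may wait on Stripe.
stripe_bulkhead = Bulkhead(max_concurrent=50)
```

With a cap like this, a Stripe slowdown can block at most the configured number of threads; the remainder stay free for requests that touch only auth-svc or fraud-svc. The right limit depends on normal Stripe concurrency, which the incident data does not state.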

Investigation checklist

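  1. Confirm Stripe is the bottleneck from checkout-svc client metrics (cross-check against APM traces)
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:3000/metrics | grep -E 'http_client_requests_seconds_(sum|count)\{.*downstream="stripe"'

    Expected: _sum and _count are cumulative counters, so run the command twice and divide the increase in _sum by the increase in _count; a result above 20s per call confirms the Stripe slowdown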

  2. Check thread pool metrics to confirm exhaustion
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:3000/metrics | grep 'thread_pool_active_threads'

    Expected: Value equal to max threads (200) indicating full utilization

  3. Verify Stripe status page for ongoing incident
    curl -s https://status.stripe.com/api/v1/incidents | jq '.incidents[] | select(.status != "resolved")'

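    Expected: Active incident for Connect endpoints in us-east-1. The API path above is unverified; if it returns nothing useful, check https://status.stripe.com directly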

  4. Check if other downstream dependencies are affected
    kubectl exec -n prod deploy/checkout-svc -- curl -s localhost:3000/metrics | grep -E 'http_client_requests_seconds_(sum|count)\{.*downstream="(auth-svc|fraud-svc)"'

    Expected: Normal per-call latency (<100ms, computed as increase in _sum divided by increase in _count between two scrapes) for auth-svc and fraud-svc
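Because the _sum and _count series are cumulative, reading latency from them takes two scrapes and a subtraction. The sketch below shows one way to do that in Python; it assumes the Prometheus text exposition format and the metric and label names used in the checklist commands (http_client_requests_seconds, downstream), which may differ from what checkout-svc actually exports. The function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# name{labels} value   (labels are optional in the text exposition format)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)


def totals(metrics_text: str, downstream: str,
           base: str = "http_client_requests_seconds") -> Tuple[float, float]:
    """Return (sum_seconds, call_count) across all series for one downstream."""
    wanted = f'downstream="{downstream}"'
    total, count = 0.0, 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = SAMPLE_RE.match(line)
        if not m or wanted not in (m.group("labels") or ""):
            continue
        if m.group("name") == base + "_sum":
            total += float(m.group("value"))
        elif m.group("name") == base + "_count":
            count += float(m.group("value"))
    return total, count


def interval_mean(before: str, after: str, downstream: str) -> Optional[float]:
    """Mean seconds per call between two scrapes, or None if no new calls."""
    s0, c0 = totals(before, downstream)
    s1, c1 = totals(after, downstream)
    return (s1 - s0) / (c1 - c0) if c1 > c0 else None
```

Feeding it two captures of the /metrics output taken a minute apart gives the mean Stripe call time over that minute, which is a more direct signal than eyeballing the raw counters.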

Mitigation plan

  • Disable the Stripe payment path by setting feature flag 'stripe_payments_enabled' to false, then fall back to an alternative payment provider or show a maintenance page

    Risk: Customers cannot pay via Stripe; alternative provider may have capacity issues. No data loss.

    Rollback: Set feature flag back to true once Stripe incident resolved

  • If feature flag not available, scale up checkout-svc replicas to increase thread pool capacity temporarily

    Risk: Increased load on downstream services; unlikely to help for long while Stripe latency persists, because the new threads block on Stripe just as the existing ones do. No destructive operations.

    Rollback: Scale down replicas after Stripe recovery

  • Add circuit breaker and bulkhead for Stripe client in code (long-term fix, not immediate)

    Risk: Requires code change and deployment; not for immediate mitigation.

    Rollback: Revert code change if issues
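For the long-term item, a circuit breaker could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not a proposal for the actual Stripe client: the class name, thresholds, and the injected clock are hypothetical, the breaker is not thread-safe, and in the half-open state it does not limit the number of trial calls.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast, dependency marked unhealthy")
            # Otherwise half-open: let this call through as a trial.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A breaker only helps if slow calls count as failures, so it needs to be paired with a Stripe call timeout well below the 30s inbound timeout; otherwise each thread still waits the full 30s before the breaker sees anything.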

Customer impact

Approximately 78% of checkout attempts are failing with HTTP 504 errors. Estimated 3000 failed checkouts in 5 minutes, resulting in ~$180k lost GMV. Customers cannot complete purchases using Stripe. No ETA for resolution; Stripe is investigating.
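The impact figures above imply a few rates that are useful when weighing "wait it out" against disabling the Stripe path. The sketch below derives them from the reported numbers only; it assumes failures were spread evenly over the 5-minute window and that the 78% failure ratio held throughout.

```python
FAILED_CHECKOUTS = 3000   # reported failures
WINDOW_SECONDS = 5 * 60   # reported window
FAILURE_RATIO = 0.78      # share of attempts returning HTTP 504
LOST_GMV_USD = 180_000    # estimated lost GMV so far

failures_per_second = FAILED_CHECKOUTS / WINDOW_SECONDS
attempts_per_second = failures_per_second / FAILURE_RATIO
avg_order_value_usd = LOST_GMV_USD / FAILED_CHECKOUTS
loss_per_minute_usd = LOST_GMV_USD / (WINDOW_SECONDS / 60)

print(f"failures:  {failures_per_second:.1f}/s")
print(f"attempts:  {attempts_per_second:.1f}/s")
print(f"avg order: ${avg_order_value_usd:.0f}")
print(f"loss rate: ${loss_per_minute_usd:,.0f}/min")
```

On these assumptions the incident costs roughly $36k of GMV per minute, at about $60 per failed checkout.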

Postmortem draft

Postmortem: checkout-svc Outage ([date])

Summary

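A Stripe Connect API latency spike (p99 from 800ms to 27s) exhausted checkout-svc's shared pool of 200 worker threads. Checkout success rate fell from 99.5% to 22%, with ~78% of checkout attempts returning HTTP 504 at the 30s gateway timeout.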

Timeline

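  • 18:35 UTC: Stripe status page reports elevated latency for Connect endpoints in us-east-1
  • 18:38 UTC: On-call paged
  • 18:40 UTC: checkout-svc returns HTTP 504 to ~78% of checkout attempts; success rate drops to 22%
  • 18:41 UTC: Stripe confirmed as culprit (status page + APM)
  • 18:43 UTC: On-call debating whether to wait it out or disable the Stripe path entirely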
  • [Mitigation time]: Feature flag disabled Stripe path
  • [Recovery time]: Success rate returned to normal

Impact

  • Failed checkouts: ~3000
  • Lost GMV: ~$180k
  • Duration: [X] minutes

Root Cause

Stripe Connect API latency spike combined with lack of circuit breaker and bulkhead isolation caused thread pool exhaustion.

What Went Well

  • Quick detection via metrics and APM
  • Stripe status page provided external confirmation

What Went Poorly

  • No circuit breaker on Stripe client
  • No bulkhead isolation for downstream calls
  • No confirmed feature flag to disable the Stripe path quickly (to be verified)

Action Items

  • [ ] Add circuit breaker to Stripe client (P0)
  • [ ] Implement bulkhead isolation for downstream calls (P1)
  • [ ] Create feature flag to disable payment providers (P1)
  • [ ] Review timeout configurations (P2)

Follow-ups

  • P0 · Add circuit breaker to Stripe HTTP client with timeout and failure threshold · service owner
  • P1 · Implement bulkhead isolation for downstream calls (separate thread pool for each dependency) · service owner
  • P1 · Create feature flag to disable Stripe payment path without deployment · platform team
  • P2 · Review and reduce Stripe call timeout from 30s to 5s with retry · service owner
  • P2 · Add alert on thread pool utilization >80% · on-call SRE
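The "5s with retry" follow-up is worth sketching, because retries without an overall budget can add up to a longer wait than the inbound 30s timeout. The example below is one possible shape, with a hypothetical helper name and illustrative defaults: each attempt gets a short timeout, and the attempts together may not exceed a total budget.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_budget(fn: Callable[[float], T],
                     per_try_timeout: float = 5.0,
                     total_budget: float = 12.0,
                     max_attempts: int = 2,
                     clock=time.monotonic) -> T:
    """Call fn(timeout_seconds), retrying on TimeoutError within a total budget."""
    deadline = clock() + total_budget
    last_error = None
    for _ in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget spent, do not start another attempt
        try:
            # Never give a single attempt more time than the budget has left.
            return fn(min(per_try_timeout, remaining))
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError("dependency call exceeded its time budget") from last_error
```

Here fn stands in for whatever performs the Stripe request and is expected to raise TimeoutError when its timeout elapses. Retrying a payment request is only safe if the request is idempotent; Stripe's API supports this through an Idempotency-Key header, but whether checkout-svc already sends one is not stated in the incident data.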