
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage

service: checkout-svc
created: 6/2/2026, 11:55:46 PM

Raw incident context

Time: 18:40 UTC. checkout-svc starts returning HTTP 504 to ~78% of checkout attempts.

Symptoms:
- p99 latency: 28s (hitting our 30s gateway timeout)
- Success rate: 22% (down from 99.5%)
- Thread pool utilization: 100% (all 200 worker threads blocked)
- Inbound queue depth: 4200 requests (queue limit 5000)

Downstream calls (from APM):
- Stripe Connect API (/v1/payment_intents): p99 jumped from 800ms to 27s
- Stripe status page: "Investigating elevated latency for Connect endpoints in us-east-1" (posted 18:35 UTC)
- All other downstream deps (auth-svc, fraud-svc) normal

Our config:
- Stripe call timeout: 30s (matches inbound)
- No circuit breaker on Stripe client
- No bulkhead — Stripe calls share the main worker thread pool

Customer impact:
- Failed checkouts: ~3000 in 5min
- Estimated lost GMV: $180k (so far)

On-call:
- 18:38 — pager
- 18:41 — confirmed Stripe is the culprit (status page + APM)
- 18:43 — debating: wait it out vs disable Stripe path entirely

Summary

SEV1

Checkout success rate dropped from 99.5% to 22% due to elevated latency from Stripe Connect API, causing thread pool exhaustion and 504 errors. Estimated $180k GMV lost in 5 minutes.

Severity reasoning: User-facing outage with direct revenue impact (an estimated $180k GMV lost so far), affecting ~78% of checkout attempts on a critical payment path. The path is severely degraded rather than fully down, since ~22% of attempts still succeed.

deepseek-chat·prompt v1·output: en·9142ms·1370↑ / 1439↓ tok·$0.00195

Root cause hypotheses

  • high: Stripe Connect API latency spike in us-east-1 causing downstream timeouts

    Evidence: APM shows Stripe /v1/payment_intents p99 jumped from 800ms to 27s; Stripe status page confirms elevated latency for Connect endpoints in us-east-1 since 18:35 UTC.

  • high: No circuit breaker or bulkhead on the Stripe client, so slow Stripe calls exhaust the shared thread pool

    Evidence: Thread pool utilization at 100% with all 200 worker threads blocked; Stripe call timeout matches inbound timeout (30s), no bulkhead isolation.

  • medium: Inbound request surge amplifying Stripe latency impact

    Evidence: Inbound queue depth 4200 (limit 5000). This is consistent with a surge, but blocked worker threads alone would also fill the queue, so the inbound request rate needs to be checked before treating this as a contributing cause.
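A rough capacity calculation helps separate the second and third hypotheses. The sketch below assumes each checkout request holds one worker thread for the full duration of its Stripe call and uses the reported p99 values as the hold time, which is a worst-case simplification. The function name is illustrative only.

```python
def max_throughput_rps(worker_threads: int, seconds_per_call: float) -> float:
    """Upper bound on requests/second when every request occupies one
    worker thread for the whole downstream call (Little's law)."""
    return worker_threads / seconds_per_call


# Figures from the incident context: 200 workers, Stripe p99 800ms vs 27s.
healthy = max_throughput_rps(200, 0.8)
degraded = max_throughput_rps(200, 27.0)

print(f"healthy ceiling:  {healthy:.0f} req/s")
print(f"degraded ceiling: {degraded:.1f} req/s")
```

Under these assumptions the pool's ceiling falls from roughly 250 req/s to about 7.4 req/s, so ordinary traffic would be enough to fill the inbound queue without any surge.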

Investigation checklist

  1. Confirm Stripe latency via APM
    curl -s 'http://apm-dashboard:8080/api/traces?service=checkout-svc&span=stripe-payment&from=18:35&to=18:45' | jq '.p99_latency_ms'

    Expected: Value near 27000ms, confirming Stripe is the bottleneck.

  2. Check thread pool state
    kubectl exec deploy/checkout-svc -- curl -s localhost:8080/actuator/threaddump | grep 'pool-1-thread' | wc -l

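    Expected: A count near 200, meaning every worker thread appears in the dump. The count alone does not show thread state, so inspect the stack traces to confirm the threads are waiting on the Stripe call.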

  3. Verify Stripe status page
    curl -s 'https://status.stripe.com/api/v1/incidents?since=18:30' | jq '.incidents[] | select(.name | contains("Connect"))'

    Expected: Incident with status 'investigating' for Connect endpoints in us-east-1.

  4. Check if circuit breaker exists
    kubectl exec deploy/checkout-svc -- grep -riE 'circuit.?breaker' /app/config/

    Expected: No output, confirming no circuit breaker configured.

Mitigation plan

  • Enable feature flag to disable Stripe payment path and fall back to a degraded mode (e.g., show 'payment unavailable' or use alternative provider if available).

    Risk: Customers cannot complete checkout via Stripe; potential revenue loss but prevents further timeouts and thread exhaustion.

    Rollback: Disable the feature flag to re-enable Stripe path once Stripe recovers.

  • If fallback not possible, increase thread pool size temporarily to 400 and reduce Stripe timeout to 10s to fail fast.

    Risk: Increased thread count may cause memory pressure; shorter timeout may cause partial failures but reduces thread blocking.

    Rollback: Revert thread pool to 200 and timeout to 30s after incident.
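The first mitigation can be pictured as a guard in front of the Stripe call. This is a minimal sketch, not the service's actual code: the flag name `stripe_path_disabled`, the in-memory `FLAGS` store, and the `call_stripe` placeholder are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckoutResult:
    status: int
    body: str


# Hypothetical in-memory flag store; a real service would read its
# feature-flag system instead.
FLAGS = {"stripe_path_disabled": False}


def call_stripe(order_id: str) -> CheckoutResult:
    # Placeholder for the real Stripe client call.
    return CheckoutResult(200, f"payment intent created for {order_id}")


def checkout(order_id: str) -> CheckoutResult:
    if FLAGS["stripe_path_disabled"]:
        # Degraded mode: answer immediately instead of holding a worker
        # thread until the Stripe timeout expires.
        return CheckoutResult(503, "payment temporarily unavailable")
    return call_stripe(order_id)
```

With the flag on, every request returns at once, which frees the worker threads. The cost is that no checkout can complete until the flag is turned off again.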

Customer impact

Approximately 78% of checkout attempts are failing with HTTP 504 errors. Users see a timeout or error page after 30 seconds. Estimated 3000 failed checkouts in 5 minutes, with $180k GMV lost. No ETA for resolution; dependent on Stripe recovery.

Postmortem draft

Postmortem: checkout-svc Outage ([Date])

Summary

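Checkout success rate dropped from 99.5% to 22% after a Stripe Connect API latency spike exhausted the shared worker thread pool, producing HTTP 504s for ~78% of checkout attempts.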

Timeline

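  • 18:35 UTC: Stripe status page reports elevated latency for Connect endpoints in us-east-1.
  • 18:38 UTC: On-call paged.
  • 18:40 UTC: checkout-svc returns HTTP 504 to ~78% of checkout attempts; success rate 22%, p99 latency 28s.
  • 18:41 UTC: Stripe confirmed as the culprit via status page and APM.
  • 18:43 UTC: On-call debating whether to wait for Stripe to recover or disable the Stripe path entirely.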
  • [Time] UTC: Incident resolved.

Impact

  • Failed checkouts: ~3000 in the first 5 minutes
  • Lost GMV: ~$180k (estimate so far)
  • Duration: [X] minutes

Root Cause

A Stripe Connect API latency spike, combined with a 30s Stripe call timeout and no circuit breaker or bulkhead on the shared worker pool, caused thread pool exhaustion.

What Went Well

  • Stripe identified as the source within 3 minutes of the page (18:38 to 18:41) via APM and the Stripe status page.
  • [Other positives]

What Went Poorly

  • No circuit breaker on Stripe client.
  • No bulkhead isolation for downstream calls.
  • [Other negatives]
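To illustrate what bulkhead isolation would have changed, here is a minimal sketch that gives Stripe calls their own small pool and rejects work once it is full. The class name, the error type, and the pool size are assumptions for illustration, and the service itself may not be written in Python.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when every slot reserved for the dependency is in use."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        # The semaphore tracks in-flight calls; the dedicated pool keeps
        # them off the main worker threads.
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)

    def submit(self, fn, *args, **kwargs) -> Future:
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("Stripe bulkhead is full")
        try:
            future = self._pool.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the call finishes, whether it succeeded or not.
        future.add_done_callback(lambda _f: self._slots.release())
        return future
```

With a cap of, say, 50 slots, a slow Stripe could tie up at most 50 callers. The remaining requests would be rejected immediately instead of queueing behind blocked threads.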

Action Items

  • [ ] Add circuit breaker to Stripe client (P0)
  • [ ] Implement bulkhead pattern for downstream dependencies (P1)
  • [ ] Reduce Stripe timeout to 10s with retry (P1)
  • [ ] Add alert on thread pool utilization >80% (P2)
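As a starting point for the P0 item, the sketch below shows the core of a circuit breaker: after a run of consecutive failures it fails fast for a cool-down period, then lets a trial call through. The thresholds and names are assumptions, and the class is not thread-safe as written. A production service would more likely use an existing resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast while circuit is open")
            # Cool-down elapsed: half-open, allow this trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open or re-open
            raise
        # Success closes the circuit and clears the failure streak.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping the Stripe client call in `breaker.call(...)` would mean that, once the threshold is hit, requests are rejected in microseconds. They would no longer occupy a worker thread for up to 30s each.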

Follow-ups

  • P0: Add circuit breaker to Stripe client with fallback logic (owner: service owner)
  • P1: Implement bulkhead pattern for downstream calls, with a separate thread pool for Stripe (owner: service owner)
  • P1: Reduce Stripe timeout to 10s and add retry with exponential backoff (owner: service owner)
  • P2: Add alert on thread pool utilization >80% and queue depth >3000 (owner: on-call SRE)
  • P2: Review and update incident response runbook for Stripe dependency failures (owner: platform team)
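The retry item could look roughly like the sketch below, which uses exponential backoff with full jitter and stops retrying once an overall time budget would be exceeded. The function name, defaults, and the 10s budget are assumptions. Note that `deadline_s` only bounds the waiting between attempts, so each attempt still needs its own client-side timeout.

```python
import random
import time


def call_with_retry(fn, *, attempts=3, base_delay_s=0.2, max_delay_s=2.0,
                    deadline_s=10.0, sleep=time.sleep, clock=time.monotonic):
    """Call fn, retrying on any exception with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    start = clock()
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: random delay up to the capped exponential step.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            delay = random.uniform(0, cap)
            if clock() - start + delay > deadline_s:
                raise  # retrying would exceed the overall budget
            sleep(delay)
```

Retrying a payment request is only safe if the request is idempotent. Stripe supports this through the `Idempotency-Key` request header, so a retried call should reuse the same key as the original attempt.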