← All incidents

[Eval][v3][en] Regional 5xx spike after DNS TTL change

service: api-gatewaycreated: 6/3/2026, 12:04:17 AM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

A DNS change at 21:40 UTC for internal-services.example.com (TTL reduced from 300s to 30s, CNAME target changed to a new NLB created only in us-east-1) caused 35% of requests to api-gateway in us-west-2 to fail with HTTP 502 due to DNS resolution failures. The new NLB has no us-west-2 record, so pods resolving the new target get NXDOMAIN or stale IPs. us-east-1 is unaffected. Incident is ongoing with 35% error rate and elevated latency.

Severity reasoning: User-facing outage: 35% error rate (502) exceeds 1% threshold for >5 minutes (started at ~21:40, paged at 22:08). Revenue path is broken for affected requests. This meets SEV1 criteria.

deepseek-chat·prompt v3·output: en·15107ms·2035↑ / 1932↓ tok·$0.00267

Root cause hypotheses

  • highDNS change deployed new CNAME target (new NLB) that does not exist in us-west-2, causing resolution failures for pods that flush their cache.

    Evidence: Logs show nginx in us-west-2 getting NXDOMAIN or stale IPs for internal-services.example.com. DNS team confirmed change at 21:40. New NLB only in us-east-1.

  • highReduced TTL (300s to 30s) caused faster propagation of the bad record, exacerbating the impact.

    Evidence: TTL change was part of the same deployment. Shorter TTL means DNS caches expire sooner, causing more pods to re-resolve and hit the missing record.

  • highOld ALB was not removed from DNS, causing split-brain where some pods resolve to old ALB (working) and others to new NLB (failing).

    Evidence: ~30% of pods working (cached old ALB IP), rest failing. Old ALB still up but not in DNS for new target.

  • lowNetwork connectivity issue between us-west-2 and the new NLB (if it were reachable) but NLB doesn't exist in region.

    Evidence: No evidence of connectivity issues; us-east-1 works fine. The NLB is simply absent in us-west-2.

  • lowDNS propagation delay causing intermittent failures, but the issue is persistent and regional.

    Evidence: Persistent 35% error rate for 30+ minutes, not intermittent. Regional scope matches missing NLB.

Investigation checklist

  1. Verify DNS resolution for internal-services.example.com from us-west-2 pods.
    kubectl exec -n prod -l app=api-gateway -- dig +short internal-services.example.com @<kube-dns-service-ip>

    Expected: Should return the old ALB IP (if cached) or NXDOMAIN. If new NLB IP appears, it's wrong region.

  2. Check if new NLB exists in us-west-2.
    aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>

    Expected: Should return load balancer details. If error or empty, NLB does not exist in us-west-2.

  3. Check nginx upstream error logs for DNS resolution failures.
    kubectl logs -n prod -l app=api-gateway --since=30m | grep -iE 'upstream timed out|connection timed out|NXDOMAIN' | tail -20

    Expected: Should show 'upstream timed out' and DNS resolution errors.

  4. Confirm old ALB is still healthy and serving.
    aws elbv2 describe-target-health --region us-west-2 --target-group-arn <old-alb-tg-arn>

    Expected: All targets healthy. Old ALB is operational.

  5. Check DNS TTL on the record from us-west-2.
    dig +noall +answer internal-services.example.com @<resolver> | head -1

    Expected: TTL should be 30 (new) or 300 (old). If 30, change propagated.

Mitigation plan

  • Revert DNS change: point internal-services.example.com back to old ALB CNAME and restore TTL to 300s.

    Risk: Low risk; old ALB is healthy. Rollback is re-applying the new change.

    Rollback: Re-apply the DNS change (new NLB CNAME) if needed later.

  • Alternatively, create a new NLB in us-west-2 and update DNS to point to it (if old ALB must be decommissioned).

    Risk: Takes longer; may require additional configuration. If misconfigured, could cause further issues.

    Rollback: Revert DNS to old ALB.

  • Hot-patch us-west-2 pods to use a static IP or a different DNS resolver that returns old ALB IP.

    Risk: Temporary workaround; may not scale. Could cause inconsistency if old ALB is decommissioned.

    Rollback: Remove the patch and revert to normal DNS.

Customer impact

Approximately 35% of requests to the API gateway in the US West (Oregon) region are failing with HTTP 502 errors. Affected users see 'Bad Gateway' errors and experience timeouts. The US East region is unaffected. Estimated impact: 35% of traffic in us-west-2, which is a significant portion of overall traffic.

Postmortem draft

Summary

A DNS change (TTL reduction and CNAME target update) caused a partial outage in us-west-2. The new target NLB was not deployed in us-west-2, leading to DNS resolution failures for ~35% of requests.

Timeline (UTC)

  • 21:40 - DNS change deployed (TTL 300→30, CNAME to new NLB)
  • ~21:40 - 502 errors begin in us-west-2
  • 22:08 - Pager triggered
  • 22:11 - DNS team confirms change
  • [FILL IN] - Mitigation applied

Impact

  • 35% error rate (502) for api-gateway in us-west-2
  • p99 latency 4s (DNS timeouts)
  • us-east-1 unaffected

Root Cause

DNS change pointed internal-services.example.com to a new NLB that only existed in us-east-1. Pods in us-west-2 that re-resolved got NXDOMAIN or stale IPs.

Detection

Pager alert for elevated 502 rate.

Response

  • [FILL IN] - Reverted DNS change

What Went Well

  • Quick identification of DNS change as cause
  • us-east-1 unaffected

What Went Poorly

  • DNS change not validated across regions
  • No pre-deployment testing

Action Items

  • [FILL IN] - Add regional validation for DNS changes
  • [FILL IN] - Implement canary DNS rollout
  • [FILL IN] - Add monitoring for DNS resolution failures per region

Follow-ups

  • P0Add pre-deployment validation to ensure DNS targets exist in all regions where they are used.platform-team
  • P1Implement canary DNS rollout with gradual TTL reduction and monitoring.dns-team
  • P1Add alerting for DNS resolution failures (NXDOMAIN, timeouts) per service per region.observability-team
  • P2Review change management process for DNS modifications to require cross-region verification.sre-team
  • P2Document this incident and create a runbook for DNS-related outages.on-call-sre