
[Eval][v2][en] Regional 5xx spike after DNS TTL change

service: api-gateway
created: 5/25/2026, 10:23:03 PM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

A planned DNS change at 21:40 UTC repointed the CNAME for internal-services.example.com to a new NLB that exists only in us-east-1. Pods in us-west-2 now get NXDOMAIN or stale IPs when resolving the name, and about 35% of api-gateway requests in us-west-2 fail with HTTP 502 (baseline 0.05%). The incident is ongoing, with p99 latency at 4s.

Severity reasoning: User-facing outage with error rate >1% (35%) for >5 minutes, meeting SEV1 criteria. Revenue path is likely broken for affected users.
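
The severity rule quoted above (error rate above 1% sustained for more than 5 minutes) can be written as a small check. This is only a sketch: the function name `meets_sev1` is hypothetical, and the thresholds are taken from the reasoning line, not from a published severity policy.

```shell
# meets_sev1 ERROR_RATE_PCT DURATION_MIN
# Succeeds (exit 0) when the error rate is above 1% and has lasted more than
# 5 minutes. Thresholds come from the severity reasoning above; the function
# name is hypothetical.
meets_sev1() {
  awk -v rate="$1" -v minutes="$2" 'BEGIN { exit !(rate > 1 && minutes > 5) }'
}
```

For example, `meets_sev1 35 10 && echo SEV1` uses the 35% rate from this incident with an illustrative 10-minute duration.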

deepseek-chat·prompt v2·output: en·10807ms

Root cause hypotheses

  • high: DNS CNAME misconfiguration: new NLB only exists in us-east-1, causing NXDOMAIN for us-west-2 queries

    Evidence: Logs show nginx in us-west-2 getting NXDOMAIN or stale IPs for internal-services.example.com; DNS team confirmed CNAME changed to new NLB which is us-east-1 only

  • medium: Stale DNS cache in some pods: pods with cached old ALB IP work, others fail

    Evidence: ~30% of pods working (cached old IP), rest failing; TTL was reduced from 300s to 30s but some pods may not have refreshed

  • low: New NLB health check failure causing DNS to return no records

    Evidence: No direct evidence; NLB is new and may not have passed health checks, but us-east-1 works fine

Investigation checklist

  1. Check DNS resolution for internal-services.example.com from us-west-2 pods
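    kubectl exec -n prod "$(kubectl get pods -n prod -l app=api-gateway -o name | head -n 1)" -- dig +short internal-services.example.com @<cluster-dns-ip>

    Expected: One or more IPs. Empty output means no answer (dig +short prints nothing on NXDOMAIN) and confirms the DNS misconfiguration; rerun without +short to see the response status.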

  2. Verify new NLB exists in us-west-2
    aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>

    Expected: A LoadBalancerNotFound error means the NLB does not exist in us-west-2

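  3. Check nginx error logs for upstream timeouts and DNS resolution failures
    kubectl logs -n prod -l app=api-gateway --since=30m --tail=-1 | grep -iE 'upstream timed out|could not be resolved' | head -20

    Expected: 'upstream timed out (110: Connection timed out)' entries naming the upstream host, and possibly 'could not be resolved' entries where nginx itself fails the lookup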

  4. Compare DNS records between regions
    dig +short internal-services.example.com @<us-west-2-resolver> && dig +short internal-services.example.com @<us-east-1-resolver>

    Expected: us-east-1 should return the new NLB's IPs. Empty output (NXDOMAIN) or old ALB IPs from the us-west-2 resolver confirms the regional mismatch

  5. Check if old ALB is still healthy and serving
    aws elbv2 describe-target-health --region us-west-2 --target-group-arn <old-alb-tg-arn>

    Expected: Targets should be healthy; if not, old ALB may be degraded
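
The cross-region comparison in step 4 can be made less error-prone with a small helper. The following is a minimal sketch, not part of the original checklist: the function name `compare_answers` and its three output labels are hypothetical, and it only compares the text that `dig +short` printed.

```shell
# compare_answers WEST_ANSWER EAST_ANSWER
# Each argument is the raw output of `dig +short` from one region's resolver.
# Prints one of: no-answer, match, differs. (Function name and labels are
# hypothetical.)
compare_answers() (
  # Sort so that record order (round-robin) does not affect the comparison.
  west=$(printf '%s\n' "$1" | sort)
  east=$(printf '%s\n' "$2" | sort)
  if [ -z "$west" ]; then
    echo "no-answer"
  elif [ "$west" = "$east" ]; then
    echo "match"
  else
    echo "differs"
  fi
)

# Usage, with the resolver placeholders from step 4 filled in:
#   west=$(dig +short internal-services.example.com @<us-west-2-resolver>)
#   east=$(dig +short internal-services.example.com @<us-east-1-resolver>)
#   compare_answers "$west" "$east"
```

In this incident, `no-answer` would correspond to the NXDOMAIN case and `differs` to a resolver still handing out old ALB IPs.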

Mitigation plan

  • Revert DNS CNAME for internal-services.example.com to point back to the old ALB in us-west-2

    Risk: Low; the old ALB is still up and answering. Recovery lags by DNS cache expiry (the record TTL is now 30s; negatively cached NXDOMAIN answers may take longer to clear).

    Rollback: Re-apply the new CNAME if needed after fixing the NLB deployment

  • If revert not possible, create a new NLB in us-west-2 and update DNS to point to it

    Risk: Medium risk; provisioning NLB takes time, and DNS propagation may cause continued errors. Ensure health checks pass before switching.

    Rollback: Revert DNS to old ALB if new NLB fails
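
If the zone happens to be hosted in Route 53, the revert in the first option could be prepared as a change batch like the one below. This is a sketch under a loud assumption: the incident notes do not say which DNS provider is used or whether the record is a plain CNAME, and the function name `cname_change_batch` is hypothetical.

```shell
# cname_change_batch RECORD_NAME TARGET TTL
# Prints a Route 53 change batch that UPSERTs a CNAME record.
# Assumption: the zone is hosted in Route 53 and the record is a plain CNAME
# (not an alias record); the incident notes do not confirm either.
cname_change_batch() {
  printf '{"Comment":"Revert CNAME to legacy ALB","Changes":[{"Action":"UPSERT",'
  printf '"ResourceRecordSet":{"Name":"%s","Type":"CNAME","TTL":%s,' "$1" "$3"
  printf '"ResourceRecords":[{"Value":"%s"}]}}]}\n' "$2"
}

# Usage (placeholders are intentionally left unfilled):
#   cname_change_batch internal-services.example.com <old-alb-dns-name> 30 > revert.json
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id <hosted-zone-id> --change-batch file://revert.json
```

Generating the batch separately from applying it leaves room for a second person to review the JSON before the change goes out.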

Customer impact

Approximately 35% of api-gateway requests in us-west-2 are failing with HTTP 502, and p99 latency there is 4s. us-east-1 is unaffected. The number of affected users has not been estimated; the 35% figure is a share of requests, not of users.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

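  • 21:40: DNS team changed the TTL for internal-services.example.com from 300s to 30s and repointed its CNAME from the legacy ALB to a new NLB (us-east-1 only)
  • 22:08: On-call paged
  • 22:10: api-gateway in us-west-2 returning HTTP 502 for ~35% of requests
  • 22:11: DNS team confirmed the planned change
  • 22:12: On-call debated reverting DNS versus hot-patching us-west-2 to point at the us-east-1 NLB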
  • [FILL IN] Mitigation applied
  • [FILL IN] Incident resolved

Impact

  • 35% error rate on api-gateway in us-west-2 for [duration] minutes
  • p99 latency 4s
  • Estimated [X] users affected

Root Cause

DNS CNAME for internal-services.example.com was changed to point to a new NLB that only existed in us-east-1, causing DNS resolution failures in us-west-2.

Detection

An alert on the 502 error rate paged on-call at 22:08, 28 minutes after the 21:40 change.
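
The time to detect follows from two timeline entries: the change at 21:40 and the pager at 22:08. A minimal sketch of that arithmetic, assuming zero-padded HH:MM inputs (the function name `minutes_between` is hypothetical):

```shell
# minutes_between START END
# START and END are zero-padded HH:MM times (UTC, 24-hour); prints the
# elapsed minutes, wrapping around midnight. Function name is hypothetical.
minutes_between() (
  start_h=${1%%:*}; start_m=${1##*:}
  end_h=${2%%:*};   end_m=${2##*:}
  # Strip one leading zero so values like "08" are not parsed as octal.
  start=$(( ${start_h#0} * 60 + ${start_m#0} ))
  end=$(( ${end_h#0} * 60 + ${end_m#0} ))
  echo $(( (end - start + 1440) % 1440 ))
)

minutes_between 21:40 22:08   # change applied -> pager; prints 28
```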

Response

  • [FILL IN] Reverted DNS to old ALB
  • [FILL IN] Verified recovery

What Went Well

  • Monitoring detected the failure and paged on-call (22:08)
  • DNS team available to confirm change

What Went Poorly

  • DNS change not validated across regions before deployment
  • No pre-change review for cross-region impact

Action Items

  • [FILL IN] Add pre-deployment validation for DNS changes across regions
  • [FILL IN] Implement canary DNS rollout
  • [FILL IN] Review change management process for DNS

Follow-ups

  • P0: Add pre-deployment validation to check that DNS targets exist in all required regions (platform team)
  • P1: Implement canary DNS rollout with gradual traffic shift (DNS team)
  • P1: Review change management process to require cross-region impact assessment for DNS changes (SRE team)
  • P2: Add monitoring for DNS resolution failures per region (observability team)
  • P2: Document incident and update runbook for DNS-related outages (on-call SRE)
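
The P0 follow-up (checking that a DNS target exists in every required region before the change ships) could start from a check like the one below. This is only a sketch of the idea: the function name `missing_regions` is hypothetical, and how the list of regions where the target exists is gathered is left to the caller.

```shell
# missing_regions REQUIRED PRESENT
# Both arguments are space-separated region lists. Prints every region that
# is required but not present, one per line. Function name is hypothetical.
missing_regions() (
  required=$1
  present=$2
  for region in $required; do
    case " $present " in
      *" $region "*) ;;          # target exists in this region
      *) echo "$region" ;;       # required region without a target
    esac
  done
)

# Usage: fail the change when any required region lacks the target.
#   missing=$(missing_regions "us-east-1 us-west-2" "us-east-1")
#   [ -z "$missing" ] || { echo "target missing in: $missing"; exit 1; }
```

With the regions from this incident, the check would have reported us-west-2 as missing before the CNAME was repointed.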