← All incidents

[Eval][v1][en] Regional 5xx spike after DNS TTL change

service: api-gatewaycreated: 5/25/2026, 10:22:33 PM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

A DNS change for internal-services.example.com at 21:40 UTC caused a partial outage in us-west-2: the new CNAME target (NLB) only exists in us-east-1, so ~65% of pods in us-west-2 fail to resolve the new target, resulting in 35% 502 errors and elevated latency. The remaining 35% of pods still have cached old ALB IPs and work normally.

Severity reasoning: User-facing outage with 35% error rate and significant latency increase (p99 4s) in us-west-2, impacting a large subset of users. Revenue impact likely. Reversibility is high (revert DNS change), but immediate mitigation required.

deepseek-chat·prompt v1·output: en·10490ms

Root cause hypotheses

  • highDNS CNAME target points to NLB that does not exist in us-west-2

    Evidence: DNS team changed CNAME for internal-services.example.com to a new NLB created only in us-east-1; us-west-2 pods get NXDOMAIN or stale IPs.

  • highDNS TTL reduction from 300s to 30s caused rapid propagation of bad records

    Evidence: TTL changed at 21:40 UTC; incident started shortly after. Lower TTL means faster adoption of the bad record.

  • lowOld ALB DNS records were removed, causing fallback to fail

    Evidence: Old ALB still up but its DNS records were not removed; however, pods that cached old IPs work, suggesting removal not the issue.

Investigation checklist

  1. Verify DNS resolution for internal-services.example.com from us-west-2 pods
    kubectl exec -n <namespace> <pod> -- nslookup internal-services.example.com

    Expected: Should return the new NLB IP (us-east-1) or NXDOMAIN; currently likely NXDOMAIN or stale IP.

  2. Check if the new NLB exists in us-west-2
    aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>

    Expected: Should return load balancer details; currently likely 'LoadBalancerNotFound'.

  3. Check nginx upstream configuration for DNS caching
    kubectl exec -n <namespace> <pod> -- cat /etc/nginx/nginx.conf | grep -A5 resolver

    Expected: Should show resolver directive and valid_names; check if resolver is set to cluster DNS.

  4. Confirm us-east-1 resolution works
    kubectl exec -n <namespace> -c <container> -- nslookup internal-services.example.com

    Expected: Should return the new NLB IP in us-east-1.

  5. Check DNS records for internal-services.example.com
    dig internal-services.example.com @<dns-server> +short

    Expected: Should show CNAME to new NLB; currently shows new NLB (us-east-1 IP).

Mitigation plan

  • Immediately revert the DNS change: set CNAME back to old ALB and restore TTL to 300s

    Risk: Brief propagation delay; no destructive operations.

    Rollback: Re-apply the new DNS change after creating NLB in us-west-2.

  • If revert is not possible, create a new NLB in us-west-2 and update DNS to point to it

    Risk: Takes time; may require cross-region configuration.

    Rollback: Revert DNS to old ALB.

Customer impact

Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 errors. Affected users experience timeouts or errors. us-east-1 users are unaffected. ETA: immediate mitigation via DNS revert expected within 5 minutes.

Postmortem draft

Postmortem: DNS Change Causes Partial Outage in us-west-2

Summary: A planned DNS change (TTL reduction and CNAME update) pointed internal-services.example.com to a new NLB that only existed in us-east-1, causing 35% 502 errors in us-west-2.

Timeline:

  • 21:40 UTC: DNS team changes TTL 300s→30s and CNAME to new NLB (us-east-1 only)
  • 22:08 UTC: Pager alert for 502 errors in us-west-2
  • 22:11 UTC: DNS team confirms change
  • 22:12 UTC: Decision to revert DNS
  • 22:15 UTC: DNS reverted, errors subside

Impact: 35% error rate for us-west-2 api-gateway, p99 latency 4s, duration ~7 minutes.

Root Cause: New NLB not deployed in us-west-2; DNS change propagated before infrastructure was ready.

What Went Well: Quick detection, clear communication with DNS team.

What Went Poorly: No pre-flight check for NLB existence in all regions; DNS change not coordinated with infrastructure readiness.

Action Items:

  • [ ] Implement pre-deployment DNS validation (check target existence in all regions)
  • [ ] Add canary DNS rollout with automatic rollback
  • [ ] Review change management process for DNS updates

Follow-ups

  • P0Create NLB in us-west-2 for internal-services.example.complatform team
  • P1Add pre-deployment check to DNS change pipeline to verify target existence in all regionsplatform team
  • P1Implement canary DNS rollout with automatic rollback on error rate increaseSRE team
  • P2Review and update incident response runbook for DNS-related outageson-call SRE
  • P2Add monitoring for DNS resolution failures per regionobservability team