
[Eval][v1][en] Regional 5xx spike after DNS TTL change

service: api-gateway
created: 6/3/2026, 12:00:41 AM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

A DNS change for internal-services.example.com at 21:40 UTC caused a 35% 502 error rate in us-west-2 api-gateway because the new CNAME target (NLB) only exists in us-east-1. Pods with cached old ALB IPs work; others fail with DNS resolution timeouts.

Severity reasoning: User-facing outage with 35% error rate and elevated latency (p99 4s) in us-west-2, affecting a significant portion of traffic. Revenue impact likely. Reversibility is high (revert DNS change), but immediate user impact qualifies as SEV1.

deepseek-chat·prompt v1·output: en·12153ms·1403↑ / 1497↓ tok·$0.00203

Root cause hypotheses

  • [high] DNS CNAME target points to an NLB that does not exist in us-west-2, causing resolution failures for pods without cached old ALB IPs.

    Evidence: DNS team changed CNAME to new NLB which was created only in us-east-1; us-west-2 pods get NXDOMAIN or stale IPs; nginx logs show 'upstream timed out' with DNS resolution timeouts.

  • [high] TTL reduction from 300s to 30s caused rapid propagation of the bad DNS record, increasing the failure rate as caches expired.

    Evidence: TTL changed at 21:40 UTC; incident started shortly after; 30% of pods still have old cached IPs (pre-change) and work.

  • [low] The old ALB was decommissioned or its DNS records were removed. The incident context shows it is still up and answering, so this is not the cause.

    Evidence: Old ALB still up; pods with cached old IPs work fine.

Investigation checklist

  1. Verify DNS resolution for internal-services.example.com from a pod in us-west-2.
    kubectl exec -n <namespace> <pod-name> -- nslookup internal-services.example.com

    Expected: Should return the new NLB IP (us-east-1) or NXDOMAIN; if old ALB IP, pod has cached entry.

  2. Check if the new NLB exists in us-west-2.
    aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>

    Expected: A LoadBalancerNotFound error (or an empty result) means the NLB does not exist in us-west-2, confirming the root cause.

  3. Check nginx upstream configuration to see if it uses DNS for upstream resolution.
    kubectl exec -n <namespace> <pod-name> -- cat /etc/nginx/conf.d/default.conf | grep -A5 upstream

    Expected: Should show 'resolver' directive and upstream pointing to internal-services.example.com.

  4. Check the TTL of the DNS record from a pod.
    kubectl exec -n <namespace> <pod-name> -- dig +noall +answer internal-services.example.com

    Expected: The TTL column (second field) should be 30s or lower; a value above 30s means the resolver is still serving a record cached with the old 300s TTL.

  5. Confirm that us-east-1 pods resolve correctly (run against the us-east-1 cluster).
    kubectl exec -n <namespace> <pod-name> -c <container> -- nslookup internal-services.example.com

    Expected: Should return the new NLB IP in us-east-1.
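
When the same lookup has to be compared across many pods, it may help to reduce each answer to a few fields. The following is a minimal sketch, not part of the original checklist: `summarize_answer` is a hypothetical helper name, and it assumes `dig` is invoked with `+noall +answer` so that every answer line has the form `name TTL class type value`.

```shell
# Hypothetical helper: summarize `dig +noall +answer` output read from stdin.
# Prints one "<type> <ttl> <value>" line per answer record, or "NO_ANSWER"
# when the response carried no answer records (e.g. NXDOMAIN or a timeout).
summarize_answer() {
  awk '
    # Answer lines have at least 5 fields; skip comment lines starting with ";".
    NF >= 5 && $1 !~ /^;/ { printf "%s %s %s\n", $4, $2, $5; found = 1 }
    END { if (!found) print "NO_ANSWER" }
  '
}
```

A possible invocation, using the same placeholders as the checklist: `kubectl exec -n <namespace> <pod-name> -- dig +noall +answer internal-services.example.com | summarize_answer`. Pods that print NO_ANSWER are the ones that cannot resolve the name at all.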

Mitigation plan

  • Revert the DNS change: point internal-services.example.com back to the old ALB CNAME and restore TTL to 300s.

    Risk: Low risk; old ALB is still operational. DNS propagation may take a few minutes.

    Rollback: Re-apply the new DNS change if needed after fixing the NLB deployment.

  • Alternatively, create the new NLB in us-west-2 and update DNS to include both regions (e.g., using Route53 latency routing).

    Risk: Takes longer; requires creating NLB and updating DNS; may cause additional errors during creation.

    Rollback: Delete the new NLB in us-west-2 and revert DNS to old ALB.
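
The DNS revert in the first option could be scripted roughly as follows. This is a sketch under stated assumptions, not the team's actual procedure: it assumes the record is hosted in Route 53 (the incident context does not say so), `cname_upsert_batch` is a hypothetical helper name, and the old ALB's DNS name and the hosted zone ID are placeholders that must be filled in by the operator.

```shell
# Hypothetical helper: print a Route 53 change batch (JSON) that UPSERTs a
# CNAME record. Assumes the zone lives in Route 53.
# Arguments: 1 = record name, 2 = CNAME target, 3 = TTL in seconds.
cname_upsert_batch() {
  printf '{"Comment":"revert CNAME","Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"CNAME","TTL":%d,"ResourceRecords":[{"Value":"%s"}]}}]}\n' "$1" "$3" "$2"
}
```

Under those assumptions, the revert would be applied with `aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch "$(cname_upsert_batch internal-services.example.com <old-alb-dns-name> 300)"`. Generating the batch separately makes it easy to review the exact record and TTL before applying it.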

Customer impact

Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 errors. Users experience timeouts or errors. us-east-1 is unaffected. ETA: mitigation within 15 minutes if DNS is reverted.

Postmortem draft

Postmortem: api-gateway 502 errors in us-west-2

Summary: DNS change for internal-services.example.com pointed to an NLB that only existed in us-east-1, causing 35% of requests in us-west-2 to fail.

Timeline:

  • 21:40 UTC: DNS team changes TTL and CNAME for internal-services.example.com
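  • 22:08 UTC: Pager
  • 22:10 UTC: 502 rate at ~35% in us-west-2; us-east-1 unaffected
  • 22:11 UTC: DNS team confirms the planned change
  • 22:12 UTC: Mitigation options debated (revert DNS vs. hot-patch us-west-2 to point at the us-east-1 NLB)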

Impact: 35% error rate, p99 latency 4s, us-west-2 only.

Root Cause: New NLB not deployed in us-west-2; DNS change propagated before infrastructure was ready.

What went well: Quick detection, clear logs, DNS team available.

What went poorly: No pre-flight check for NLB existence in all regions; TTL reduction accelerated failure.

Action Items:

  • [ ] Add pre-deployment validation that DNS targets exist in all regions
  • [ ] Implement canary DNS changes with gradual TTL reduction
  • [ ] Update runbook for DNS-related incidents
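
The first action item could start from something like the sketch below. It is illustrative only: `lb_exists`, `check_regions`, and the `LB_LOOKUP` override are hypothetical names, and it assumes the target is an ELBv2 load balancer that has the same name in every region, looked up with the same `aws elbv2 describe-load-balancers` call used in the investigation checklist.

```shell
# Hypothetical pre-flight check: verify that a load balancer with the given
# name exists in every listed region before DNS is pointed at it.

# Default lookup: succeeds only if the AWS CLI finds the load balancer.
lb_exists() {
  aws elbv2 describe-load-balancers --region "$1" --names "$2" >/dev/null 2>&1
}

# Usage: check_regions <lb-name> <region>...
# Prints "ok <region>" or "MISSING <region>" per region and returns non-zero
# if any region is missing. LB_LOOKUP can name an alternative lookup function
# (an assumption added here so the check can be exercised without AWS access).
check_regions() {
  name="$1"
  shift
  missing=0
  for region in "$@"; do
    if "${LB_LOOKUP:-lb_exists}" "$region" "$name"; then
      echo "ok $region"
    else
      echo "MISSING $region"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_regions <new-nlb-name> us-east-1 us-west-2` would be expected to report us-west-2 as missing in the situation described in this incident, which is the signal a change pipeline could use to block the DNS update.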

Follow-ups

  • [P1] Add pre-deployment validation script to check DNS target existence in all regions. (platform team)
  • [P1] Implement canary DNS changes with gradual TTL reduction and monitoring. (DNS team)
  • [P2] Update incident runbook for DNS-related api-gateway issues. (on-call SRE)
  • [P2] Review change management process for DNS modifications. (service owner)