← All incidents

[Eval][v2][en] Regional 5xx spike after DNS TTL change

service: api-gatewaycreated: 6/2/2026, 11:39:59 PM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

A planned DNS change at 21:40 UTC misconfigured the CNAME for internal-services.example.com to point to a new NLB that only exists in us-east-1, causing 35% of requests in us-west-2 to fail with 502 errors due to DNS resolution failures. The incident is ongoing with a partial workaround via cached DNS.

Severity reasoning: User-facing outage with 35% error rate > 1% for >5 minutes, revenue path broken (502 errors on API gateway), and regional unavailability in us-west-2. Meets SEV1 criteria.

deepseek-chat·prompt v2·output: en·11496ms·1784↑ / 1708↓ tok·$0.00236

Root cause hypotheses

  • highDNS CNAME target points to a NLB that does not exist in us-west-2, causing NXDOMAIN for pods that resolve the new record.

    Evidence: Logs show nginx in us-west-2 getting NXDOMAIN for internal-services.example.com; new NLB only created in us-east-1.

  • mediumTTL reduction from 300s to 30s caused rapid propagation of the bad record, overwhelming pods with stale or missing DNS.

    Evidence: DNS change at 21:40 UTC included TTL change; 35% failure rate correlates with DNS cache expiry.

  • mediumOld ALB DNS records were not removed, causing split-brain DNS where some pods resolve to old ALB (working) and others to new NLB (failing).

    Evidence: ~30% of pods working (cached old ALB IP); rest failing; old ALB still up but its DNS not removed.

  • lowNetwork connectivity issue between us-west-2 and the new NLB (e.g., cross-region latency or security group).

    Evidence: No evidence of connectivity issues; us-east-1 works fine; error is DNS resolution timeout, not connection timeout.

Investigation checklist

  1. Check DNS resolution for internal-services.example.com from a pod in us-west-2.
    kubectl exec -n prod -l app=api-gateway -- nslookup internal-services.example.com 2>&1 | head -20

    Expected: Should show NXDOMAIN or an IP not matching the new NLB. If it shows the old ALB IP, the pod has cached DNS.

  2. Verify the new NLB exists in us-west-2.
    aws elbv2 describe-load-balancers --region us-west-2 --names internal-services-nlb 2>&1

    Expected: Should return the NLB details. If error 'LoadBalancerNotFound', confirms missing NLB.

  3. Check DNS records for internal-services.example.com in Route53.
    aws route53 list-resource-record-sets --hosted-zone-id ZONE_ID --query "ResourceRecordSets[?Name=='internal-services.example.com.']"

    Expected: Should show CNAME pointing to new NLB DNS name. Verify target exists in us-west-2.

  4. Check nginx error logs for upstream timeout details.
    kubectl logs -n prod -l app=api-gateway --since=30m | grep 'upstream timed out' | head -10

    Expected: Should show 'Connection timed out while connecting to upstream' with IP addresses. Confirm those IPs are not from us-west-2 NLB.

  5. Check DNS cache TTL and resolution statistics on a failing pod.
    kubectl exec -n prod -l app=api-gateway -- cat /etc/resolv.conf && kubectl exec -n prod -l app=api-gateway -- nslookup -debug internal-services.example.com 2>&1 | tail -20

    Expected: Should show TTL and resolution attempts. If NXDOMAIN, confirms DNS issue.

Mitigation plan

  • Revert DNS change: point internal-services.example.com CNAME back to the old ALB in us-west-2.

    Risk: Low risk; old ALB is still operational. May cause brief DNS propagation delay.

    Rollback: Re-apply the new CNAME target after fixing the NLB deployment.

  • Create a new NLB in us-west-2 and update DNS to point to it (if revert is not preferred).

    Risk: Medium risk; requires creating NLB, configuring target groups, and updating DNS. May take 10-15 minutes.

    Rollback: Revert DNS to old ALB if new NLB fails.

  • Hot-patch us-west-2 pods to use us-east-1 NLB endpoint via environment variable or configmap override.

    Risk: Medium risk; cross-region latency may increase p99, and us-east-1 NLB may not be configured for us-west-2 traffic.

    Rollback: Revert the configmap or environment variable change.

Customer impact

Approximately 35% of API requests to us-west-2 are failing with HTTP 502 errors. Users experience timeouts or errors when accessing services. us-east-1 users are unaffected. No ETA yet.

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

  • 21:40: DNS team changes TTL and CNAME for internal-services.example.com
  • 22:08: Pager triggered for 502 errors in us-west-2
  • 22:10: Incident declared
  • 22:11: DNS team confirms change
  • 22:12: Mitigation options discussed [FILL IN rest]

Impact

35% error rate on api-gateway in us-west-2, p99 latency 4s, affecting all downstream services.

Root Cause

DNS CNAME pointed to a NLB that did not exist in us-west-2, causing DNS resolution failures for pods that did not have cached old records.

Detection

Monitoring alert for 502 error rate spike.

Response

[FILL IN]

What Went Well

  • Quick identification of DNS change as root cause
  • Partial mitigation via cached DNS

What Went Poorly

  • DNS change not validated across regions
  • No pre-deployment check for NLB existence

Action Items

  • [FILL IN]

Follow-ups

  • P0Implement pre-deployment DNS validation to ensure target endpoints exist in all regions.platform team
  • P1Add monitoring for DNS resolution failures per service per region.observability team
  • P1Review DNS change process to require multi-region validation.DNS team
  • P2Create runbook for DNS-related incidents with rollback steps.on-call SRE
  • P2Evaluate using service mesh or internal DNS with health checks to avoid single points of failure.platform team