[Eval][v1][en] Regional 5xx spike after DNS TTL change

service: api-gateway
created: 6/3/2026, 12:00:58 AM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

A DNS change for internal-services.example.com at 21:40 UTC drove the api-gateway 502 rate in us-west-2 to 35% (baseline 0.05%) and p99 latency to 4s. The new CNAME target (an NLB) exists only in us-east-1, so us-west-2 pods cannot resolve the service. Pods that still hold the legacy ALB's IPs in cache keep working; the rest fail with NXDOMAIN or DNS resolution timeouts.

Severity reasoning: User-facing outage with 35% error rate and significant latency degradation in a major region (us-west-2). Revenue impact likely. Scope is regional but affects a large percentage of traffic. Reversibility is high (revert DNS or add us-west-2 record), but immediate action required.

deepseek-chat·prompt v1·output: en·10731ms·1403↑ / 1371↓ tok·$0.00189

Root cause hypotheses

  • high: DNS CNAME target points to an NLB that does not exist in us-west-2

    Evidence: New NLB was created only in us-east-1; us-west-2 pods get NXDOMAIN or stale IPs for internal-services.example.com

  • medium: DNS TTL reduction from 300s to 30s caused rapid propagation of bad records

    Evidence: TTL change was made at 21:40 UTC, incident started shortly after; 30% of pods still have cached old ALB IPs

  • low: Old ALB DNS records not removed, causing split-brain resolution

    Evidence: Old ALB still up but its DNS records were not removed; some pods may resolve to old ALB (working) while others get new NLB (failing)

Investigation checklist

  1. Verify DNS resolution for internal-services.example.com from a us-west-2 pod
    kubectl exec -n <namespace> <pod-name> -- nslookup internal-services.example.com

    Expected: A healthy lookup returns an address reachable from us-west-2; currently it likely returns NXDOMAIN or a stale legacy ALB IP

  2. Check if the new NLB exists in us-west-2
    aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>

    Expected: A healthy setup returns the load balancer details; currently it likely returns a LoadBalancerNotFound error, because the NLB was created in us-east-1 only

  3. Check nginx error logs for upstream connection timeouts
    kubectl logs -n <namespace> -l app=api-gateway --tail=100 | grep 'upstream timed out'

    Expected: Lines showing 'upstream timed out (110: Connection timed out) while connecting to upstream'

  4. Confirm us-east-1 pods resolve correctly
    kubectl --context <us-east-1-context> exec -n <namespace> <pod-name> -- nslookup internal-services.example.com

    Expected: Returns the new NLB IP in us-east-1
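
To measure how many pods are affected rather than spot-checking one, the lookup from step 1 can be run in every pod and the outcomes tallied. The following is a minimal sketch, not a vetted runbook script: it assumes the pods carry the label app=api-gateway (as in step 3), that nslookup is available in the container image, and that its output contains the usual "NXDOMAIN", "timed out", or "Name:" markers. Both function names are hypothetical.

```shell
# Classify the text output of one nslookup run, read from stdin.
# Prints one of: nxdomain, timeout, resolved, unknown.
# Assumption: the output format of a common nslookup build; adjust the
# patterns if the image ships a different resolver tool.
classify_lookup() (
  out=$(cat)
  case "$out" in
    *NXDOMAIN*)     echo "nxdomain" ;;
    *"timed out"*)  echo "timeout" ;;
    *Name:*)        echo "resolved" ;;
    *)              echo "unknown" ;;
  esac
)

# Run the lookup in every api-gateway pod of a namespace and count outcomes.
# Arguments: namespace, hostname to resolve.
# Uses the current kubectl context, so point it at the us-west-2 cluster first.
tally_pod_lookups() (
  ns=$1
  host=$2
  for pod in $(kubectl get pods -n "$ns" -l app=api-gateway -o name); do
    kubectl exec -n "$ns" "$pod" -- nslookup "$host" 2>&1 | classify_lookup
  done | sort | uniq -c
)

# Example (not executed here):
#   tally_pod_lookups <namespace> internal-services.example.com
```

A "resolved" count only shows that a pod got an answer; it does not distinguish the legacy ALB from the new NLB, so compare the returned addresses separately if that matters.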

Mitigation plan

  • Revert DNS change: point internal-services.example.com back to the old ALB CNAME and restore TTL to 300s

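    Risk: Low; the old ALB is still operational. Resolvers that cached the new record should pick up the revert once the current 30s TTL expires, although cached NXDOMAIN answers may persist longer depending on the zone's negative-caching settings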

    Rollback: Re-apply the new DNS record (CNAME to new NLB) if needed

  • If revert is not possible, create a new NLB in us-west-2 and update DNS to point to that NLB

    Risk: Time to create NLB and propagate DNS; may take 5-10 minutes; new NLB must be configured correctly

    Rollback: Delete the new NLB and revert DNS to old ALB
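
The DNS revert in the first option could be scripted as below. This is a sketch under a significant assumption: the incident notes do not say where the zone is hosted, so Amazon Route 53 is assumed here purely for illustration. The helper name cname_change_batch and the variables ZONE_ID and LEGACY_ALB_DNS are hypothetical placeholders, and the record is treated as a plain CNAME as described under Recent changes.

```shell
# Print a Route 53 change batch (JSON) that UPSERTs a CNAME record.
# Arguments: record name, CNAME target, TTL in seconds.
# Assumption: the zone lives in Route 53 and the record is a plain CNAME,
# not an alias record.
cname_change_batch() {
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"CNAME","TTL":%s,"ResourceRecords":[{"Value":"%s"}]}}]}\n' \
    "$1" "$3" "$2"
}

# Example revert (not executed here): point the name back at the legacy ALB
# and restore the 300s TTL.
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id "$ZONE_ID" \
#     --change-batch "$(cname_change_batch internal-services.example.com "$LEGACY_ALB_DNS" 300)"
```

The same helper covers the rollback direction: call it with the new NLB's DNS name and a TTL of 30 to re-apply the migration record.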

Customer impact

Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 errors. Affected users experience timeouts or errors. us-east-1 users are unaffected. ETA for full resolution is 15-30 minutes pending DNS propagation.

Postmortem draft

Postmortem: api-gateway 502 errors in us-west-2

Summary: [2-3 sentences]

Timeline:

  • 21:40 UTC: DNS team changes TTL and CNAME for internal-services.example.com
  • 22:08 UTC: Pager alert for 502 errors
  • 22:10 UTC: Incident declared
  • [Mitigation time]: DNS reverted / NLB created

Impact: 35% error rate, p99 latency 4s, us-west-2 only

Root Cause: The new NLB existed only in us-east-1, so us-west-2 pods could not resolve internal-services.example.com after the CNAME change

What went well: Quick detection, clear logs, DNS team available

What went poorly: No pre-validation of DNS change across regions, no canary

Action Items:

  • [ ] Add pre-change validation to check DNS targets exist in all regions
  • [ ] Implement canary DNS rollout
  • [ ] Add monitoring for DNS resolution failures per region
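
As one possible shape for the pre-change validation item, the sketch below checks that a load balancer exists in every region before a CNAME is repointed at it. It is illustrative only: the function names are hypothetical, it assumes the load balancer has the same name in each region, and it checks only that the load balancer exists, not that a matching regional DNS record does.

```shell
# Succeeds if a load balancer named $1 exists in region $2.
# Any failure (including missing credentials) is treated as "not found".
lb_exists() {
  aws elbv2 describe-load-balancers --region "$2" --names "$1" >/dev/null 2>&1
}

# Print every region (arguments 2..n) in which load balancer $1 is missing.
# Exits non-zero if at least one region is missing, so it can gate a change.
missing_regions() (
  name=$1
  shift
  status=0
  for region in "$@"; do
    if ! lb_exists "$name" "$region"; then
      echo "$region"
      status=1
    fi
  done
  exit "$status"
)

# Example gate (not executed here):
#   missing_regions <new-nlb-name> us-east-1 us-west-2 || echo "abort DNS change"
```

Run against the state described in this incident, such a check would be expected to list us-west-2 and block the change.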

Follow-ups

  • P0: Add pre-change validation to ensure DNS targets exist in all required regions (owner: platform team)
  • P1: Implement canary DNS rollout with gradual TTL reduction (owner: DNS team)
  • P1: Add monitoring for DNS resolution failures per region in api-gateway (owner: on-call SRE)
  • P2: Review DNS change process to require cross-region validation (owner: service owner)