[Eval][v1][en] Regional 5xx spike after DNS TTL change
service: api-gatewaycreated: 5/25/2026, 10:22:33 PM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected. Symptoms (us-west-2 only): - 502 rate: 35% (baseline 0.05%) - p99 latency: 4s (mostly DNS resolution timeouts in error trace) - nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream" Recent changes: - 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration - Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB) - Old ALB still up and answering, but its DNS records were not removed - New NLB was created in us-east-1 only — no us-west-2 record exists Logs: - nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs - ~30% of pods have cached the old ALB IP and are working; rest are failing - us-east-1 pods all resolve correctly to the new NLB On-call: - 22:08 — pager - 22:11 — DNS team confirms the planned change - 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A DNS change for internal-services.example.com at 21:40 UTC caused a partial outage in us-west-2: the new CNAME target (NLB) only exists in us-east-1, so ~65% of pods in us-west-2 fail to resolve the new target, resulting in 35% 502 errors and elevated latency. The remaining 35% of pods still have cached old ALB IPs and work normally.
Severity reasoning: User-facing outage with 35% error rate and significant latency increase (p99 4s) in us-west-2, impacting a large subset of users. Revenue impact likely. Reversibility is high (revert DNS change), but immediate mitigation required.
deepseek-chat·prompt v1·output: en·10490ms
Root cause hypotheses
- highDNS CNAME target points to NLB that does not exist in us-west-2
Evidence: DNS team changed CNAME for internal-services.example.com to a new NLB created only in us-east-1; us-west-2 pods get NXDOMAIN or stale IPs.
- highDNS TTL reduction from 300s to 30s caused rapid propagation of bad records
Evidence: TTL changed at 21:40 UTC; incident started shortly after. Lower TTL means faster adoption of the bad record.
- lowOld ALB DNS records were removed, causing fallback to fail
Evidence: Old ALB still up but its DNS records were not removed; however, pods that cached old IPs work, suggesting removal not the issue.
Investigation checklist
- Verify DNS resolution for internal-services.example.com from us-west-2 pods
kubectl exec -n <namespace> <pod> -- nslookup internal-services.example.comExpected: Should return the new NLB IP (us-east-1) or NXDOMAIN; currently likely NXDOMAIN or stale IP.
- Check if the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>Expected: Should return load balancer details; currently likely 'LoadBalancerNotFound'.
- Check nginx upstream configuration for DNS caching
kubectl exec -n <namespace> <pod> -- cat /etc/nginx/nginx.conf | grep -A5 resolverExpected: Should show resolver directive and valid_names; check if resolver is set to cluster DNS.
- Confirm us-east-1 resolution works
kubectl exec -n <namespace> -c <container> -- nslookup internal-services.example.comExpected: Should return the new NLB IP in us-east-1.
- Check DNS records for internal-services.example.com
dig internal-services.example.com @<dns-server> +shortExpected: Should show CNAME to new NLB; currently shows new NLB (us-east-1 IP).
Mitigation plan
Immediately revert the DNS change: set CNAME back to old ALB and restore TTL to 300s
Risk: Brief propagation delay; no destructive operations.
Rollback: Re-apply the new DNS change after creating NLB in us-west-2.
If revert is not possible, create a new NLB in us-west-2 and update DNS to point to it
Risk: Takes time; may require cross-region configuration.
Rollback: Revert DNS to old ALB.
Customer impact
Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 errors. Affected users experience timeouts or errors. us-east-1 users are unaffected. ETA: immediate mitigation via DNS revert expected within 5 minutes.
Postmortem draft
Postmortem: DNS Change Causes Partial Outage in us-west-2
Summary: A planned DNS change (TTL reduction and CNAME update) pointed internal-services.example.com to a new NLB that only existed in us-east-1, causing 35% 502 errors in us-west-2.
Timeline:
- 21:40 UTC: DNS team changes TTL 300s→30s and CNAME to new NLB (us-east-1 only)
- 22:08 UTC: Pager alert for 502 errors in us-west-2
- 22:11 UTC: DNS team confirms change
- 22:12 UTC: Decision to revert DNS
- 22:15 UTC: DNS reverted, errors subside
Impact: 35% error rate for us-west-2 api-gateway, p99 latency 4s, duration ~7 minutes.
Root Cause: New NLB not deployed in us-west-2; DNS change propagated before infrastructure was ready.
What Went Well: Quick detection, clear communication with DNS team.
What Went Poorly: No pre-flight check for NLB existence in all regions; DNS change not coordinated with infrastructure readiness.
Action Items:
- [ ] Implement pre-deployment DNS validation (check target existence in all regions)
- [ ] Add canary DNS rollout with automatic rollback
- [ ] Review change management process for DNS updates
Follow-ups
- P0Create NLB in us-west-2 for internal-services.example.com— platform team
- P1Add pre-deployment check to DNS change pipeline to verify target existence in all regions— platform team
- P1Implement canary DNS rollout with automatic rollback on error rate increase— SRE team
- P2Review and update incident response runbook for DNS-related outages— on-call SRE
- P2Add monitoring for DNS resolution failures per region— observability team
Similar past incidents
lexical match (pg_trgm)
- 64%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 45%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 44%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 28%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted
- 27%
[Eval][v1][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted