[Eval][v1][en] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/2/2026, 11:39:31 PM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A DNS change at 21:40 UTC for internal-services.example.com (TTL reduced from 300s to 30s, CNAME target switched from a legacy ALB to a new NLB) caused a partial outage in us-west-2 because the new NLB only exists in us-east-1. Approximately 35% of requests to api-gateway in us-west-2 are failing with 502 errors due to DNS resolution failures, while pods with cached old ALB IPs continue to work. us-east-1 is unaffected.
Severity reasoning: User-facing outage affecting 35% of requests in us-west-2 with elevated error rates and latency, impacting revenue and customer experience. The scope is regional but significant, and the issue is reversible by reverting the DNS change or adding a us-west-2 NLB.
deepseek-chat·prompt v1·output: en·11088ms·1403↑ / 1707↓ tok·$0.00226
Root cause hypotheses
- high: The new NLB was created only in us-east-1, so DNS resolution fails in us-west-2 when clients query the new CNAME target.
Evidence: DNS team confirmed new NLB is in us-east-1 only; us-west-2 pods get NXDOMAIN or stale IPs; us-east-1 works fine.
- high: Reducing the TTL from 300s to 30s made the bad record propagate faster: cached answers expire sooner, so pods that had cached the old ALB IP pick up the failing resolution more quickly.
Evidence: TTL change at 21:40 UTC; incident started shortly after; 30% of pods still work (cached old IP).
- medium: The DNS team accidentally deleted the us-west-2 DNS record for internal-services.example.com during the migration.
Evidence: Logs show NXDOMAIN for some queries; no explicit confirmation of deletion, but consistent with missing record.
- low: The new NLB is misconfigured (e.g., health checks failing, listener missing), causing it to drop connections.
Evidence: No direct evidence; us-east-1 works, so NLB itself is likely fine.
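The TTL hypothesis can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming every cache in the path honors the TTL it was given and that all timestamps fall on the same UTC day; the helper names are illustrative and not part of any tooling referenced in this report.

```python
from datetime import datetime, timedelta


def cache_expiry_bound(change_hhmm: str, old_ttl_s: int) -> str:
    """Latest time (HH:MM) a TTL-honoring cache could still serve the pre-change answer."""
    changed_at = datetime.strptime(change_hhmm, "%H:%M")
    return (changed_at + timedelta(seconds=old_ttl_s)).strftime("%H:%M")


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes from start to end, assuming both are on the same day."""
    start = datetime.strptime(start_hhmm, "%H:%M")
    end = datetime.strptime(end_hhmm, "%H:%M")
    return int((end - start).total_seconds() // 60)


# Change at 21:40 UTC; the record was previously served with a 300s TTL.
bound = cache_expiry_bound("21:40", 300)
# How long after that bound the 22:10 observation was made.
overrun = minutes_between(bound, "22:10")
print(bound, overrun)
```

Under these assumptions the old answer should have aged out of TTL-honoring caches by 21:45, yet some pods still held the old ALB IP at 22:10, 25 minutes later. That would point to caching that ignores the TTL (for example, nginx resolves hostnames in static upstream definitions when its configuration is loaded) rather than resolver caching. This is an inference to verify, not something the logs confirm.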
Investigation checklist
- Verify DNS resolution for internal-services.example.com from a us-west-2 pod.
kubectl exec -n <namespace> <pod-name> -- nslookup internal-services.example.com
Expected: Should return an IP in us-west-2; if NXDOMAIN or us-east-1 IP, confirms DNS issue.
- Check if the new NLB exists in us-west-2.
aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>
Expected: A LoadBalancerNotFound error (or an empty result) means the NLB does not exist in us-west-2.
- Check the DNS record set for internal-services.example.com in Route53.
aws route53 list-resource-record-sets --hosted-zone-id <zone-id> --query "ResourceRecordSets[?Name=='internal-services.example.com.']"
Expected: Should show CNAME pointing to new NLB DNS name; verify if alias target is in us-west-2.
- Confirm that the old ALB is still healthy and reachable.
curl -I http://<old-alb-dns>/health
Expected: HTTP 200 response.
- Check nginx error logs for upstream connection timeouts.
kubectl logs -n <namespace> -l app=api-gateway --tail=100 | grep 'upstream timed out'
Expected: Multiple lines showing connection timeouts to upstream.
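The first check above can also be scripted so that it runs the same way in every pod. This is a rough sketch, assuming the us-west-2 address range is known; the CIDR values and the stub resolver are hypothetical, and `socket.gaierror` covers any resolution failure, so it cannot distinguish NXDOMAIN from a resolver timeout.

```python
import ipaddress
import socket


def classify_resolution(hostname, expected_cidrs, resolve=socket.getaddrinfo):
    """Resolve hostname and report whether every returned IP is in an expected range.

    Returns (status, ips) where status is "ok", "unexpected_ip",
    or "resolution_failed".
    """
    try:
        infos = resolve(hostname, None)
    except socket.gaierror:
        return "resolution_failed", []
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address string.
    ips = sorted({info[4][0] for info in infos})
    networks = [ipaddress.ip_network(cidr) for cidr in expected_cidrs]
    all_expected = bool(ips) and all(
        any(ipaddress.ip_address(ip) in net for net in networks) for ip in ips
    )
    return ("ok" if all_expected else "unexpected_ip"), ips


def stub_resolver(host, port):
    """Hypothetical stand-in for a real lookup, so the example needs no network."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.20.1.5", 0))]


# Hypothetical CIDR: assume us-west-2 workloads live in 10.30.0.0/16.
status, ips = classify_resolution(
    "internal-services.example.com", ["10.30.0.0/16"], resolve=stub_resolver
)
print(status, ips)
```

Inside a pod, the `resolve` argument would be left at its default so the real system resolver is used. Passing the resolver in as a parameter is a design choice that keeps the classification logic testable without DNS access.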
Mitigation plan
Revert the DNS change: point internal-services.example.com back to the old ALB and restore TTL to 300s.
Risk: Low risk; old ALB is still healthy. May cause brief propagation delay.
Rollback: Re-apply the new DNS record if needed after fixing the NLB.
Alternatively, create a new NLB in us-west-2 and update the DNS alias target to point to the us-west-2 NLB.
Risk: Medium risk; creating NLB takes time and may have configuration errors. Requires careful testing.
Rollback: Revert DNS to old ALB if new NLB fails.
If an immediate fix is needed, add a static host entry in the api-gateway pods to bypass DNS (e.g., an /etc/hosts entry with the old ALB IP).
Risk: Low risk but not scalable; requires a manual per-pod update, and ALB IP addresses can change, so a pinned entry may go stale.
Rollback: Remove the static entry and revert DNS.
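For the first option (reverting DNS), the change can be prepared as a Route53 change batch ahead of time so that it is reviewed before being applied. This is a minimal sketch, assuming the record is a plain CNAME rather than an alias record; the old ALB DNS name shown is a hypothetical placeholder, and the helper name is illustrative.

```python
import json


def build_revert_batch(record_name: str, old_target: str, ttl_s: int = 300) -> dict:
    """Build a Route53 change batch that points record_name back at old_target."""
    return {
        "Comment": "Revert internal-services CNAME to legacy ALB",
        "Changes": [
            {
                # UPSERT replaces the existing record set or creates it if absent.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl_s,
                    "ResourceRecords": [{"Value": old_target}],
                },
            }
        ],
    }


batch = build_revert_batch(
    "internal-services.example.com.",
    "legacy-alb.example.invalid",  # hypothetical: substitute the real old ALB DNS name
)
print(json.dumps(batch, indent=2))
```

The printed JSON can be saved to a file such as revert.json and applied with `aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch file://revert.json`. If the record turns out to be an alias record, the batch would need an `AliasTarget` block instead of `TTL` and `ResourceRecords`.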
Customer impact
Approximately 35% of requests to the API gateway in us-west-2 are failing with HTTP 502 errors. Affected users experience timeouts or errors when accessing services. us-east-1 users are unaffected. ETA for full resolution is 30 minutes after mitigation action.
Postmortem draft
Postmortem: API Gateway 502 Errors in us-west-2
Summary: A DNS change (TTL reduction and CNAME target update) caused a partial outage in us-west-2 because the new NLB was only deployed in us-east-1.
Timeline:
- 21:40 UTC: DNS team changes TTL and CNAME for internal-services.example.com
- 22:08 UTC: Pager triggered
- 22:10 UTC: 502 rate observed at ~35% in us-west-2
- 22:11 UTC: DNS team confirms change
- 22:12 UTC: Mitigation options debated (revert DNS vs. hot-patch us-west-2 to point at the us-east-1 NLB)
- 22:XX UTC: Full recovery
Impact: 35% request failure rate in us-west-2 for ~X minutes.
Root Cause: New NLB not deployed in us-west-2; DNS change propagated before infrastructure was ready.
What went well: Fast detection, clear communication with DNS team.
What went poorly: No pre-flight check for multi-region readiness; change not reviewed by SRE.
Action Items:
- [ ] Add pre-deployment validation for DNS changes (check target existence in all regions)
- [ ] Implement canary DNS rollout with gradual TTL reduction
- [ ] Update change management process to require SRE approval for DNS changes affecting production
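The first action item (pre-deployment validation) could take roughly the following shape. This is a sketch under stated assumptions: the per-region inventory is presumed to have been gathered already (for example, from `aws elbv2 describe-load-balancers` run once per region), and the load balancer names and function name are hypothetical.

```python
def regions_missing_target(required_regions, lbs_by_region, target_name):
    """Return the sorted list of required regions where target_name is not deployed."""
    return sorted(
        region
        for region in required_regions
        if target_name not in lbs_by_region.get(region, ())
    )


# Hypothetical inventory mirroring this incident: the new NLB exists in one region only.
inventory = {
    "us-east-1": {"internal-services-nlb", "legacy-alb"},
    "us-west-2": {"legacy-alb"},
}

missing = regions_missing_target(
    ["us-east-1", "us-west-2"], inventory, "internal-services-nlb"
)
if missing:
    print("Block DNS change; target missing in:", ", ".join(missing))
else:
    print("Target present in all required regions")
```

A check like this would run before the DNS change is applied and fail the change if the list is non-empty.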
Follow-ups
- P0: Create new NLB in us-west-2 and update DNS alias to point to it. (owner: platform team)
- P1: Add pre-deployment validation script to check DNS target existence in all regions. (owner: platform team)
- P1: Review and update change management process for DNS modifications. (owner: service owner)
- P2: Implement monitoring for DNS resolution failures from application pods. (owner: on-call SRE)
- P2: Conduct postmortem meeting within 48 hours. (owner: service owner)
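The DNS monitoring follow-up above could start from a simple sliding-window failure rate computed inside each pod. This is an illustrative sketch only; the window size and the 5% alert threshold are assumptions, not values taken from this incident, and the class name is hypothetical.

```python
from collections import deque


class ResolutionFailureWindow:
    """Track the most recent DNS lookup outcomes and flag a high failure rate."""

    def __init__(self, size: int = 100, threshold: float = 0.05):
        # deque with maxlen drops the oldest outcome once the window is full.
        self.results = deque(maxlen=size)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        return self.failure_rate() > self.threshold


# Simulate a window where 35 of the last 100 lookups failed.
window = ResolutionFailureWindow(size=100, threshold=0.05)
for i in range(100):
    window.record(i >= 35)
print(window.failure_rate(), window.should_alert())
```

In practice the outcomes would come from periodic lookups of internal-services.example.com, and the rate would be exported as a metric rather than printed.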
Similar past incidents
lexical match (pg_trgm)
- 68%: [Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 61%: [Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 41%: [Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 40%: [Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 27%: [Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted