[Eval][v1][en] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:00:41 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A DNS change for internal-services.example.com at 21:40 UTC caused a 35% 502 error rate in us-west-2 api-gateway because the new CNAME target (NLB) only exists in us-east-1. Pods with cached old ALB IPs work; others fail with DNS resolution timeouts.
Severity reasoning: User-facing outage with 35% error rate and elevated latency (p99 4s) in us-west-2, affecting a significant portion of traffic. Revenue impact likely. Reversibility is high (revert DNS change), but immediate user impact qualifies as SEV1.
deepseek-chat·prompt v1·output: en·12153ms·1403↑ / 1497↓ tok·$0.00203
Root cause hypotheses
- high: The DNS CNAME target points to an NLB that does not exist in us-west-2, causing resolution failures for pods without cached old ALB IPs.
Evidence: DNS team changed the CNAME to a new NLB that was created only in us-east-1; us-west-2 pods get NXDOMAIN or stale IPs; nginx logs show "upstream timed out" alongside DNS resolution timeouts.
- high: The TTL reduction from 300s to 30s made cached entries expire faster, so the bad record propagated quickly and the failure rate rose as caches aged out.
Evidence: TTL changed at 21:40 UTC; pager fired at 22:08 UTC, 28 minutes later; ~30% of pods still hold the old (pre-change) cached IPs and work.
- low: The old ALB was decommissioned or its DNS records were removed. Ruled out: the old ALB is still up and answering, and its DNS records were not removed.
Evidence: Old ALB still up; pods with cached old IPs work fine.
Investigation checklist
- Verify DNS resolution for internal-services.example.com from a pod in us-west-2.
kubectl exec -n <namespace> <pod-name> -- nslookup internal-services.example.com
Expected: Should return the new NLB IP (us-east-1) or NXDOMAIN; if old ALB IP, pod has cached entry.
- Check if the new NLB exists in us-west-2.
aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>
Expected: A LoadBalancerNotFound error (or an empty result) means the NLB does not exist in us-west-2, confirming the root cause.
- Check nginx upstream configuration to see if it uses DNS for upstream resolution.
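kubectl exec -n <namespace> <pod-name> -- cat /etc/nginx/conf.d/default.conf | grep -nE 'resolver|upstream|proxy_pass'
Expected: Should show a 'resolver' directive and an upstream pointing to internal-services.example.com. If no 'resolver' directive is present, nginx resolves the name only at startup or reload, which would explain pods still using the old ALB IP.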
- Check the TTL of the DNS record from a pod.
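kubectl exec -n <namespace> <pod-name> -- dig +noall +answer internal-services.example.com
Expected: The second column of each answer line is the remaining TTL and should be at most 30s; a value above 30s means the resolver is still serving the pre-change record. (dig +short omits the TTL, so it cannot be used for this check.)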
- Confirm that us-east-1 pods resolve correctly.
kubectl exec -n <namespace> <pod-name> -c <container> -- nslookup internal-services.example.com
Expected: Run against a pod in the us-east-1 cluster; should return the new NLB IP in us-east-1.
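The checklist above asks, per pod, whether a lookup fails, returns the old ALB, or returns the new NLB. A minimal Python sketch of that classification follows. The function names and the label strings are illustrative assumptions, and the known old ALB and new NLB IP sets must be supplied by the operator. Note that `resolve_ipv4` goes through the operating system resolver, so it does not reflect any address nginx itself has cached.

```python
import socket
from typing import Iterable, Optional, Set


def resolve_ipv4(hostname: str) -> Optional[Set[str]]:
    """Resolve hostname to its IPv4 addresses; return None if resolution fails (e.g. NXDOMAIN)."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return None
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def classify(resolved: Optional[Iterable[str]],
             old_alb_ips: Set[str],
             new_nlb_ips: Set[str]) -> str:
    """Label a lookup result. The label names are hypothetical, chosen for this sketch."""
    ips = set(resolved or ())
    if not ips:
        return "resolution-failed"   # NXDOMAIN, timeout, or empty answer
    if ips & new_nlb_ips:
        return "new-nlb"             # record already points at the new NLB
    if ips & old_alb_ips:
        return "old-alb-stale"       # still pointing at the legacy ALB
    return "unknown"                 # an address in neither known set
```

Inside a pod, a call such as `classify(resolve_ipv4("internal-services.example.com"), old_ips, new_ips)` would give one label per pod, which can then be tallied against the ~30% working and ~70% failing split seen in the logs.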
Mitigation plan
Revert the DNS change: point internal-services.example.com back to the old ALB CNAME and restore TTL to 300s.
Risk: Low risk; old ALB is still operational. DNS propagation may take a few minutes.
Rollback: Re-apply the new DNS change if needed after fixing the NLB deployment.
Alternatively, create the new NLB in us-west-2 and update DNS to include both regions (e.g., using Route53 latency routing).
Risk: Takes longer; requires creating NLB and updating DNS; may cause additional errors during creation.
Rollback: Delete the new NLB in us-west-2 and revert DNS to old ALB.
Customer impact
Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 errors. Users experience timeouts or errors. us-east-1 is unaffected. ETA: mitigation within 15 minutes if DNS is reverted.
Postmortem draft
Postmortem: api-gateway 502 errors in us-west-2
Summary: DNS change for internal-services.example.com pointed to an NLB that only existed in us-east-1, causing 35% of requests in us-west-2 to fail.
Timeline:
- 21:40 UTC: DNS team changes TTL and CNAME for internal-services.example.com
- 22:08 UTC: Pager fires
- 22:10 UTC: 502 errors confirmed at ~35% of requests in us-west-2
- 22:11 UTC: DNS team confirms change
- 22:12 UTC: Mitigation debate: revert DNS or hot-patch us-west-2 to point at the us-east-1 NLB
Impact: 35% error rate, p99 latency 4s, us-west-2 only.
Root Cause: New NLB not deployed in us-west-2; DNS change propagated before infrastructure was ready.
What went well: Quick detection, clear logs, DNS team available.
What went poorly: No pre-flight check for NLB existence in all regions; TTL reduction accelerated failure.
Action Items:
- [ ] Add pre-deployment validation that DNS targets exist in all regions
- [ ] Implement canary DNS changes with gradual TTL reduction
- [ ] Update runbook for DNS-related incidents
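The first action item (validate that DNS targets exist in all regions) can be sketched as a small pre-flight check. This is a hedged illustration, not an existing tool: `missing_regions`, `inventory_lookup`, the `DEPLOYED` inventory, and the target names `new-nlb` and `legacy-alb` are all hypothetical. In practice the lookup would likely wrap a per-region query such as the `aws elbv2 describe-load-balancers` call from the investigation checklist.

```python
from typing import Callable, Iterable, List


def missing_regions(target: str,
                    required_regions: Iterable[str],
                    exists_in_region: Callable[[str, str], bool]) -> List[str]:
    """Return the required regions in which the DNS target is not deployed."""
    return [region for region in required_regions
            if not exists_in_region(target, region)]


# Hypothetical inventory mirroring this incident: the new NLB exists in us-east-1 only.
DEPLOYED = {
    "new-nlb": {"us-east-1"},
    "legacy-alb": {"us-east-1", "us-west-2"},
}


def inventory_lookup(target: str, region: str) -> bool:
    """Stand-in for a real per-region load balancer lookup (assumption)."""
    return region in DEPLOYED.get(target, set())


REQUIRED = ["us-east-1", "us-west-2"]
gaps = missing_regions("new-nlb", REQUIRED, inventory_lookup)
if gaps:
    # A change pipeline would refuse to apply the CNAME update at this point.
    print(f"BLOCK change: new-nlb missing in {', '.join(gaps)}")
```

Passing the lookup in as a function keeps the check testable without cloud credentials and lets the same gate run against any inventory source.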
Follow-ups
- P1: Add pre-deployment validation script to check DNS target existence in all regions. Owner: platform team
- P1: Implement canary DNS changes with gradual TTL reduction and monitoring. Owner: DNS team
- P2: Update incident runbook for DNS-related api-gateway issues. Owner: on-call SRE
- P2: Review change management process for DNS modifications. Owner: service owner
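One possible reading of "gradual TTL reduction" is to lower the TTL in stages rather than in a single 300s → 30s jump, pausing to check error rates between stages. The sketch below only computes such a schedule. The function name `ttl_steps` and the halving factor are assumptions made for illustration, not something the DNS team has specified.

```python
from typing import List


def ttl_steps(start: int, target: int, factor: int = 2) -> List[int]:
    """Return the successive TTL values when dividing by `factor` until `target` is reached."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    if target < 1:
        raise ValueError("target TTL must be at least 1 second")
    steps: List[int] = []
    current = start
    while current > target:
        # Never step below the target TTL.
        current = max(target, current // factor)
        steps.append(current)
    return steps
```

For the change in this incident, `ttl_steps(300, 30)` yields the stages 150, 75, 37, 30. At each stage an operator would wait at least the previous TTL so that caches drain, then confirm the 502 rate is still near baseline before continuing.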
Similar past incidents
lexical match (pg_trgm)
- 73%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 69%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 50%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 47%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 26%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts