[Eval][v3][en] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:03:58 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC - DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08 - pager
- 22:11 - DNS team confirms the planned change
- 22:12 - debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A DNS change at 21:40 UTC for internal-services.example.com (TTL reduced from 300s to 30s, CNAME repointed to a new NLB created only in us-east-1) caused about 35% of api-gateway requests in us-west-2 to fail with HTTP 502 and pushed p99 latency to 4s, mostly from DNS resolution timeouts. The new NLB has no us-west-2 record, so us-west-2 pods that re-resolve the name get NXDOMAIN or stale IPs, while the ~30% of pods that still have the old ALB IP cached continue to work. us-east-1 is unaffected.
Severity reasoning: User-facing outage: 35% error rate (502) for >5 minutes, exceeding the SEV1 threshold of >1% for >5 min. Revenue path is broken as api-gateway is the entry point for customer requests. Evidence: '35% 502 errors' and 'p99 4s' with 'nginx upstream errors: upstream timed out'.
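The threshold comparison above can be sketched as a small shell helper. This is a minimal sketch that assumes the SEV1 rule is exactly "error rate above 1% for more than 5 minutes" as quoted; `is_sev1` is a hypothetical name, and the duration passed in the usage line is illustrative.

```shell
#!/bin/sh
# is_sev1 RATE_PCT DURATION_MIN
# Hypothetical helper: exits 0 when the error rate (percent) is above 1
# and has lasted more than 5 minutes, mirroring the threshold quoted above.
is_sev1() {
  awk -v r="$1" -v d="$2" 'BEGIN { exit !(r > 1 && d > 5) }'
}

# Usage (values are illustrative):
#   if is_sev1 35 10; then echo "SEV1"; fi
```

awk is used for the comparison so that fractional rates such as the 0.05% baseline are handled without relying on shell integer arithmetic.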
deepseek-chat·prompt v3·output: en·16933ms·2035↑ / 2267↓ tok·$0.00304
Root cause hypotheses
- high: DNS change misconfiguration: new NLB not deployed in us-west-2, causing resolution failures for pods that fetch the new CNAME target.
Evidence: Logs show nginx in us-west-2 getting NXDOMAIN or stale IPs for internal-services.example.com. DNS team confirmed CNAME changed to new NLB which exists only in us-east-1.
- high: Stale DNS cache in some pods: ~30% of pods still have old ALB IP cached and work, while others fail due to TTL reduction causing rapid cache expiration.
Evidence: Logs indicate ~30% of pods have cached old ALB IP and are working; rest are failing. TTL was reduced from 300s to 30s, accelerating cache turnover.
- low: New NLB health check failure: new NLB in us-east-1 is unhealthy or misconfigured, causing requests to fail even when pods resolve the name correctly.
Evidence: No evidence of health check issues; us-east-1 pods resolve correctly and work fine. This hypothesis is unlikely given us-east-1 is normal.
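- low: Network connectivity issue between us-west-2 and new NLB in us-east-1: cross-region latency or firewall blocking connections.
Evidence: Failing us-west-2 pods get NXDOMAIN or stale IPs, so they never obtain the new NLB's address and the cross-region path is not exercised. The nginx "upstream timed out (110: Connection timed out) while connecting to upstream" errors are consistent with connection attempts to stale IPs and do not by themselves point to a firewall or latency problem. No evidence of network path issues.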
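To separate the hypotheses above, it can help to count how many nginx errors are resolution failures and how many are connect timeouts. The following is a minimal sketch: `classify_nginx_errors` is a hypothetical helper, and it assumes the two message fragments already quoted in this report ("could not be resolved" and "upstream timed out ... while connecting to upstream").

```shell
#!/bin/sh
# classify_nginx_errors: read nginx error-log lines on stdin and count
# resolver failures versus upstream connect timeouts.
# Hypothetical helper; the patterns are the message fragments quoted in this report.
classify_nginx_errors() {
  awk '
    /could not be resolved/ { dns++ }
    /upstream timed out/ && /while connecting to upstream/ { connect++ }
    END { printf "dns_failures=%d connect_timeouts=%d\n", dns, connect }
  '
}

# Usage (namespace and label are the ones used in the checklist commands):
#   kubectl logs -n prod -l app=api-gateway --since=30m | classify_nginx_errors
```

A high connect-timeout count with few resolver failures would suggest pods are connecting to stale addresses, while the reverse would point at NXDOMAIN responses.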
Investigation checklist
- Verify DNS resolution for internal-services.example.com from a pod in us-west-2.
kubectl exec -n prod deploy/api-gateway -- nslookup internal-services.example.com 2>&1 | head -20
Expected: On a failing pod, NXDOMAIN or an IP not in us-west-2 (e.g., the us-east-1 NLB IP). On a working pod, the old ALB IP.
- Check the DNS record set for internal-services.example.com in Route53.
aws route53 list-resource-record-sets --hosted-zone-id $(aws route53 list-hosted-zones --query 'HostedZones[?Name==`internal-services.example.com.`].Id' --output text) --query 'ResourceRecordSets[?Name==`internal-services.example.com.`]'
Expected: CNAME pointing to the new NLB DNS name. If the inner query returns no zone, the record probably lives in a parent zone (for example example.com.); pass that zone's ID instead. Then verify the NLB exists in us-west-2; if not, that's the root cause.
- Check NLB existence in us-west-2.
aws elbv2 describe-load-balancers --region us-west-2 --names 'new-nlb-name' 2>&1
Expected: If NLB not found, confirms missing us-west-2 NLB. If found, check its state and target group health. ('new-nlb-name' is a placeholder for the actual NLB name.)
- Check nginx upstream error logs for DNS resolution failures.
kubectl logs -n prod -l app=api-gateway --since=30m | grep -E 'upstream timed out|could not be resolved' | tail -20
Expected: Lines showing 'upstream timed out (110: Connection timed out) while connecting to upstream' or 'could not be resolved'.
- Check the percentage of pods with cached old ALB IP vs new NLB IP.
for pod in $(kubectl get pods -n prod -l app=api-gateway -o name); do echo "$pod"; kubectl exec -n prod "$pod" -- nslookup internal-services.example.com 2>&1 | grep Address; done 2>&1 | head -50
Expected: Mix of old ALB IPs and NXDOMAIN or new NLB IPs. ~30% should show old ALB IP. Run the loop from a workstation with cluster access, not inside a pod: the api-gateway container is unlikely to have kubectl installed.
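The per-pod check above yields one resolution result per pod; turning that into the "~30% cached" figure is simple counting. A minimal sketch follows, assuming the results have first been reduced to one `<pod> <ip-or-NXDOMAIN>` pair per line. `summarize_resolution` is a hypothetical helper, and the old ALB IPs passed to it are placeholders that would need to be looked up.

```shell
#!/bin/sh
# summarize_resolution OLD_ALB_IPS
#   OLD_ALB_IPS: comma-separated list of the legacy ALB's IPs (placeholder values).
#   stdin: one "<pod> <resolved-ip-or-NXDOMAIN>" pair per line.
# Prints how many pods still resolve to the old ALB, get NXDOMAIN, or get something else.
summarize_resolution() {
  awk -v old="$1" '
    BEGIN { n = split(old, a, ","); for (i = 1; i <= n; i++) is_old[a[i]] = 1 }
    NF >= 2 {
      total++
      if ($2 in is_old) old_alb++
      else if ($2 == "NXDOMAIN") nx++
      else other++
    }
    END {
      if (total == 0) { print "no pods"; exit }
      printf "total=%d old_alb=%d (%.0f%%) nxdomain=%d other=%d\n",
             total, old_alb, 100 * old_alb / total, nx, other
    }
  '
}

# Usage (IPs are illustrative):
#   summarize_resolution "10.0.1.10,10.0.1.11" < pod_results.txt
```

The "other" bucket would catch pods that resolved to the us-east-1 NLB or to any unexpected address, which is worth inspecting separately.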
Mitigation plan
Immediately revert the DNS change: set CNAME back to old ALB and restore TTL to 300s.
Risk: Low risk; reverts to known working state. Old ALB is still up and answering.
Rollback: Re-apply the DNS change (CNAME to new NLB, TTL 30s) if needed after fixing the NLB deployment.
If revert is not possible, create a new NLB in us-west-2 and update DNS to point to it, or front both regions with AWS Global Accelerator.
Risk: Medium risk; creating NLB takes time and may have configuration errors. DNS propagation delay.
Rollback: Delete the new NLB and revert DNS to old ALB.
Hot-patch us-west-2 pods to use a static IP or a local DNS override pointing to the old ALB IP.
Risk: Medium risk; manual changes to pod config may cause inconsistency, and ALB IP addresses are not static, so a pinned IP can go stale. Requires rolling restart.
Rollback: Revert the ConfigMap or environment variable and restart pods.
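For the first option (revert), the Route 53 change batch could be generated as sketched below. This assumes the record is a plain CNAME rather than an alias record; `revert_change_batch` is a hypothetical helper, and the old ALB DNS name and hosted zone ID are placeholders to be filled in by the operator.

```shell
#!/bin/sh
# revert_change_batch NAME TARGET TTL
# Hypothetical helper: prints a Route 53 change batch (JSON) that UPSERTs a
# CNAME record. Assumes a plain CNAME, not an alias record.
revert_change_batch() {
  cat <<EOF
{
  "Comment": "Revert CNAME target and TTL",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "$1",
        "Type": "CNAME",
        "TTL": $3,
        "ResourceRecords": [{ "Value": "$2" }]
      }
    }
  ]
}
EOF
}

# Usage (OLD_ALB_DNS_NAME and ZONE_ID are placeholders):
#   revert_change_batch internal-services.example.com. "$OLD_ALB_DNS_NAME" 300 > /tmp/revert.json
#   aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" \
#     --change-batch file:///tmp/revert.json
```

Writing the batch to a file first makes it reviewable before it is applied, and the same helper with the new NLB name and TTL 30 produces the re-apply batch mentioned under Rollback.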
Customer impact
Approximately 35% of requests to the API gateway in the US West (Oregon) region are failing with HTTP 502 errors. Affected users see 'Bad Gateway' errors and experience timeouts. The US East (Virginia) region is unaffected. Estimated impact: 35% of traffic in us-west-2; its share of overall traffic has not yet been quantified. No ETA for full resolution yet.
Postmortem draft
Summary
A DNS change for internal-services.example.com (TTL reduced from 300s to 30s, CNAME pointed to a new NLB created only in us-east-1) caused a partial outage in us-west-2. 35% of requests to api-gateway returned 502 errors due to DNS resolution failures.
Timeline (UTC)
- 21:40 - DNS team changes TTL and CNAME for internal-services.example.com
- 22:08 - Pager alert for elevated 502 errors in us-west-2
- 22:10 - 35% 502 rate confirmed in us-west-2; investigation begins
- 22:11 - DNS team confirms change
- 22:12 - Debate on mitigation: revert vs hot-patch
- [FILL IN] - Mitigation applied (revert or fix)
- [FILL IN] - Incident resolved
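Elapsed times between the timeline entries above can be computed with a short helper, which is useful when filling in time-to-detect and time-to-mitigate. This is a minimal sketch: `minutes_between` is a hypothetical name, and it assumes both timestamps are HH:MM on the same UTC day.

```shell
#!/bin/sh
# minutes_between START END
# Hypothetical helper: prints the number of minutes from START to END,
# both given as HH:MM on the same day.
minutes_between() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, ":"); split(b, y, ":")
    print (y[1] * 60 + y[2]) - (x[1] * 60 + x[2])
  }'
}

# Usage with entries from the timeline:
#   minutes_between 21:40 22:08   # DNS change to pager
#   minutes_between 22:08 22:12   # pager to mitigation debate
```

With the timeline values, the DNS change to pager interval comes out to 28 minutes and the pager to mitigation debate interval to 4 minutes.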
Impact
- 35% error rate (502) on api-gateway in us-west-2
- p99 latency 4s
- us-east-1 unaffected
- Estimated affected users: [FILL IN] based on traffic share
Root Cause
The new NLB was not deployed in us-west-2. Pods that resolved the new CNAME target received NXDOMAIN or stale IPs, causing connection timeouts. ~30% of pods with cached old ALB IPs continued to work.
Detection
Pager alert at 22:08 based on elevated error rate, 28 minutes after the 21:40 DNS change. When the error rate first rose is not recorded: [FILL IN].
Response
- [FILL IN] steps taken
What Went Well
- On-call engaged quickly once paged (page at 22:08, change confirmed by 22:11)
- DNS team was available to confirm the change
What Went Poorly
- DNS change was not validated in a staging environment
- No pre-change review for cross-region impact
- Rollback plan was not prepared
Action Items
- [ ] Ensure new NLB is deployed in all regions before DNS cutover
- [ ] Add pre-change validation checklist for DNS changes
- [ ] Implement canary DNS rollout with gradual traffic shift
- [ ] Add monitoring for DNS resolution failures per region
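The first two action items amount to a pre-cutover check: every region that serves traffic must already have the new load balancer. A minimal sketch of that comparison follows. `missing_regions` is a hypothetical helper, and the list of regions where the NLB exists would have to be gathered separately, for example by running `aws elbv2 describe-load-balancers --region <region> --names <nlb-name>` per region.

```shell
#!/bin/sh
# missing_regions "REQUIRED REGIONS" "PRESENT REGIONS"
# Hypothetical helper: prints each required region (space-separated list)
# that does not appear in the list of regions where the NLB exists.
missing_regions() {
  for r in $1; do
    case " $2 " in
      *" $r "*) ;;                # region already has the NLB
      *) printf '%s\n' "$r" ;;    # region is missing the NLB
    esac
  done
}

# Usage: block the DNS cutover if anything is printed.
#   missing=$(missing_regions "us-east-1 us-west-2" "us-east-1")
#   [ -z "$missing" ] || { echo "NLB missing in: $missing"; exit 1; }
```

Run against this incident's state (NLB present only in us-east-1), the check would report us-west-2 and stop the change.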
Follow-ups
- P0: Deploy new NLB in us-west-2 and update DNS to point to it, or revert DNS to old ALB permanently. Owner: platform-team
- P1: Add pre-change validation checklist for DNS changes requiring multi-region resource existence. Owner: platform-team
- P1: Implement canary DNS rollout with gradual traffic shift and automated rollback on error rate increase. Owner: platform-team
- P1: Add monitoring for DNS resolution failures per region and per service. Owner: observability-team
- P2: Review and update incident response runbook for DNS-related outages. Owner: on-call SRE
- P1: Conduct a postmortem within 48 hours with DNS team and platform team. Owner: incident-commander
Similar past incidents
lexical match (pg_trgm)
- 68% [Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 68% [Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 41% [Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 40% [Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 29% [Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out