[Eval][v2][en] Regional 5xx spike after DNS TTL change
service: api-gatewaycreated: 6/3/2026, 12:02:26 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected. Symptoms (us-west-2 only): - 502 rate: 35% (baseline 0.05%) - p99 latency: 4s (mostly DNS resolution timeouts in error trace) - nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream" Recent changes: - 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration - Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB) - Old ALB still up and answering, but its DNS records were not removed - New NLB was created in us-east-1 only — no us-west-2 record exists Logs: - nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs - ~30% of pods have cached the old ALB IP and are working; rest are failing - us-east-1 pods all resolve correctly to the new NLB On-call: - 22:08 — pager - 22:11 — DNS team confirms the planned change - 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A planned DNS change for internal-services.example.com at 21:40 UTC caused a partial outage in us-west-2, where the new NLB target only exists in us-east-1. Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 due to DNS resolution timeouts, as pods that resolve the new CNAME get NXDOMAIN or stale IPs. The incident is ongoing with a severity of SEV1 due to user-facing errors exceeding 1% for more than 5 minutes.
Severity reasoning: User-facing outage: 35% error rate (502) for >5 minutes, exceeding the SEV1 threshold of >1% error rate for >5 min. Revenue path is broken as api-gateway is the entry point for customer requests.
deepseek-chat·prompt v2·output: en·15109ms·1784↑ / 1954↓ tok·$0.00263
Root cause hypotheses
- highDNS change deployed new CNAME pointing to a NLB that does not exist in us-west-2, causing resolution failures for pods that query the new record.
Evidence: Logs show nginx in us-west-2 getting NXDOMAIN or stale IPs for internal-services.example.com. DNS team confirmed the change at 21:40 UTC.
- highStale DNS cache in some pods causes them to use old ALB IPs, while others fail; the 35% error rate matches the proportion of pods that have refreshed DNS.
Evidence: ~30% of pods are working (cached old ALB IP), rest failing. DNS TTL was reduced from 300s to 30s, accelerating cache refresh and causing more pods to hit the non-existent NLB.
- mediumThe new NLB in us-east-1 is misconfigured and not accepting traffic from us-west-2, causing cross-region latency or timeouts.
Evidence: Error traces show 'Connection timed out' during upstream connection, but us-east-1 pods work fine. Cross-region latency could cause timeouts if us-west-2 pods tried to reach us-east-1 NLB.
- lowA recent deployment or configuration change to api-gateway or its upstream services caused the 502 errors, unrelated to DNS.
Evidence: No other changes reported. The timing aligns exactly with the DNS change. us-east-1 unaffected, which would be unlikely if it were a code change.
Investigation checklist
- Check DNS resolution for internal-services.example.com from a pod in us-west-2.
kubectl exec -n prod -l app=api-gateway -- nslookup internal-services.example.com 2>&1 | head -20Expected: If failing, output shows NXDOMAIN or an IP in us-east-1 (not us-west-2). If working, shows a valid us-west-2 IP.
- Verify the new NLB exists in us-west-2.
aws elbv2 describe-load-balancers --region us-west-2 --names internal-services-nlb 2>&1Expected: If missing, command returns 'LoadBalancerNotFound'. If present, returns NLB details.
- Check nginx upstream error logs for DNS resolution failures.
kubectl logs -n prod -l app=api-gateway --since=30m | grep -iE 'upstream timed out|connection timed out|NXDOMAIN' | tail -20Expected: Should show 'upstream timed out (110: Connection timed out) while connecting to upstream' and possibly 'internal-services.example.com could not be resolved'.
- Compare DNS resolution from us-west-2 vs us-east-1 pods.
kubectl exec -n prod -l app=api-gateway -- nslookup internal-services.example.com 2>&1 | grep Address; kubectl exec -n prod -l app=api-gateway --context=us-east-1 -- nslookup internal-services.example.com 2>&1 | grep AddressExpected: us-west-2 returns NXDOMAIN or stale IP; us-east-1 returns the new NLB IP.
- Check if the old ALB is still serving and has the correct target group.
aws elbv2 describe-target-groups --region us-west-2 --load-balancer-arn <old-alb-arn> 2>&1 | jq '.TargetGroups[].TargetGroupArn'Expected: Old ALB should have healthy targets. If not, it may be draining.
Mitigation plan
Revert the DNS change: point internal-services.example.com back to the old ALB in us-west-2.
Risk: Low risk; reverts to known working state. May cause brief propagation delay (TTL 30s).
Rollback: Re-apply the new DNS record if needed after fixing the NLB deployment.
Create a new NLB in us-west-2 and update the DNS CNAME to point to it.
Risk: Medium risk; requires creating NLB and configuring target groups. May take 5-10 minutes. If misconfigured, could worsen outage.
Rollback: Revert DNS to old ALB as above.
Hot-patch us-west-2 pods to use the us-east-1 NLB endpoint directly via environment variable or configmap, bypassing DNS.
Risk: High risk; cross-region traffic increases latency and cost. May cause timeouts if not properly routed. Not a long-term fix.
Rollback: Revert the configmap or environment variable and restart pods.
Customer impact
Approximately 35% of requests to the API gateway in the US West (Oregon) region are failing with HTTP 502 errors. Users in that region may experience intermittent failures or timeouts when accessing services. The US East (Virginia) region is unaffected. Estimated affected users: 35% of traffic in us-west-2. No ETA for full resolution.
Postmortem draft
Summary
[FILL IN: 2-3 sentence summary]
Timeline (UTC)
- 21:40: DNS team changes TTL and CNAME for internal-services.example.com
- 22:08: Pager alert for 35% 502 errors in us-west-2
- 22:10: Incident declared
- 22:11: DNS team confirms planned change
- 22:12: Debate on mitigation approach
- [FILL IN: resolution time]
Impact
- 35% error rate (502) on api-gateway in us-west-2
- p99 latency 4s
- us-east-1 unaffected
- Estimated affected users: [FILL IN]
Root Cause
DNS change deployed a CNAME pointing to a NLB that only exists in us-east-1, causing DNS resolution failures for pods in us-west-2.
Detection
Pager alert triggered by elevated 502 error rate.
Response
[FILL IN: actions taken]
What Went Well
- Quick identification of DNS change as likely cause
- Clear communication from DNS team
What Went Poorly
- DNS change not validated across all regions before deployment
- No pre-change review or rollback plan
- Monitoring did not catch the issue immediately
Action Items
- [FILL IN: action items from follow_ups]
Follow-ups
- P0Implement pre-deployment validation for DNS changes to verify records exist in all regions.— DNS team
- P1Add monitoring for DNS resolution failures from application pods.— observability team
- P1Review and document rollback procedure for DNS changes.— DNS team
- P2Evaluate if cross-region NLB should be created proactively for failover.— infrastructure team
- P2Update incident response runbook for DNS-related outages.— SRE team
Similar past incidents
lexical match (pg_trgm)
- 69%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 62%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 41%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 40%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 28%
[Eval][v1][en] Order service OOM crashloop following v3.7 deploy
Pods OOM-killed every ~20min, restart loop, p99 latency degraded, ~3% requests timing out