[Eval][v2][en] Regional 5xx spike after DNS TTL change
service: api-gatewaycreated: 6/2/2026, 11:39:59 PM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected. Symptoms (us-west-2 only): - 502 rate: 35% (baseline 0.05%) - p99 latency: 4s (mostly DNS resolution timeouts in error trace) - nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream" Recent changes: - 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration - Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB) - Old ALB still up and answering, but its DNS records were not removed - New NLB was created in us-east-1 only — no us-west-2 record exists Logs: - nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs - ~30% of pods have cached the old ALB IP and are working; rest are failing - us-east-1 pods all resolve correctly to the new NLB On-call: - 22:08 — pager - 22:11 — DNS team confirms the planned change - 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A planned DNS change at 21:40 UTC misconfigured the CNAME for internal-services.example.com to point to a new NLB that only exists in us-east-1, causing 35% of requests in us-west-2 to fail with 502 errors due to DNS resolution failures. The incident is ongoing with a partial workaround via cached DNS.
Severity reasoning: User-facing outage with 35% error rate > 1% for >5 minutes, revenue path broken (502 errors on API gateway), and regional unavailability in us-west-2. Meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·11496ms·1784↑ / 1708↓ tok·$0.00236
Root cause hypotheses
- highDNS CNAME target points to a NLB that does not exist in us-west-2, causing NXDOMAIN for pods that resolve the new record.
Evidence: Logs show nginx in us-west-2 getting NXDOMAIN for internal-services.example.com; new NLB only created in us-east-1.
- mediumTTL reduction from 300s to 30s caused rapid propagation of the bad record, overwhelming pods with stale or missing DNS.
Evidence: DNS change at 21:40 UTC included TTL change; 35% failure rate correlates with DNS cache expiry.
- mediumOld ALB DNS records were not removed, causing split-brain DNS where some pods resolve to old ALB (working) and others to new NLB (failing).
Evidence: ~30% of pods working (cached old ALB IP); rest failing; old ALB still up but its DNS not removed.
- lowNetwork connectivity issue between us-west-2 and the new NLB (e.g., cross-region latency or security group).
Evidence: No evidence of connectivity issues; us-east-1 works fine; error is DNS resolution timeout, not connection timeout.
Investigation checklist
- Check DNS resolution for internal-services.example.com from a pod in us-west-2.
kubectl exec -n prod -l app=api-gateway -- nslookup internal-services.example.com 2>&1 | head -20Expected: Should show NXDOMAIN or an IP not matching the new NLB. If it shows the old ALB IP, the pod has cached DNS.
- Verify the new NLB exists in us-west-2.
aws elbv2 describe-load-balancers --region us-west-2 --names internal-services-nlb 2>&1Expected: Should return the NLB details. If error 'LoadBalancerNotFound', confirms missing NLB.
- Check DNS records for internal-services.example.com in Route53.
aws route53 list-resource-record-sets --hosted-zone-id ZONE_ID --query "ResourceRecordSets[?Name=='internal-services.example.com.']"Expected: Should show CNAME pointing to new NLB DNS name. Verify target exists in us-west-2.
- Check nginx error logs for upstream timeout details.
kubectl logs -n prod -l app=api-gateway --since=30m | grep 'upstream timed out' | head -10Expected: Should show 'Connection timed out while connecting to upstream' with IP addresses. Confirm those IPs are not from us-west-2 NLB.
- Check DNS cache TTL and resolution statistics on a failing pod.
kubectl exec -n prod -l app=api-gateway -- cat /etc/resolv.conf && kubectl exec -n prod -l app=api-gateway -- nslookup -debug internal-services.example.com 2>&1 | tail -20Expected: Should show TTL and resolution attempts. If NXDOMAIN, confirms DNS issue.
Mitigation plan
Revert DNS change: point internal-services.example.com CNAME back to the old ALB in us-west-2.
Risk: Low risk; old ALB is still operational. May cause brief DNS propagation delay.
Rollback: Re-apply the new CNAME target after fixing the NLB deployment.
Create a new NLB in us-west-2 and update DNS to point to it (if revert is not preferred).
Risk: Medium risk; requires creating NLB, configuring target groups, and updating DNS. May take 10-15 minutes.
Rollback: Revert DNS to old ALB if new NLB fails.
Hot-patch us-west-2 pods to use us-east-1 NLB endpoint via environment variable or configmap override.
Risk: Medium risk; cross-region latency may increase p99, and us-east-1 NLB may not be configured for us-west-2 traffic.
Rollback: Revert the configmap or environment variable change.
Customer impact
Approximately 35% of API requests to us-west-2 are failing with HTTP 502 errors. Users experience timeouts or errors when accessing services. us-east-1 users are unaffected. No ETA yet.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 21:40: DNS team changes TTL and CNAME for internal-services.example.com
- 22:08: Pager triggered for 502 errors in us-west-2
- 22:10: Incident declared
- 22:11: DNS team confirms change
- 22:12: Mitigation options discussed [FILL IN rest]
Impact
35% error rate on api-gateway in us-west-2, p99 latency 4s, affecting all downstream services.
Root Cause
DNS CNAME pointed to a NLB that did not exist in us-west-2, causing DNS resolution failures for pods that did not have cached old records.
Detection
Monitoring alert for 502 error rate spike.
Response
[FILL IN]
What Went Well
- Quick identification of DNS change as root cause
- Partial mitigation via cached DNS
What Went Poorly
- DNS change not validated across regions
- No pre-deployment check for NLB existence
Action Items
- [FILL IN]
Follow-ups
- P0Implement pre-deployment DNS validation to ensure target endpoints exist in all regions.— platform team
- P1Add monitoring for DNS resolution failures per service per region.— observability team
- P1Review DNS change process to require multi-region validation.— DNS team
- P2Create runbook for DNS-related incidents with rollback steps.— on-call SRE
- P2Evaluate using service mesh or internal DNS with health checks to avoid single points of failure.— platform team
Similar past incidents
lexical match (pg_trgm)
- 83%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 64%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 46%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 44%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 26%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts