[Eval][v2][en] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:02:05 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A planned DNS change for internal-services.example.com (TTL reduction and CNAME target switch) caused a partial outage in us-west-2, where the new NLB target does not exist. 35% of requests to api-gateway in us-west-2 return 502 errors due to DNS resolution failures, while us-east-1 remains unaffected.
Severity reasoning: User-facing outage: error rate >1% (35%) for >5 min, revenue path broken (api-gateway returns 502), regional unavailability in us-west-2. Fits SEV1 rubric.
Model run: deepseek-chat · prompt v2 · output: en · 12932ms · 1784↑ / 1683↓ tok · $0.00233
Root cause hypotheses
- (high) DNS CNAME target changed to a new NLB that does not exist in us-west-2, causing NXDOMAIN for pods that resolve after the TTL expires.
  Evidence: Logs show nginx in us-west-2 getting NXDOMAIN for internal-services.example.com; new NLB was created only in us-east-1.
- (high) Stale DNS cache in some pods (old ALB IP) allows ~65% of requests to succeed, while pods that re-resolve after TTL expiry fail.
  Evidence: ~30% of pods have cached the old ALB IP and work; the rest fail. TTL was reduced from 300s to 30s, accelerating cache expiry.
- (medium) DNS propagation delay or misconfiguration in us-west-2 DNS servers causes intermittent resolution failures.
  Evidence: NXDOMAIN observed, but us-east-1 resolves correctly, suggesting a regional DNS issue rather than a global misconfiguration.
- (low) The new NLB in us-east-1 is overloaded or misconfigured, causing timeouts when us-west-2 pods attempt to connect cross-region.
  Evidence: Error traces show 'Connection timed out' during upstream connection, but us-east-1 pods work fine, making cross-region latency unlikely as the primary cause.
Investigation checklist
- Check DNS resolution of internal-services.example.com from a pod in us-west-2.
  Command: kubectl exec -n prod -l app=api-gateway -- nslookup internal-services.example.com 2>&1 | head -20
  Expected: If NXDOMAIN or a wrong IP, confirms DNS misconfiguration.
- Verify the new NLB exists in us-west-2.
  Command: aws elbv2 describe-load-balancers --region us-west-2 --names internal-services-nlb 2>&1
  Expected: If 'LoadBalancerNotFound', confirms the NLB is missing in us-west-2.
- Check nginx upstream configuration and error logs for upstream IPs.
  Command: kubectl logs -n prod -l app=api-gateway --since=30m | grep -E 'upstream timed out|connect() failed' | head -20
  Expected: Should show connection attempts to IPs that are either the old ALB or NXDOMAIN.
- Compare DNS resolution from a us-east-1 pod.
  Command: kubectl exec -n prod -l app=api-gateway --context=us-east-1 -- nslookup internal-services.example.com 2>&1 | head -20
  Expected: Should return the new NLB IP (us-east-1).
- Check if the old ALB still has healthy targets in us-west-2.
  Command: aws elbv2 describe-target-health --region us-west-2 --load-balancer-arn <old-alb-arn> 2>&1
  Expected: If targets are healthy, the old ALB can serve traffic.
Mitigation plan
- Option 1: Revert the DNS change: point internal-services.example.com back to the old ALB CNAME in us-west-2 (or globally).
  Risk: Low; the old ALB is still operational. May cause a brief propagation delay.
  Rollback: Re-apply the new DNS record after verifying the NLB exists in us-west-2.
- Option 2: Create a new NLB in us-west-2 and update DNS to point to it (if a revert is not preferred).
  Risk: Medium; requires creating the NLB, configuring target groups, and updating DNS. Takes time.
  Rollback: Revert DNS to the old ALB.
- Option 3: Hot-patch us-west-2 pods to use the us-east-1 NLB IP directly via /etc/hosts or an environment variable.
  Risk: Medium; cross-region latency may increase, and manual changes need rollback after the fix.
  Rollback: Remove the host override and revert to DNS.
Customer impact
Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 errors. Users in that region may experience service unavailability or timeouts. us-east-1 users are unaffected. No data loss expected.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 21:40 — DNS team changes TTL and CNAME for internal-services.example.com
- 22:08 — Pager alert for 502 errors in us-west-2
- 22:10 — Incident declared
- 22:11 — DNS team confirms change
- 22:12 — Mitigation discussion
- [FILL IN] — Mitigation applied
Impact
- 35% error rate on api-gateway in us-west-2 for ~[FILL IN] minutes
- p99 latency 4s
- Estimated [FILL IN] users affected
Root Cause
DNS CNAME target changed to a new NLB that only exists in us-east-1, causing NXDOMAIN in us-west-2. Reduced TTL accelerated cache expiry, increasing failure rate.
Detection
Pager alert from monitoring (error rate threshold exceeded).
Response
[FILL IN]
What Went Well
- Quick identification of recent DNS change
- us-east-1 unaffected
What Went Poorly
- DNS change not validated across regions
- No pre-deployment check for NLB existence in all regions
Action Items
- [FILL IN]
Follow-ups
- P0: Add pre-deployment validation for DNS changes to ensure target resources exist in all regions. (owner: platform team)
- P1: Implement canary DNS rollout with gradual TTL reduction. (owner: DNS team)
- P1: Add monitoring for DNS resolution failures per region. (owner: observability team)
- P2: Review incident response time and improve the runbook for DNS-related issues. (owner: on-call SRE)
- P1: Create a cross-region NLB deployment checklist. (owner: infrastructure team)
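The P0 follow-up (validate that a DNS change's target exists in every region) and the P1 canary follow-up (do not bundle a TTL cut with a target switch) can be expressed as a single pre-deployment gate. The block below is an added example, a hypothetical tool rather than anything the platform or DNS team ships. It assumes a per-region resolver is available as a callable (`resolve(region, hostname)`), and the DNS view, hostnames and addresses it uses are constructed to mirror this incident: the new NLB name is published only in us-east-1 while the legacy ALB resolves everywhere.

```python
"""Pre-deployment gate for a DNS CNAME change (added example, hypothetical tool).

Two checks, matching the follow-ups above:
  1. block the rollout if the new target returns NXDOMAIN from any region the
     record serves (the us-west-2 failure mode in this incident);
  2. warn when a TTL cut and a target switch are bundled into one change,
     since the shorter TTL only accelerates clients onto a broken target.
Resolution is injected as a callable so the check runs offline here; backing
it with a resolver pinned to each region's DNS is an assumption, not
something the incident record specifies.
"""
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class DnsChange:
    name: str                  # record being changed
    old_target: str            # current CNAME target
    new_target: str            # proposed CNAME target
    old_ttl: int               # seconds
    new_ttl: int               # seconds
    regions: Tuple[str, ...]   # every region whose clients resolve this name


@dataclass(frozen=True)
class Finding:
    level: str     # "block" stops the rollout; "warn" needs an explicit override
    region: str    # "*" when the finding is not region-specific
    message: str


# resolve(region, hostname) -> addresses; an empty list means NXDOMAIN.
Resolver = Callable[[str, str], List[str]]


def validate_change(change: DnsChange, resolve: Resolver) -> List[Finding]:
    """Return every finding for the proposed change; empty means safe to apply."""
    findings: List[Finding] = []
    for region in change.regions:
        if not resolve(region, change.new_target):
            findings.append(Finding(
                "block", region,
                f"{change.new_target} returns NXDOMAIN from {region}; clients there "
                f"will start failing once their {change.old_ttl}s cache expires"))
    target_switch = change.new_target != change.old_target
    ttl_cut = change.new_ttl < change.old_ttl
    if target_switch and ttl_cut:
        findings.append(Finding(
            "warn", "*",
            f"TTL cut ({change.old_ttl}s -> {change.new_ttl}s) and target switch in one "
            f"change; lower the TTL first, wait at least {change.old_ttl}s for old caches "
            f"to drain, then switch the target as a separate step"))
    return findings


def blocking_regions(findings: List[Finding]) -> List[str]:
    return [f.region for f in findings if f.level == "block"]


def approved(findings: List[Finding]) -> bool:
    return not blocking_regions(findings)


# Constructed per-region DNS view mirroring the incident. Hostnames and
# addresses are made up; the shape is what matters: the new NLB name has
# no us-west-2 record, the legacy ALB name resolves in both regions.
FAKE_DNS: Dict[Tuple[str, str], List[str]] = {
    ("us-east-1", "legacy-alb.example.internal"): ["10.1.0.10", "10.1.0.11"],
    ("us-west-2", "legacy-alb.example.internal"): ["10.2.0.10", "10.2.0.11"],
    ("us-east-1", "new-nlb.example.internal"): ["10.1.5.20", "10.1.5.21"],
    # no ("us-west-2", "new-nlb.example.internal") entry -> NXDOMAIN
}


def fake_resolve(region: str, hostname: str) -> List[str]:
    return FAKE_DNS.get((region, hostname), [])


# The change as it was actually applied at 21:40 UTC: TTL 300s -> 30s and
# CNAME switched from the legacy ALB to the new NLB in a single step.
INCIDENT_CHANGE = DnsChange(
    name="internal-services.example.com",
    old_target="legacy-alb.example.internal",
    new_target="new-nlb.example.internal",
    old_ttl=300,
    new_ttl=30,
    regions=("us-east-1", "us-west-2"),
)


def check_incident_change() -> List[Finding]:
    return validate_change(INCIDENT_CHANGE, fake_resolve)


if __name__ == "__main__":
    results = check_incident_change()
    for f in results:
        print(f"[{f.level}] {f.region}: {f.message}")
    print("approved" if approved(results) else "rejected")
```

Run against the incident's change the gate rejects it with a block for us-west-2 and a warning about the bundled TTL cut; the same TTL cut submitted on its own (same target, 300s to 30s) passes cleanly, which is the first step of the canary rollout the P1 follow-up asks for.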
Similar past incidents
(lexical match, pg_trgm)
- 64%: [Eval][v2][en] Regional 5xx spike after DNS TTL change
  us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 63%: [Eval][v1][en] Regional 5xx spike after DNS TTL change
  us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 46%: [Eval][v1][zh] Regional 5xx spike after DNS TTL change
  us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 43%: [Eval][v2][zh] Regional 5xx spike after DNS TTL change
  us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 26%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
  p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts