[Eval][v3][en] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/2/2026, 11:40:28 PM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC - DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08 - pager
- 22:11 - DNS team confirms the planned change
- 22:12 - debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A DNS change at 21:40 UTC for internal-services.example.com (TTL reduced from 300s to 30s, CNAME target switched from the legacy ALB to a new NLB) caused a regional outage in us-west-2. The new NLB was created only in us-east-1 and no us-west-2 record exists, so us-west-2 pods that re-resolve the name get NXDOMAIN or stale IPs, producing a 35% 502 rate (baseline 0.05%) and a p99 latency of 4s. The old ALB is still up and its DNS records were not removed; the ~30% of pods that still hold a cached old ALB IP continue to work.
Severity reasoning: User-facing outage: 35% error rate (exceeds 1% threshold) for >5 minutes, affecting a major region (us-west-2). Revenue path likely broken due to high error rate. This meets SEV1 criteria.
deepseek-chat·prompt v3·output: en·10540ms·2035↑ / 1585↓ tok·$0.00229
Root cause hypotheses
- high: New NLB not deployed in us-west-2, causing DNS resolution failures for pods that flush cache
Evidence: Logs show nginx in us-west-2 getting NXDOMAIN or stale IPs for internal-services.example.com. us-east-1 works because NLB exists there.
- high: DNS TTL reduction from 300s to 30s caused rapid cache expiration, exposing the missing NLB in us-west-2
Evidence: Change occurred at 21:40 UTC; incident started around 22:10 UTC, consistent with TTL expiration window. 30% of pods still working (cached old ALB IP).
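- medium: Old ALB DNS records were removed or unpublished, preventing fallback resolution
Evidence: The context states the old ALB's DNS records 'were not removed', yet us-west-2 lookups return NXDOMAIN. The two statements conflict unless the NXDOMAIN comes from the missing us-west-2 record for the new NLB rather than from the ALB's own records. Needs verification.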
- low: DNS propagation delay causing inconsistent resolution across pods
Evidence: Some pods have cached old IPs and work; others fail. This is consistent with gradual cache expiration, not propagation delay.
Investigation checklist
- Verify DNS resolution for internal-services.example.com from us-west-2 pods
kubectl exec -n prod "$(kubectl get pods -n prod -l app=api-gateway -o name | head -n 1)" -- nslookup internal-services.example.com 2>&1 | head -20
Expected: Should return NXDOMAIN or an IP not in us-west-2 (e.g., us-east-1 NLB IP). If the old ALB IP appears, the cache is stale.
- Check if new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name 2>&1
Expected: Should return a 'LoadBalancerNotFound' error if not deployed.
- Check nginx error logs for upstream connection failures
kubectl logs -n prod -l app=api-gateway --since=30m | grep -i 'upstream timed out' | tail -20
Expected: Multiple entries showing connection timeouts to upstream.
- Check DNS TTL configuration for internal-services.example.com
dig internal-services.example.com +noall +answer 2>&1
Expected: The TTL column should show at most 30s (new value); a value between 30s and 300s means a record with the old TTL is still cached.
- Verify old ALB is still operational
aws elbv2 describe-target-health --region us-west-2 --target-group-arn <old-alb-tg-arn> 2>&1
Expected: Targets should be healthy if ALB is still up.
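The first checklist item looks at a single pod, but the logs suggest a split (roughly 30% of pods working from cached IPs, the rest failing). A minimal sketch of a per-pod tally follows. It assumes the `prod` namespace and `app=api-gateway` label used in the commands above, a kubectl context pointed at the us-west-2 cluster, and BIND-style `nslookup` output inside the container; the function names are illustrative only.

```shell
# Sketch: tally DNS resolution results across api-gateway pods.
# Assumptions: namespace "prod", label app=api-gateway, BIND-style nslookup output.

HOST="internal-services.example.com"

# Classify one nslookup output (read from stdin) as nxdomain, resolved, or other.
classify_lookup() {
  out="$(cat)"
  if printf '%s\n' "$out" | grep -q 'NXDOMAIN'; then
    echo "nxdomain"
  elif printf '%s\n' "$out" |
       awk '/^Name:/ {seen=1} seen && /^Address/ {found=1} END {exit !found}'; then
    # An Address line after a Name line means the name resolved.
    echo "resolved"
  else
    # Timeouts, SERVFAIL, or a missing nslookup binary end up here.
    echo "other"
  fi
}

# Summarize "pod status" lines read from stdin into one line of counts.
summarize() {
  awk '{ n[$2]++; total++ }
       END { printf "resolved=%d nxdomain=%d other=%d total=%d\n",
             n["resolved"], n["nxdomain"], n["other"], total }'
}

# Run the lookup in every matching pod and print one "pod status" line each.
check_pods() {
  for pod in $(kubectl get pods -n prod -l app=api-gateway -o name); do
    echo "$pod $(kubectl exec -n prod "$pod" -- nslookup "$HOST" 2>&1 | classify_lookup)"
  done
}
```

Running `check_pods | summarize` prints a single line of counts. Note that "resolved" only means the name returned an address; whether that address is the old ALB or the us-east-1 NLB still has to be read from the raw `nslookup` output.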
Mitigation plan
Revert DNS change: point internal-services.example.com back to old ALB CNAME with original TTL (300s)
Risk: Low risk; old ALB is still up. May cause brief propagation delay.
Rollback: Re-apply the new DNS change after fixing NLB deployment.
Deploy new NLB in us-west-2 and update DNS to point to it
Risk: Medium risk; takes time to create NLB and update DNS. During creation, outage continues.
Rollback: Revert DNS to old ALB if new NLB fails.
Hot-patch us-west-2 pods to use us-east-1 NLB IP directly via /etc/hosts or environment variable
Risk: Medium risk; cross-region latency increase, potential for asymmetric routing. Requires pod restart.
Rollback: Remove the patch and revert to DNS resolution.
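If the DNS revert is chosen, it helps to confirm from us-west-2 that the record actually points back at the old ALB before closing the incident. The sketch below parses the output of `dig NAME +noall +answer`. The old ALB hostname is not given in the incident record, so it is passed in as a placeholder argument; the function names are illustrative.

```shell
# Sketch: confirm a DNS revert from dig answer output.
# Input is what `dig NAME +noall +answer` prints, one record per line:
#   name.  TTL  CLASS  TYPE  target.
# The expected CNAME target is a placeholder supplied by the caller.

# Succeeds if stdin contains a CNAME record whose target equals $1.
cname_points_to() {
  awk -v want="$1" '
    $4 == "CNAME" { t = $5; sub(/\.$/, "", t); if (t == want) ok = 1 }
    END { exit !ok }'
}

# Prints the TTL of the first answer record on stdin.
first_ttl() {
  awk 'NF >= 5 { print $2; exit }'
}

# Usage (from a us-west-2 host or pod, with OLD_ALB_NAME set by the operator):
#   dig internal-services.example.com +noall +answer | cname_points_to "$OLD_ALB_NAME"
#   dig internal-services.example.com +noall +answer | first_ttl
```

A caching resolver reports the remaining TTL, so after the revert any value at or below 300 is consistent with the restored record.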
Customer impact
Approximately 35% of requests to us-west-2 are failing with HTTP 502 errors. Users experience timeouts or errors when accessing services. us-east-1 users are unaffected. No ETA yet.
Postmortem draft
Summary
A planned DNS migration caused a regional outage in us-west-2 due to missing NLB deployment in that region.
Timeline (UTC)
- 21:40 - DNS team changes TTL and CNAME for internal-services.example.com
- 22:08 - Pager alerts for elevated 502 errors
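- 22:10 - Incident declared
- 22:11 - DNS team confirms the planned change
- 22:12 - Debate: revert DNS or hot-patch us-west-2 to point at the us-east-1 NLB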
- [FILL IN] - Mitigation applied
Impact
- 35% error rate in us-west-2 for ~[FILL IN] minutes
- p99 latency 4s
- [FILL IN] users affected
Root Cause
New NLB was only created in us-east-1. DNS change caused us-west-2 pods to fail resolution after cache expired.
Detection
Monitoring alerted on 502 error rate spike.
Response
- [FILL IN] actions taken
What Went Well
- Monitoring detected quickly
- DNS team available for context
What Went Poorly
- No pre-deployment verification of NLB in all regions
- DNS change not coordinated with infrastructure team
Action Items
- [FILL IN]
Follow-ups
- P0: Ensure new NLB is deployed in all regions before DNS cutover (owner: infrastructure team)
- P1: Add pre-change validation checklist for DNS migrations (owner: platform team)
- P1: Improve monitoring to detect regional DNS resolution failures (owner: observability team)
- P2: Review DNS change process to require multi-region verification (owner: SRE team)
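The P0 and P1 items could be backed by a small pre-cutover gate. The following is a rough sketch, not an agreed process: it uses the same `aws elbv2 describe-load-balancers` call as the investigation checklist and fails if the load balancer is missing in any required region. The load balancer name and region list are placeholders supplied by the caller.

```shell
# Sketch: pre-cutover check that a load balancer exists in every required region.
# The load balancer name and the region list are placeholders, not real values.

# Succeeds if a load balancer named $2 exists in region $1.
lb_exists() {
  aws elbv2 describe-load-balancers --region "$1" --names "$2" >/dev/null 2>&1
}

# Usage: missing_regions LB_NAME REGION...
# Prints each region where the load balancer was not found.
# Returns non-zero if any region is missing.
missing_regions() {
  name="$1"
  shift
  missing=0
  for region in "$@"; do
    if ! lb_exists "$region" "$name"; then
      echo "$region"
      missing=1
    fi
  done
  return "$missing"
}

# Example gate before a DNS cutover:
#   missing_regions new-nlb-name us-east-1 us-west-2 || echo "cutover blocked"
```

The check fails closed: a credentials or network error from the AWS CLI is reported the same way as a missing load balancer, which blocks the cutover until someone looks.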
Similar past incidents
lexical match (pg_trgm)
- 80%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 61%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 42%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 42%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 27%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted