[Eval][v1][en] Regional 5xx spike after DNS TTL change
service: api-gatewaycreated: 6/3/2026, 12:00:58 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected. Symptoms (us-west-2 only): - 502 rate: 35% (baseline 0.05%) - p99 latency: 4s (mostly DNS resolution timeouts in error trace) - nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream" Recent changes: - 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration - Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB) - Old ALB still up and answering, but its DNS records were not removed - New NLB was created in us-east-1 only — no us-west-2 record exists Logs: - nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs - ~30% of pods have cached the old ALB IP and are working; rest are failing - us-east-1 pods all resolve correctly to the new NLB On-call: - 22:08 — pager - 22:11 — DNS team confirms the planned change - 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A DNS change for internal-services.example.com at 21:40 UTC caused a 35% 502 error rate and p99 latency spike to 4s in us-west-2 api-gateway. The new CNAME target (NLB) only exists in us-east-1, leaving us-west-2 pods unable to resolve the service. Pods with cached old ALB IPs continue to work, while others fail with DNS resolution timeouts.
Severity reasoning: User-facing outage with 35% error rate and significant latency degradation in a major region (us-west-2). Revenue impact likely. Scope is regional but affects a large percentage of traffic. Reversibility is high (revert DNS or add us-west-2 record), but immediate action required.
deepseek-chat·prompt v1·output: en·10731ms·1403↑ / 1371↓ tok·$0.00189
Root cause hypotheses
- highDNS CNAME target points to NLB that does not exist in us-west-2
Evidence: New NLB was created only in us-east-1; us-west-2 pods get NXDOMAIN or stale IPs for internal-services.example.com
- mediumDNS TTL reduction from 300s to 30s caused rapid propagation of bad records
Evidence: TTL change was made at 21:40 UTC, incident started shortly after; 30% of pods still have cached old ALB IPs
- lowOld ALB DNS records not removed causing split-brain resolution
Evidence: Old ALB still up but its DNS records were not removed; some pods may resolve to old ALB (working) while others get new NLB (failing)
Investigation checklist
- Verify DNS resolution for internal-services.example.com from a us-west-2 pod
kubectl exec -n <namespace> <pod-name> -- nslookup internal-services.example.comExpected: Should return an IP in us-west-2; currently likely returns NXDOMAIN or an IP in us-east-1
- Check if the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>Expected: Should return load balancer details; currently likely returns empty or error indicating no NLB in us-west-2
- Check nginx upstream error logs for DNS resolution failures
kubectl logs -n <namespace> -l app=api-gateway --tail=100 | grep 'upstream timed out'Expected: Lines showing 'upstream timed out (110: Connection timed out) while connecting to upstream'
- Confirm us-east-1 pods resolve correctly
kubectl exec -n <namespace> -l app=api-gateway --region us-east-1 -- nslookup internal-services.example.comExpected: Returns the new NLB IP in us-east-1
Mitigation plan
Revert DNS change: point internal-services.example.com back to the old ALB CNAME and restore TTL to 300s
Risk: Temporary increase in DNS traffic due to TTL change, but low risk; old ALB is still operational
Rollback: Re-apply the new DNS record (CNAME to new NLB) if needed
If revert is not possible, create a new NLB in us-west-2 and update DNS to point to that NLB
Risk: Time to create NLB and propagate DNS; may take 5-10 minutes; new NLB must be configured correctly
Rollback: Delete the new NLB and revert DNS to old ALB
Customer impact
Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 errors. Affected users experience timeouts or errors. us-east-1 users are unaffected. ETA for full resolution is 15-30 minutes pending DNS propagation.
Postmortem draft
Postmortem: api-gateway 502 errors in us-west-2
Summary: [2-3 sentences]
Timeline:
- 21:40 UTC: DNS team changes TTL and CNAME for internal-services.example.com
- 22:08 UTC: Pager alert for 502 errors
- 22:10 UTC: Incident declared
- [Mitigation time]: DNS reverted / NLB created
Impact: 35% error rate, p99 latency 4s, us-west-2 only
Root Cause: New NLB only existed in us-east-1; us-west-2 pods could not resolve
What went well: Quick detection, clear logs, DNS team available
What went poorly: No pre-validation of DNS change across regions, no canary
Action Items:
- [ ] Add pre-change validation to check DNS targets exist in all regions
- [ ] Implement canary DNS rollout
- [ ] Add monitoring for DNS resolution failures per region
Follow-ups
- P0Add pre-change validation to ensure DNS targets exist in all required regions— platform team
- P1Implement canary DNS rollout with gradual TTL reduction— DNS team
- P1Add monitoring for DNS resolution failures per region in api-gateway— on-call SRE
- P2Review DNS change process to require cross-region validation— service owner
Similar past incidents
lexical match (pg_trgm)
- 75%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 70%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 45%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 43%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 30%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts