[Eval][v1][en] Regional 5xx spike after DNS TTL change
service: api-gatewaycreated: 6/3/2026, 12:00:11 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected. Symptoms (us-west-2 only): - 502 rate: 35% (baseline 0.05%) - p99 latency: 4s (mostly DNS resolution timeouts in error trace) - nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream" Recent changes: - 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration - Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB) - Old ALB still up and answering, but its DNS records were not removed - New NLB was created in us-east-1 only — no us-west-2 record exists Logs: - nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs - ~30% of pods have cached the old ALB IP and are working; rest are failing - us-east-1 pods all resolve correctly to the new NLB On-call: - 22:08 — pager - 22:11 — DNS team confirms the planned change - 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A DNS change for internal-services.example.com at 21:40 UTC caused a partial outage in us-west-2: the new CNAME target (NLB) only exists in us-east-1, so ~65% of api-gateway pods in us-west-2 fail to resolve the new target and return 502 errors. The remaining 35% still have cached old ALB IPs and work. p99 latency spikes to 4s due to DNS timeouts.
Severity reasoning: User-facing outage: 35% of requests to api-gateway in us-west-2 return HTTP 502, impacting a significant subset of users. Revenue impact likely. Full outage of a critical system component (DNS resolution for internal service).
deepseek-chat·prompt v1·output: en·9893ms·1403↑ / 1433↓ tok·$0.00196
Root cause hypotheses
- highDNS CNAME target points to an NLB that does not exist in us-west-2, causing NXDOMAIN for pods that flush their cache.
Evidence: DNS team changed CNAME to new NLB at 21:40; new NLB only exists in us-east-1; nginx logs show NXDOMAIN or stale IPs for internal-services.example.com in us-west-2.
- highTTL reduction from 300s to 30s caused rapid cache expiration, exposing the missing record in us-west-2.
Evidence: TTL changed at same time as CNAME; 35% of pods still have old cached IP (working), rest fail; consistent with TTL-driven cache expiry.
- lowDNS propagation delay: the new NLB record hasn't propagated to us-west-2 DNS resolvers yet.
Evidence: us-west-2 pods get NXDOMAIN; us-east-1 works; but propagation should be fast for a CNAME change; less likely given 30 min elapsed.
Investigation checklist
- Verify DNS resolution from a failing pod in us-west-2.
kubectl exec -n api-gateway <pod-name> -- nslookup internal-services.example.comExpected: NXDOMAIN or IP of new NLB (which is in us-east-1, not us-west-2).
- Check if the new NLB exists in us-west-2.
aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>Expected: Empty result or error indicating NLB not found in us-west-2.
- Check DNS records for internal-services.example.com.
dig internal-services.example.com @<us-west-2-dns-resolver>Expected: CNAME pointing to new NLB DNS name (which resolves to us-east-1 IPs).
- Confirm old ALB still exists and is healthy.
aws elbv2 describe-load-balancers --region us-west-2 --names <old-alb-name>Expected: ALB exists and is active.
Mitigation plan
Revert the DNS change: point internal-services.example.com back to the old ALB in us-west-2.
Risk: Low risk; reverts to known working state. May cause brief propagation delay.
Rollback: Re-apply the new CNAME if needed after fixing the NLB deployment.
Alternatively, create a new NLB in us-west-2 and update the CNAME to point to it (if NLB is required).
Risk: Medium risk; requires creating NLB and updating DNS; may take longer.
Rollback: Revert DNS to old ALB.
Customer impact
Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 errors. Users may experience timeouts or errors when accessing services that route through api-gateway. us-east-1 users are unaffected. ETA: immediate mitigation via DNS revert expected within 5 minutes.
Postmortem draft
Postmortem: api-gateway 502 errors in us-west-2
Summary
Brief description of incident.
Timeline
- 21:40 UTC: DNS team changes TTL and CNAME for internal-services.example.com
- 22:08 UTC: Pager alert for 502 errors
- 22:10 UTC: Incident declared
- 22:12 UTC: Root cause identified
- [Mitigation time]: DNS reverted
Impact
- 35% of requests to api-gateway in us-west-2 returned 502
- p99 latency 4s
- Duration: [X minutes]
Root Cause
DNS CNAME pointed to an NLB that did not exist in us-west-2.
What Went Well
- Monitoring detected the issue quickly
- DNS team available to confirm change
What Went Poorly
- No pre-deployment validation of DNS target existence in all regions
- TTL reduction exacerbated impact
Action Items
- [ ] Add pre-change validation for DNS targets (P0)
- [ ] Implement canary DNS changes (P1)
- [ ] Review TTL change process (P2)
Follow-ups
- P0Add pre-change validation script to check DNS target existence in all regions.— platform team
- P1Implement canary DNS changes with gradual rollout.— platform team
- P1Review TTL change process and add approval for reductions below 60s.— DNS team
- P2Update runbook for DNS-related incidents with specific rollback steps.— on-call SRE
- P2Add monitoring for DNS resolution failures per region.— platform team
Similar past incidents
lexical match (pg_trgm)
- 91%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 66%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 44%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 43%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 29%
[Eval][v2][en] Third-party payment gateway timeouts cascade into checkout outage
Checkout success rate dropped from 99.5% to 22%, p99 latency 28s (hitting 30s timeout), thread pool exhausted