[Eval][v2][en] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 5/25/2026, 10:23:03 PM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A planned DNS change at 21:40 UTC repointed the CNAME for internal-services.example.com to a new NLB that exists only in us-east-1. As of 22:10 UTC, about 35% of api-gateway requests in us-west-2 are failing with HTTP 502 (baseline 0.05%) and p99 latency has risen to 4s, mostly from DNS resolution timeouts. us-east-1 is unaffected. The incident is ongoing.
Severity reasoning: User-facing outage with error rate >1% (35%) for >5 minutes, meeting SEV1 criteria. Revenue path is likely broken for affected users.
deepseek-chat·prompt v2·output: en·10807ms
Root cause hypotheses
- high: DNS CNAME misconfiguration: the new NLB exists only in us-east-1, so us-west-2 queries return NXDOMAIN
Evidence: Logs show nginx in us-west-2 getting NXDOMAIN or stale IPs for internal-services.example.com; DNS team confirmed the CNAME was changed to the new NLB, which is us-east-1 only
- medium: Stale DNS cache in some pods: pods that still hold the old ALB IP work; the rest fail
Evidence: ~30% of pods working (cached old IP), rest failing. The TTL was reduced from 300s to 30s at 21:40, so a cache honoring the old 300s TTL would have expired by about 21:45. Pods still using the old IP at 22:10 are probably holding it beyond its TTL (for example, a process that resolved the name once at startup)
- low: New NLB health check failure causing DNS to return no records
Evidence: No direct evidence. The NLB is new and may not have passed health checks, but us-east-1 resolves and works, which argues against this
Investigation checklist
- Check DNS resolution for internal-services.example.com from us-west-2 pods
kubectl exec -n prod "$(kubectl get pods -n prod -l app=api-gateway -o name | head -n 1)" -- dig +short internal-services.example.com @<cluster-dns-ip>
Expected: Should return an IP; NXDOMAIN or an empty answer confirms the DNS misconfiguration
- Verify new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>
Expected: If the command returns an error or an empty list, the NLB does not exist in us-west-2
- Check nginx upstream error logs for DNS resolution failures
kubectl logs -n prod -l app=api-gateway --since=30m --tail=-1 | grep -i 'upstream timed out' | head -20
Expected: Should show 'Connection timed out' errors with the upstream hostname
- Compare DNS records between regions
dig +short internal-services.example.com @<us-west-2-resolver>; dig +short internal-services.example.com @<us-east-1-resolver>
Expected: us-west-2 should return an IP (old ALB or new NLB); an empty or NXDOMAIN answer there while us-east-1 resolves confirms the regional misconfiguration
- Check if old ALB is still healthy and serving
aws elbv2 describe-target-health --region us-west-2 --target-group-arn <old-alb-tg-arn>
Expected: Targets should be healthy; if not, old ALB may be degraded
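The per-region comparison in the checklist can be wrapped in a small helper that checks one name against several resolvers in a single pass. This is a minimal sketch, not part of the original runbook: the function names `classify_answer` and `check_resolvers` and the `DIG` override variable are hypothetical, and the resolver addresses must be supplied by the operator.

```shell
# Classify the filtered output of `dig +short` for one resolver.
# Prints "no-answer" when nothing was returned (NXDOMAIN, empty answer,
# or a timeout), otherwise "resolved".
classify_answer() {
  if [ -z "$1" ]; then
    echo "no-answer"
  else
    echo "resolved"
  fi
}

# Query one name against each resolver and print "<resolver> <status>".
# DIG may be overridden with a stub command for testing (assumption).
check_resolvers() {
  local name="$1"; shift
  local resolver answer
  for resolver in "$@"; do
    # Lines starting with ";" are dig diagnostics (e.g. timeouts), not answers.
    answer="$("${DIG:-dig}" +short +time=2 +tries=1 "$name" "@$resolver" 2>/dev/null | grep -v '^;' || true)"
    echo "$resolver $(classify_answer "$answer")"
  done
}

# Example (resolver IPs are placeholders to fill in):
#   check_resolvers internal-services.example.com US_WEST_2_RESOLVER_IP US_EAST_1_RESOLVER_IP
```

A "no-answer" line for the us-west-2 resolver next to "resolved" for us-east-1 would match the high-confidence hypothesis.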
Mitigation plan
Revert DNS CNAME for internal-services.example.com to point back to the old ALB in us-west-2
Risk: Low risk; old ALB is still up and answering. Resolvers may keep serving cached answers, including negative ones, until they expire, so expect a brief propagation delay.
Rollback: Re-apply the new CNAME if needed after fixing the NLB deployment
If revert not possible, create a new NLB in us-west-2 and update DNS to point to it
Risk: Medium risk; provisioning NLB takes time, and DNS propagation may cause continued errors. Ensure health checks pass before switching.
Rollback: Revert DNS to old ALB if new NLB fails
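Either mitigation should be followed by a check that the 502 rate is returning toward the 0.05% baseline. The following is a minimal sketch, assuming the api-gateway pods write nginx access lines in the default combined format (status code in the 9th whitespace-separated field) to stdout; `rate_502` is a hypothetical helper name.

```shell
# Print the percentage of requests with status 502, reading access log
# lines from stdin. Assumes nginx "combined" format (field 9 = status).
rate_502() {
  awk '
    { total++ }
    $9 == 502 { bad++ }
    END {
      if (total > 0) printf "%.2f\n", 100 * bad / total
      else print "0.00"
    }'
}

# Example (assumption: access logs are available via kubectl logs):
#   kubectl logs -n prod -l app=api-gateway --since=5m --tail=-1 | rate_502
```

Running it every minute or so after the change gives a rough trend; the alerting system's own 502 metric remains the authoritative signal.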
Customer impact
Approximately 35% of api-gateway requests in us-west-2 are failing with HTTP 502, and p99 latency there has risen to 4s. Users in us-east-1 are unaffected. The number of affected users is not yet known; the 35% figure is a share of us-west-2 request traffic, not a user count.
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
- 21:40: DNS team changed TTL and CNAME for internal-services.example.com
- 22:08: Pager alerted for 35% 502 errors in us-west-2
- 22:10: Incident declared
- 22:11: DNS team confirmed change
- 22:12: Debate on mitigation
- [FILL IN] Mitigation applied
- [FILL IN] Incident resolved
Impact
- 35% error rate on api-gateway in us-west-2 for [duration] minutes
- p99 latency 4s
- Estimated [X] users affected
Root Cause
DNS CNAME for internal-services.example.com was changed to point to a new NLB that only existed in us-east-1, causing DNS resolution failures in us-west-2.
Detection
Alert on 502 error rate triggered pager at 22:08.
Response
- [FILL IN] Reverted DNS to old ALB
- [FILL IN] Verified recovery
What Went Well
- Quick detection via monitoring
- DNS team available to confirm change
What Went Poorly
- DNS change not validated across regions before deployment
- No pre-change review for cross-region impact
Action Items
- [FILL IN] Add pre-deployment validation for DNS changes across regions
- [FILL IN] Implement canary DNS rollout
- [FILL IN] Review change management process for DNS
Follow-ups
- P0: Add pre-deployment validation to check that DNS targets exist in all required regions (platform team)
- P1: Implement canary DNS rollout with gradual traffic shift (DNS team)
- P1: Review change management process to require cross-region impact assessment for DNS changes (SRE team)
- P2: Add monitoring for DNS resolution failures per region (observability team)
- P2: Document incident and update runbook for DNS-related outages (on-call SRE)
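The P0 item could start as a pre-change gate that refuses to repoint a CNAME until the target load balancer is found in every region that consumes the record. This is a minimal sketch under stated assumptions: the load balancer has the same name in each region, the AWS CLI is installed and configured, and `lb_exists_in_regions` and the `AWS` override variable are hypothetical names. A failed call can also mean missing credentials or permissions rather than a missing load balancer, so a real gate would need to tell those cases apart.

```shell
# Return 0 only if the named load balancer is found in every listed region.
# Prints one "<region> present|MISSING" line per region.
# AWS may be overridden with a stub command for testing (assumption).
lb_exists_in_regions() {
  local lb_name="$1"; shift
  local region missing=0
  for region in "$@"; do
    if "${AWS:-aws}" elbv2 describe-load-balancers \
         --region "$region" --names "$lb_name" >/dev/null 2>&1; then
      echo "$region present"
    else
      echo "$region MISSING"
      missing=1
    fi
  done
  return "$missing"
}

# Example (NEW_NLB_NAME is a placeholder):
#   lb_exists_in_regions NEW_NLB_NAME us-east-1 us-west-2 || echo "block the DNS change"
```

Applied to this incident, a check over us-east-1 and us-west-2 would have reported the us-west-2 gap before the CNAME was changed.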
Similar past incidents
lexical match (pg_trgm)
- 64%: [Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 45%: [Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 43%: [Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 31%: [Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 30%: [Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts