[Eval][v2][en] Regional 5xx spike after DNS TTL change
service: api-gatewaycreated: 6/3/2026, 12:02:47 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected. Symptoms (us-west-2 only): - 502 rate: 35% (baseline 0.05%) - p99 latency: 4s (mostly DNS resolution timeouts in error trace) - nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream" Recent changes: - 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration - Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB) - Old ALB still up and answering, but its DNS records were not removed - New NLB was created in us-east-1 only — no us-west-2 record exists Logs: - nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs - ~30% of pods have cached the old ALB IP and are working; rest are failing - us-east-1 pods all resolve correctly to the new NLB On-call: - 22:08 — pager - 22:11 — DNS team confirms the planned change - 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
A planned DNS change for internal-services.example.com (TTL reduction and CNAME target switch) caused a partial outage in us-west-2, where the new NLB target does not exist. This results in 35% 502 errors and elevated latency due to DNS resolution failures for pods that cannot resolve the new target.
Severity reasoning: User-facing outage with error rate >1% (35%) for >5 minutes, affecting a significant portion of traffic in us-west-2. Revenue path may be broken if api-gateway is critical. This meets SEV1 criteria.
deepseek-chat·prompt v2·output: en·15362ms·1784↑ / 2017↓ tok·$0.00270
Root cause hypotheses
- highDNS CNAME target changed to a NLB that does not exist in us-west-2, causing resolution failures for pods that do not have cached old IPs.
Evidence: Logs show NXDOMAIN or stale IPs for internal-services.example.com; new NLB only exists in us-east-1. 30% of pods with cached old ALB IPs still work.
- highTTL reduction from 300s to 30s caused rapid propagation of the new (broken) DNS record, increasing the failure rate as caches expired.
Evidence: DNS change occurred at 21:40 UTC; incident started shortly after. TTL reduction accelerates adoption of the bad record.
- mediumDNS team did not validate that the new NLB existed in all regions before changing the CNAME.
Evidence: New NLB created only in us-east-1; no us-west-2 record. This is a procedural gap.
- lownginx resolver configuration in us-west-2 is not using a valid DNS server or has caching issues.
Evidence: Some pods work (cached old IPs), others fail. This suggests resolver behavior is normal but the record is missing.
- lowThe old ALB in us-west-2 was inadvertently taken out of service or its DNS records removed.
Evidence: Old ALB still up and answering, but its DNS records were not removed. This is not the cause.
Investigation checklist
- Verify that the new NLB exists in us-west-2.
aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name 2>&1Expected: If NLB not found, command returns error 'LoadBalancerNotFound'. This confirms the root cause.
- Check DNS resolution for internal-services.example.com from a failing pod in us-west-2.
kubectl exec -n prod -l app=api-gateway -- nslookup internal-services.example.com 2>&1 | head -20Expected: NXDOMAIN or IP of new NLB (which is in us-east-1, not us-west-2).
- Check DNS resolution from a working pod (likely has cached old ALB IP).
kubectl exec -n prod -l app=api-gateway -- nslookup internal-services.example.com 2>&1 | head -20Expected: Old ALB IP (e.g., 10.x.x.x) still cached.
- Check nginx error logs for upstream connection timeouts.
kubectl logs -n prod -l app=api-gateway --since=30m | grep 'upstream timed out' | head -20Expected: Lines showing 'upstream timed out (110: Connection timed out) while connecting to upstream'.
- Verify the old ALB is still healthy and serving in us-west-2.
aws elbv2 describe-target-health --region us-west-2 --load-balancer-arn <old-alb-arn> 2>&1Expected: All targets healthy. This confirms old ALB is available.
- Check if any pods have the new NLB IP cached (should not exist in us-west-2).
kubectl exec -n prod -l app=api-gateway -- nslookup internal-services.example.com 2>&1 | grep -E '^Address' | head -5Expected: If new NLB IP appears, it's from us-east-1 and will not work.
Mitigation plan
Revert DNS change: point internal-services.example.com back to the old ALB CNAME and restore TTL to 300s.
Risk: Low risk; old ALB is healthy. May cause brief propagation delay.
Rollback: Re-apply the new DNS change if needed after fixing the NLB deployment.
Alternatively, create a new NLB in us-west-2 and update the DNS CNAME to point to it (if faster than revert).
Risk: Medium risk: creating NLB takes time, and DNS propagation may cause continued errors. Requires correct configuration.
Rollback: Revert DNS to old ALB if new NLB fails.
If revert is chosen, flush DNS caches on api-gateway pods to speed recovery.
Risk: Brief connection drain during restart; pods will re-resolve DNS.
Rollback: Monitor; if restart causes issues, wait for natural cache expiry.
Customer impact
Approximately 35% of requests to the API gateway in us-west-2 are failing with HTTP 502 errors. Users in that region may experience service unavailability or timeouts. us-east-1 users are unaffected. Estimated affected users: 35% of us-west-2 traffic.
Postmortem draft
Summary
A planned DNS change (TTL reduction and CNAME target switch) caused a partial outage in us-west-2 because the new NLB target did not exist in that region. 35% of requests failed with 502 errors.
Timeline (UTC)
- 21:40 - DNS team changes TTL from 300s to 30s and CNAME target to new NLB (us-east-1 only).
- [FILL IN] - First 502 errors observed.
- 22:08 - Pager triggered.
- 22:10 - Incident declared.
- 22:11 - DNS team confirms change.
- [FILL IN] - Mitigation applied (revert DNS or create NLB).
- [FILL IN] - Error rate returns to baseline.
Impact
- 35% error rate on api-gateway in us-west-2 for [FILL IN] minutes.
- p99 latency 4s.
- No data loss.
Root Cause
DNS CNAME target changed to a NLB that did not exist in us-west-2. TTL reduction accelerated propagation of the broken record.
Detection
Pager alert based on elevated 502 error rate.
Response
- [FILL IN] - DNS reverted or NLB created.
- [FILL IN] - Pods restarted to clear cache.
What Went Well
- DNS team quickly identified the change.
- Old ALL still healthy, enabling quick revert.
What Went Poorly
- No pre-change validation that new NLB existed in all regions.
- No canary or staged rollout of DNS change.
Action Items
- [FILL IN] - Add pre-change validation for multi-region DNS targets.
- [FILL IN] - Implement staged DNS rollouts with monitoring.
- [FILL IN] - Review DNS change process to require region coverage check.
Follow-ups
- P0Add pre-change validation script to verify DNS target exists in all required regions.— DNS team
- P1Implement staged DNS rollout with automatic rollback on error rate increase.— Platform team
- P1Update incident response runbook to include DNS change verification steps.— SRE team
- P2Add monitoring for DNS resolution failures per region.— Observability team
- P2Review TTL reduction best practices and add guardrails.— DNS team
Similar past incidents
lexical match (pg_trgm)
- 69%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 68%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 46%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 42%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 28%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts