[Eval][v1][zh] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:01:48 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08 pager
- 22:11 DNS team confirms the planned change
- 22:12 debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
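In us-west-2, a DNS change caused api-gateway to return 502 for about 35% of requests, and p99 latency rose to 4s. The DNS team lowered the TTL of internal-services.example.com from 300s to 30s and repointed its CNAME at a new NLB that is deployed only in us-east-1, so us-west-2 pods either fail to resolve the name or keep using stale IPs.
Severity reasoning: About 35% of requests are affected, but only in us-west-2, so this is not a global outage. There is no data loss or corruption, and service can be restored quickly through a DNS rollback or a configuration fix.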
deepseek-chat·prompt v1·output: zh·10399ms·1814↑ / 1312↓ tok·$0.00193
Root cause hypotheses
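- high: The DNS CNAME points to an NLB that exists only in us-east-1; there is no us-west-2 record, so resolution fails in us-west-2.
Evidence: nginx logs in us-west-2 show NXDOMAIN or timeouts; us-east-1 is healthy.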
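- medium: Lowering the TTL from 300s to 30s made cached answers expire quickly, while the new record did not propagate correctly to the DNS resolvers in us-west-2.
Evidence: The change was made at 21:40 and 35% of requests were still failing at 22:10; some pods keep working on the old ALB IP.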
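- low: The new NLB's health checks are misconfigured, so us-west-2 traffic is routed to unhealthy endpoints.
Evidence: No direct evidence, but the 502s coincide with connection timeouts, so the target group may be misconfigured.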
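One way to weigh the TTL hypothesis is to work out when a TTL-honoring cache could last have served the pre-change answer. The sketch below assumes the old record was handed out with its full 300s TTL right up to the 21:40 UTC change and that every cache respects TTL; the function name and the placeholder date are illustrative, not taken from the incident tooling.

```python
from datetime import datetime, timedelta


def latest_stale_expiry(change_time: datetime, old_ttl_seconds: int) -> datetime:
    """Latest moment a TTL-honoring cache could still serve the old answer.

    Worst case: the cache fetched the old record just before the change,
    so it keeps that answer for one full old TTL.
    """
    return change_time + timedelta(seconds=old_ttl_seconds)


# Times of day come from the incident context; the date is a placeholder.
change_time = datetime(2000, 1, 1, 21, 40)
observed_time = datetime(2000, 1, 1, 22, 10)

expiry = latest_stale_expiry(change_time, old_ttl_seconds=300)
ttl_explains_stale_ips = observed_time <= expiry
print(expiry.strftime("%H:%M"), ttl_explains_stale_ips)  # 21:45 False
```

Under these assumptions the old answer should have aged out by 21:45 UTC, so pods still using the old ALB IP at 22:10 are probably holding it some other way, for example nginx resolving upstream hostnames only when its configuration is loaded. That is an inference to check, not something the logs confirm.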
Investigation checklist
- Confirm DNS resolution from us-west-2
kubectl exec -n ingress <nginx-pod> -- nslookup internal-services.example.com 2>/dev/null || dig internal-services.example.com @<dns-server>
Expected: Once fixed, this returns the IP of a load balancer reachable from us-west-2. NXDOMAIN or an old ALB IP confirms the DNS problem.
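- Check whether the new NLB has a target group in us-west-2
aws elbv2 describe-target-groups --region us-west-2 --names <new-nlb-tg-name>
Expected: The target group exists and has healthy targets (target health itself is reported by aws elbv2 describe-target-health). If the target group does not exist, the NLB was never deployed to us-west-2.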
- Verify the DNS change record
aws route53 list-resource-record-sets --hosted-zone-id <zone-id> --query "ResourceRecordSets[?Name=='internal-services.example.com.']"
Expected: The CNAME record should point to the DNS name of an NLB in us-west-2; if it points to us-east-1, the problem is confirmed.
- Check the nginx upstream configuration
kubectl exec -n ingress <nginx-pod> -- grep internal-services /etc/nginx/conf.d/upstream.conf
Expected: The upstream should reference the hostname rather than a hard-coded IP; if it uses an IP, it needs to be updated.
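The checklist above ends up sorting each pod into one of a few buckets: cannot resolve, still on the legacy ALB, or on the new NLB. A minimal sketch of that classification follows, assuming the legacy and new load balancer IPs are known ahead of time. The IPs used here are documentation-range placeholders, and both function names are hypothetical.

```python
import socket
from typing import Set


def resolve_ips(hostname: str) -> Set[str]:
    """Return the IPs the local resolver gives for hostname (empty on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Raised for NXDOMAIN and for other resolver failures alike.
        return set()
    return {info[4][0] for info in infos}


def classify(resolved: Set[str], legacy_ips: Set[str], new_ips: Set[str]) -> str:
    """Bucket a resolution result for triage."""
    if not resolved:
        return "unresolved"
    if resolved <= legacy_ips:
        return "legacy-alb"
    if resolved <= new_ips:
        return "new-nlb"
    return "mixed-or-unknown"


# Placeholder IPs (assumptions), not the real load balancer addresses.
LEGACY_ALB_IPS = {"192.0.2.10", "192.0.2.11"}
NEW_NLB_IPS = {"198.51.100.20"}

print(classify({"192.0.2.10"}, LEGACY_ALB_IPS, NEW_NLB_IPS))  # legacy-alb
print(classify(set(), LEGACY_ALB_IPS, NEW_NLB_IPS))           # unresolved
```

Running `classify(resolve_ips("internal-services.example.com"), ...)` from inside a pod would give a per-pod view. Note that `resolve_ips` reports what the pod's resolver returns now, which may differ from an address nginx cached earlier.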
Mitigation plan
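Roll back the DNS change immediately: point the internal-services.example.com CNAME back at the legacy ALB and restore the TTL to 300s.
Risk: DNS propagation delay during the rollback may briefly worsen the failure, but the legacy ALB is still running, so the risk is low.
Rollback: Reapply the CNAME record for the new NLB.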
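If a rollback is not feasible, create an NLB in us-west-2 and update the DNS record to point at it.
Risk: Creating an NLB takes time and may introduce new configuration errors; security groups and health checks must be verified.
Rollback: Delete the new NLB and roll DNS back to the legacy ALB.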
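The first mitigation could be prepared as a Route 53 change batch. The sketch below assumes the record is a plain CNAME hosted in Route 53 (the investigation checklist queries Route 53, but an alias record would need a different shape). The legacy ALB DNS name is a placeholder to fill in, and the function name is hypothetical.

```python
import json


def build_rollback_change_batch(record_name: str, legacy_target: str, ttl: int = 300) -> dict:
    """Build a Route 53 ChangeBatch that re-points a CNAME and restores its TTL."""
    return {
        "Comment": "Roll back CNAME to legacy ALB and restore TTL",
        "Changes": [
            {
                # UPSERT overwrites the existing CNAME in one step.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": legacy_target}],
                },
            }
        ],
    }


batch = build_rollback_change_batch(
    "internal-services.example.com.",
    "<legacy-alb-dns-name>",  # placeholder: the real ALB name is not in the report
)
print(json.dumps(batch, indent=2))
```

Saved to a file, the JSON could be applied with `aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch file://rollback.json`. Keeping the pre-rollback batch alongside it gives the "reapply the new NLB CNAME" path the same shape.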
Customer impact
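About 35% of requests in us-west-2 are failing with 502 errors, and affected users cannot use the service normally. Other regions are unaffected. Recovery is expected within 5-10 minutes after the DNS rollback takes effect.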
Postmortem draft
Postmortem
Summary
A DNS misconfiguration caused about 35% of api-gateway requests in us-west-2 to fail.
Timeline (UTC)
- 21:40 DNS team changes the TTL and CNAME
- 22:08 Alert fires
- 22:10 Failure confirmed
- 22:XX DNS rolled back
- 22:XX Recovered
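The known timeline entries already give a time-to-detect figure, and the same arithmetic can fill in the XX placeholders once the rollback and recovery times are recorded. A small sketch, assuming all timestamps fall on the same UTC day; the helper name is illustrative.

```python
from datetime import datetime


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return int(delta.total_seconds() // 60)


# DNS change (21:40) to alert (22:08), both taken from the timeline.
time_to_detect = minutes_between("21:40", "22:08")
print(time_to_detect)  # 28
```

The 28-minute gap between the change and the page is the figure the alerting follow-ups would aim to shrink.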
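Impact
About 35% of requests in us-west-2, for approximately XX minutes.
Root cause
The DNS CNAME pointed to an NLB that exists only in us-east-1; us-west-2 had no corresponding record.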
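What went well
- Fast alerting and response
- Timely communication from the DNS team
What went poorly
- The change was not validated in us-west-2
- No pre-release checks were in place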
Action items
- [ ] Add cross-region validation for DNS changes
- [ ] Add DNS resolution monitoring
- [ ] Improve the change process
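The cross-region validation item could take the form of a pre-change gate that compares the regions serving traffic with the regions where the new target is actually deployed. The sketch below is one possible shape, assuming that inventory is available as a simple mapping; the target label "new-nlb" and the function name are hypothetical.

```python
from typing import Dict, List


def regions_missing_target(
    serving_regions: List[str],
    target_regions: Dict[str, List[str]],
    target: str,
) -> List[str]:
    """Regions that serve traffic but have no deployment of the given target."""
    deployed = set(target_regions.get(target, []))
    return [region for region in serving_regions if region not in deployed]


# Inventory matching the incident: the new NLB exists in us-east-1 only.
serving = ["us-east-1", "us-west-2"]
inventory = {"new-nlb": ["us-east-1"]}

missing = regions_missing_target(serving, inventory, "new-nlb")
print(missing)  # ['us-west-2']
```

A change pipeline could refuse to repoint the CNAME while this list is non-empty, which would have flagged us-west-2 before the 21:40 change.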
Follow-ups
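- P1: Add a cross-region validation step for DNS changes to the change process (owner: platform team)
- P1: Add monitoring and alerting for DNS resolution failures (owner: on-call SRE)
- P2: Review the DNS change process and add a pre-release checklist (owner: service owner)
- P2: Consider a service mesh or a local DNS cache to reduce dependence on DNS (owner: platform team)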
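The DNS resolution alerting follow-up could start from a failure ratio over a window of probe results. The sketch below is a minimal version of that idea; the 5% threshold and 20-sample minimum are illustrative assumptions rather than tuned values, and both function names are hypothetical.

```python
from typing import Iterable, List


def dns_failure_ratio(outcomes: Iterable[bool]) -> float:
    """Fraction of probes that failed; each outcome is True when resolution succeeded."""
    results: List[bool] = list(outcomes)
    if not results:
        return 0.0
    failures = sum(1 for ok in results if not ok)
    return failures / len(results)


def should_alert(outcomes: Iterable[bool], threshold: float = 0.05, min_samples: int = 20) -> bool:
    """Alert when enough probes were collected and the failure ratio exceeds the threshold."""
    results = list(outcomes)
    if len(results) < min_samples:
        return False  # too few samples to judge
    return dns_failure_ratio(results) > threshold


# A window resembling the incident: 7 of 20 probes fail (35%).
probe_window = [False] * 7 + [True] * 13
print(dns_failure_ratio(probe_window), should_alert(probe_window))  # 0.35 True
```

Running such probes per region matters here, since an aggregate across us-east-1 and us-west-2 would dilute a single-region failure.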
Similar past incidents
lexical match (pg_trgm)
- 73%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 60%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 43%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 43%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 18%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts