[Eval][v1][zh] Regional 5xx spike after DNS TTL change
service: api-gateway · created: 6/3/2026, 12:01:16 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
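api-gateway in us-west-2 is returning 502 for 35% of requests after a DNS change, and p99 latency has risen to 4 seconds. The DNS team lowered the TTL of internal-services.example.com from 300 seconds to 30 seconds and repointed its CNAME to a new NLB that is deployed only in us-east-1, so resolution fails in us-west-2.
Severity reasoning: 35% of user requests are affected, but only in us-west-2, so this is not a global outage; there is no data loss or direct revenue loss; and a DNS rollback or configuration fix should restore service quickly.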
deepseek-chat·prompt v1·output: zh·12547ms·1814↑ / 1608↓ tok·$0.00226
Root cause hypotheses
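- high: The NLB that the DNS CNAME now points to is deployed only in us-east-1; us-west-2 has no corresponding record, so resolution fails there
Evidence: us-west-2 nginx logs show NXDOMAIN or timeouts; us-east-1 resolves normally to the new NLB
- high: Lowering the TTL from 300 seconds to 30 seconds made cached DNS entries expire quickly, but the new record did not propagate correctly to us-west-2
Evidence: the change was made at 21:40 and the failure was observed at 22:10; some pods still work using the old ALB IP
- medium: The new NLB's health checks or security groups are misconfigured and reject traffic from us-west-2
Evidence: us-west-2 nginx connections time out while us-east-1 is normal; the NLB listener configuration needs to be verified
- low: DNS resolvers cached the old ALB IP while the new NLB IP is unreachable, causing intermittent failures
Evidence: some pods succeed using the old IP while the rest fail; the old ALB is still online and has not been cleaned up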
Investigation checklist
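- Confirm the DNS resolution result in us-west-2
kubectl exec -n ingress -it <nginx-pod> -- nslookup internal-services.example.com
Expected: returns the new NLB IP (us-east-1) or the old ALB IP; NXDOMAIN or a timeout confirms the DNS problem
- Check whether the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>
Expected: an empty result or a LoadBalancerNotFound error, confirming that the NLB is deployed only in us-east-1
- Inspect the nginx upstream configuration
kubectl exec -n ingress -it <nginx-pod> -- cat /etc/nginx/conf.d/upstream.conf | grep internal-services
Expected: the upstream uses the domain name internal-services.example.com rather than a hard-coded IP
- Check the DNS change record
aws route53 list-resource-record-sets --hosted-zone-id <zone-id> --query "ResourceRecordSets[?Name=='internal-services.example.com.']"
Expected: shows a CNAME pointing to the new NLB's DNS name with a TTL of 30 seconds
- Verify that the old ALB is still reachable
curl -I http://<old-alb-dns>/health
Expected: returns 200, confirming that the old ALB is still running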
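The first check above inspects a single pod, while the logs suggest roughly 30% of pods behave differently from the rest. A minimal sketch of running the same lookup across every pod and tallying the outcomes follows. It assumes the `ingress` namespace used in the checklist and an `nslookup` binary in the image; `classify_lookup` and `sweep_pods` are hypothetical helper names, and the string matching is a rough heuristic because `nslookup` output differs between implementations.

```shell
# classify_lookup: read nslookup output on stdin and print one word:
# NXDOMAIN, TIMEOUT, RESOLVED, or UNKNOWN. Heuristic only; output formats
# differ between BIND and BusyBox nslookup.
classify_lookup() {
  local out
  out="$(cat)"
  case "$out" in
    *NXDOMAIN*) echo "NXDOMAIN" ;;
    *"timed out"*|*"no servers could be reached"*) echo "TIMEOUT" ;;
    *Name:*) echo "RESOLVED" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# sweep_pods NAMESPACE HOSTNAME: print "<pod> <result>" for each pod.
# Assumes every pod in the namespace is an nginx pod (an assumption, not
# something the incident notes confirm).
sweep_pods() {
  local ns="$1" host="$2" pod
  for pod in $(kubectl get pods -n "$ns" -o name); do
    printf '%s %s\n' "$pod" \
      "$(kubectl exec -n "$ns" "$pod" -- nslookup "$host" 2>&1 | classify_lookup)"
  done
}

# Example (not run here): count outcomes across the namespace.
# sweep_pods ingress internal-services.example.com | awk '{print $2}' | sort | uniq -c
```

A split between RESOLVED and NXDOMAIN or TIMEOUT pods would be consistent with the cached-IP observation in the logs, though a RESOLVED result alone does not show which load balancer the returned address belongs to.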
Mitigation plan
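Roll back the DNS change immediately: point the CNAME for internal-services.example.com back at the old ALB and restore the TTL to 300 seconds
Risk: DNS propagation delay during the rollback may briefly worsen the failure; the old ALB is still running, so the risk is low
Rollback: re-apply the current DNS record (new NLB), but only after the us-west-2 resolution problem is fixed
If rollback is not feasible, create a new NLB in us-west-2 and update the DNS record to point to it
Risk: creating an NLB takes several minutes, during which the failure continues; security groups and health checks must be configured correctly
Rollback: delete the us-west-2 NLB and roll DNS back to the old ALB
Temporarily hard-code the old ALB IP as the upstream in the us-west-2 nginx configuration, bypassing DNS resolution
Risk: a hard-coded IP is inflexible and later DNS changes require manual updates; it may violate configuration-management policy
Rollback: restore the nginx configuration to resolve by domain name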
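The first mitigation (restoring the CNAME and the 300-second TTL) could be prepared as a Route 53 change batch. This is only a sketch: it assumes the record is a plain CNAME in a Route 53 hosted zone, as the investigation checklist implies. `make_rollback_batch` is a hypothetical helper, and `ZONE_ID` and `OLD_ALB_DNS` are placeholders to be supplied by the operator.

```shell
# make_rollback_batch NAME TARGET TTL: print a Route 53 change batch that
# UPSERTs a CNAME record. Hypothetical helper; it only builds JSON.
make_rollback_batch() {
  local name="$1" target="$2" ttl="$3"
  cat <<EOF
{
  "Comment": "Roll back ${name} to the legacy ALB",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${target}" }]
      }
    }
  ]
}
EOF
}

# Example (not run here); ZONE_ID and OLD_ALB_DNS must be set by the operator:
# make_rollback_batch internal-services.example.com. "$OLD_ALB_DNS" 300 > rollback.json
# aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" \
#   --change-batch file://rollback.json
```

Writing the batch to a file first leaves room for a second person to review the record name, target, and TTL before the change is applied.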
Customer impact
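About 35% of user requests in us-west-2 are failing with 502 errors; affected users cannot use the service normally, and p99 latency has risen to 4 seconds. Other regions are normal. Recovery is expected within 10-15 minutes of the DNS rollback.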
Postmortem draft
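Incident review: us-west-2 api-gateway 502 errors
Summary
[fill in summary]
Timeline
- 21:40 UTC: DNS team changed the TTL and CNAME for internal-services.example.com
- 22:08 UTC: Alert received
- 22:10 UTC: Incident confirmed
- 22:11 UTC: DNS team confirmed the planned change
- 22:12 UTC: Rollback options discussed
- [actual recovery time]
Impact
- Service: api-gateway (us-west-2)
- Error rate: 35% 502
- Latency: p99 4s
- Users: users in the us-west-2 region
Root cause
The DNS CNAME points to an NLB deployed only in us-east-1, so resolution fails in us-west-2.
What went well
- The DNS change was identified quickly
- us-east-1 was unaffected
What went poorly
- The DNS change did not account for the multi-region deployment
- The new NLB was not verified as available in all regions before the change
- The old ALB was not cleaned up promptly, although this time it prevented a complete outage
Action items
- [ ] Establish a multi-region verification process for DNS changes
- [ ] Add monitoring and alerting for DNS resolution
- [ ] Review all DNS records to ensure CNAME targets are available across regions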
Follow-ups
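- P0: Establish a multi-region verification process for DNS changes that requires confirming the target service is available in all regions before the change (owner: platform team)
- P1: Add an alert for DNS resolution failures that monitors the resolution success rate of internal-services.example.com in us-west-2 (owner: on-call SRE)
- P2: Clean up the old ALB DNS records to avoid later confusion (owner: service owner)
- P1: Review all cross-region DNS records to ensure CNAME targets exist in every dependent region (owner: platform team)
- P2: Add a backup upstream IP to the nginx configuration as a fallback when DNS fails (owner: service owner)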
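The resolution success-rate follow-up could start from a probe like the one below. It is a rough sketch rather than a monitoring design: `resolve_success_rate` is a hypothetical helper, it defaults to `getent hosts` (which goes through the system resolver and may be affected by local caching), and `RESOLVE_CMD` is an assumed override for hosts where a different lookup tool is preferred.

```shell
# resolve_success_rate HOST ATTEMPTS: try to resolve HOST the given number
# of times and print the integer percentage of successful lookups.
# RESOLVE_CMD (optional) overrides the lookup command; default: getent hosts.
resolve_success_rate() {
  local host="$1" attempts="$2" ok=0 i=0
  if [ "$attempts" -le 0 ]; then
    echo "attempts must be positive" >&2
    return 1
  fi
  while [ "$i" -lt "$attempts" ]; do
    # RESOLVE_CMD is intentionally unquoted so "getent hosts" splits into words.
    if ${RESOLVE_CMD:-getent hosts} "$host" >/dev/null 2>&1; then
      ok=$((ok + 1))
    fi
    i=$((i + 1))
  done
  echo $((ok * 100 / attempts))
}

# Example (not run here): probe from a host or pod in us-west-2.
# resolve_success_rate internal-services.example.com 20
```

The probe would need to run inside us-west-2 to reflect what the nginx pods see; the alert threshold and probe interval are left open because the incident notes do not specify them.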
Similar past incidents
lexical match (pg_trgm)
- 78%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 65%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 44%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 44%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 18%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts