[Eval][v1][zh] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:01:32 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
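A DNS configuration change caused api-gateway in us-west-2 to return HTTP 502 for 35% of requests, with p99 latency rising to 4 seconds. The root cause is that the DNS team repointed the CNAME for internal-services.example.com to a new NLB that exists only in us-east-1. No corresponding record exists for us-west-2, so resolution fails on a share of the pods there.
Severity reasoning: 35% of requests are failing visibly to users and p99 latency is 4 seconds. Any request served from us-west-2 can be affected, which makes this a user-facing outage and meets the SEV1 criteria.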
deepseek-chat·prompt v1·output: zh·10767ms·1814↑ / 1337↓ tok·$0.00196
Root cause hypotheses
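- high: The DNS CNAME target points to an NLB that exists only in us-east-1. No corresponding record exists for us-west-2, so resolution fails there.
Evidence: nginx logs show upstream connection timeouts, and DNS resolution returns NXDOMAIN or stale IPs. us-east-1 is healthy because the new NLB is deployed in that region.
- high: After the DNS TTL dropped from 300 seconds to 30 seconds, some pods picked up the new record quickly, but that record points to an NLB they cannot reach.
Evidence: The 35% failure rate is plausibly explained by the share of pods whose cache has refreshed since the TTL change, and the pods that still serve requests successfully are using a cached old ALB IP. The logs report only ~30% of pods still holding the old ALB IP, so the per-pod and per-request figures do not line up exactly and should be reconciled.
- low: The health check on the new NLB is misconfigured, so us-west-2 traffic is routed to unhealthy nodes.
Evidence: us-west-2 has no DNS record for the new NLB at all, so health checks are not in the request path. This hypothesis is unlikely.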
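As a rough cross-check of the TTL hypothesis, the sketch below computes the latest time at which a resolver that honors TTLs could still hold a record cached just before the 21:40 UTC change under the old 300 s TTL. The helper name `latest_expiry` is hypothetical, and the calculation assumes caches expire strictly by TTL. If pods still hold the old ALB IP well after that time, some cache that does not follow the record TTL may be involved; that is a possibility worth checking, not something the incident data confirms.

```shell
# latest_expiry HH:MM TTL_SECONDS
# Prints the HH:MM (UTC) at which an entry cached at HH:MM must expire,
# assuming the cache honors the TTL it received.
latest_expiry() {
  local h m ttl total
  h=${1%%:*}; m=${1##*:}; ttl=$2
  h=${h#0}; m=${m#0}            # strip one leading zero so 08/09 are not read as octal
  total=$(( h * 3600 + m * 60 + ttl ))
  printf '%02d:%02d\n' $(( total / 3600 % 24 )) $(( total % 3600 / 60 ))
}

latest_expiry 21:40 300   # prints 21:45
```

With the figures from this incident, TTL-bound caches of the old record would have expired by 21:45 UTC, about 25 minutes before the 22:10 UTC snapshot.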
Investigation checklist
- Confirm what DNS resolution returns
kubectl exec -n ingress -it <pod-name> -- nslookup internal-services.example.com
Expected: some pods return the old ALB IP; others return NXDOMAIN or the new NLB IP (which is unreachable from us-west-2)
- Check whether the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name
Expected: an empty result or an error, confirming that us-west-2 has no such NLB
- Check the nginx upstream configuration
kubectl exec -n ingress -it <pod-name> -- cat /etc/nginx/conf.d/default.conf | grep internal-services
Expected: the upstream points to internal-services.example.com:443
- Confirm the old ALB is still healthy
aws elbv2 describe-target-health --region us-west-2 --target-group-arn <old-alb-tg-arn>
Expected: all targets healthy
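The first check above runs against one pod at a time. A minimal sketch of tallying the outcome across every pod follows, assuming `nslookup` is available in the pod image and prints the usual `NXDOMAIN`, `timed out`, or `Name:` markers (output wording varies between nslookup implementations). The function names `classify_lookup` and `tally_pods` are hypothetical; the `ingress` namespace is the one used in the checklist.

```shell
# Classify one nslookup transcript read from stdin as
# nxdomain, timeout, resolved, or unknown.
classify_lookup() {
  local out
  out="$(cat)"
  case "$out" in
    *NXDOMAIN*)    echo nxdomain ;;
    *"timed out"*) echo timeout ;;
    *)
      if printf '%s\n' "$out" | grep -q '^Name:'; then
        echo resolved
      else
        echo unknown
      fi
      ;;
  esac
}

# Run the lookup in every pod of a namespace and count each outcome.
# Requires kubectl access to the cluster; it is defined here but not invoked.
tally_pods() {
  local ns="$1" pod
  for pod in $(kubectl get pods -n "$ns" -o name); do
    kubectl exec -n "$ns" "$pod" -- nslookup internal-services.example.com 2>&1 \
      | classify_lookup
  done | sort | uniq -c
}
```

Running `tally_pods ingress` in us-west-2 would give a per-outcome pod count that can be compared with the ~30% of pods reported as still working. A `resolved` result does not say which address was returned, so the single-pod check is still needed to tell the old ALB IP from the new NLB IP.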
Mitigation plan
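Roll back the DNS change immediately: point the CNAME for internal-services.example.com back to the old ALB and restore the TTL to 300 seconds.
Risk: while the rollback propagates through DNS, some pods may keep failing for a few minutes. There is no risk of data loss.
Rollback: re-apply the current DNS change (pointing to the new NLB), but only after the NLB deployment gap in us-west-2 has been fixed.
If rolling back DNS is not feasible, create a new NLB in us-west-2 with an equivalent target group, then update the DNS records.
Risk: creating an NLB takes time and may introduce new configuration errors. Security groups and routing must be verified.
Rollback: delete the new NLB in us-west-2 and roll DNS back to the old ALB.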
Customer impact
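About 35% of user requests in us-west-2 fail with a 502 error, and affected users cannot use the service normally. p99 latency has risen to 4 seconds, a severe degradation in experience. Recovery is expected within 5-10 minutes after the DNS rollback takes effect.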
Postmortem draft
Postmortem
Summary
A DNS misconfiguration caused 35% of api-gateway requests in us-west-2 to fail.
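Timeline
- 21:40 UTC: DNS team changes the CNAME and TTL for internal-services.example.com
- 22:08 UTC: on-call engineer is paged
- 22:10 UTC: 502 error rate in us-west-2 observed at ~35%
- 22:11 UTC: DNS team confirms the change
- 22:12 UTC: rollback options are discussed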
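Impact
35% of requests in us-west-2 failed and p99 latency reached 4 seconds, for roughly 30 minutes between the 21:40 UTC change and the 22:10 UTC snapshot.
Root cause
The DNS CNAME points to a new NLB deployed only in us-east-1; us-west-2 has no corresponding record.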
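What went well
- The DNS change was quickly identified as the root cause
- Cross-team collaboration went smoothly
What went poorly
- The DNS change was not validated for cross-region availability beforehand
- The change had no staged rollout and no pre-check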
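Action items
- [ ] Establish a cross-region validation process for DNS changes
- [ ] Add monitoring and alerting for DNS resolution
- [ ] Add a regional consistency check to change management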
Follow-ups
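- P0: Establish a cross-region validation process for DNS changes so that new records exist in every target region. Owner: platform team
- P1: Add monitoring and alerting for DNS resolution failures, covering NXDOMAIN and timeouts. Owner: on-call SRE
- P1: Review the DNS change management process and add a regional consistency check step. Owner: service owner
- P2: Consider DNS health checks or weighted routing for staged cutovers. Owner: platform team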
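The P0 cross-region validation could start as a simple gate that fails when any target region gets no answer for the record. The sketch below is one possible shape, not an existing tool: `check_regions` is a hypothetical helper that reads one `region address` pair per line on stdin, where an empty address means the lookup from that region returned nothing. How each region's answer is collected (for example, by running a resolver query from a host in that region) is left out and would depend on the environment.

```shell
# Read "region address" lines from stdin; an empty address means no answer.
# Prints one OK/FAIL line per region and returns non-zero if any region failed.
check_regions() {
  local region answer rc
  rc=0
  while read -r region answer; do
    [ -z "$region" ] && continue          # skip blank lines
    if [ -z "$answer" ]; then
      echo "FAIL $region: no address returned"
      rc=1
    else
      echo "OK $region: $answer"
    fi
  done
  return $rc
}
```

Fed with an address for us-east-1 and an empty answer for us-west-2, the check prints a FAIL line for us-west-2 and returns a non-zero status, which is the condition that would have blocked the 21:40 UTC change.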
Similar past incidents
lexical match (pg_trgm)
- 86%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 64%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 44%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 44%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 18%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts