[Eval][v3][zh] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:04:57 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
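In us-west-2, api-gateway is returning HTTP 502 for about 35% of requests and p99 latency has risen to 4 s. The trigger is the DNS team's 21:40 UTC change, which lowered the TTL for internal-services.example.com from 300 s to 30 s and repointed its CNAME at a new NLB that is deployed only in us-east-1. About 30% of pods still hold the cached old ALB IP and work normally; the rest fail DNS resolution or time out, which surfaces as 502s.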
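Severity reasoning: User-visible outage. 35% of requests return 502, which exceeds the 1% threshold and has persisted for more than 5 minutes. An error rate of 35% is far above the 0.05% baseline and meets the SEV1 criteria.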
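The severity rule quoted above can be sketched as a small check. This is only an illustration: the 1% and 5-minute thresholds come from the reasoning above, while the `ErrorWindow` type, the function names, and the use of strict comparisons are assumptions rather than the team's actual severity policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorWindow:
    error_rate: float        # fraction of failed requests, e.g. 0.35 for 35%
    duration_minutes: float  # how long that rate has been sustained


# Thresholds as quoted in the severity reasoning (assumed to be strict bounds).
SEV1_ERROR_RATE = 0.01    # 1% of requests
SEV1_MIN_DURATION = 5.0   # minutes


def is_sev1(window: ErrorWindow) -> bool:
    """True when both the error rate and its duration exceed the thresholds."""
    return (
        window.error_rate > SEV1_ERROR_RATE
        and window.duration_minutes > SEV1_MIN_DURATION
    )


def times_over_baseline(error_rate: float, baseline: float) -> float:
    """How many times higher the observed error rate is than the baseline."""
    return error_rate / baseline
```

With the figures from this incident, `times_over_baseline(0.35, 0.0005)` comes out at roughly 700, which is the gap between the 35% error rate and the 0.05% baseline.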
deepseek-chat·prompt v3·output: zh·13851ms·2446↑ / 1828↓ tok·$0.00267
Root cause hypotheses
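- high: The CNAME target was changed without a new NLB being deployed in us-west-2, so resolution fails there
Evidence: Logs show nginx resolving internal-services.example.com to NXDOMAIN or stale IPs, with resolution timeouts in the error traces. The new NLB was created only in us-east-1, and no us-west-2 record exists.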
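- high: Lowering the TTL from 300 s to 30 s sped up DNS cache expiry, so more pods picked up the bad record quickly
Evidence: The change landed at 21:40 and the 502s were reported at 22:10, so caches expired progressively over those 30 minutes. Under the old TTL, caches would have refreshed more slowly, which may have delayed the onset of the failure.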
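- high: The old ALB's DNS records were never removed, so pods holding the old IP keep working while the new NLB is unreachable from us-west-2
Evidence: About 30% of pods work normally, matching the pods that cached the old ALB IP. The rest fail because the new NLB has no us-west-2 record.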
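- medium: The nginx upstream configuration uses neither the resolve parameter nor a variable, so its DNS resolution does not follow the new TTL
Evidence: By default nginx resolves upstream hostnames when the configuration is loaded and keeps the result until a restart or reload. However, only some pods still hold the old IP, so caching behavior is inconsistent across pods, possibly because of differences in kube-dns or node-level caches.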
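- low: A network policy or security group in us-west-2 blocks access to the new NLB
Evidence: The new NLB is deployed only in us-east-1 and has no us-west-2 target group, so connections would fail even if resolution succeeded. The logs point to DNS resolution failures rather than refused connections, so this is unlikely.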
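A quick arithmetic check can help weigh the TTL and nginx hypotheses against each other. The sketch below assumes a cache that strictly honors the record TTL and that fetched its answer just before the 21:40 change; the function names are illustrative. Under those assumptions, a pre-change answer cannot survive past 21:45, so pods still using the old ALB IP at 22:10 are probably not explained by TTL caching alone, which is consistent with nginx holding addresses resolved at configuration load.

```python
from datetime import datetime, timedelta


def latest_stale_expiry(change_time: datetime, old_ttl_seconds: int) -> datetime:
    """Latest moment a TTL-honoring cache could still serve a pre-change answer."""
    return change_time + timedelta(seconds=old_ttl_seconds)


def explained_by_ttl_cache(
    observed_at: datetime, change_time: datetime, old_ttl_seconds: int
) -> bool:
    """True if a stale answer seen at observed_at fits within the old TTL."""
    return observed_at <= latest_stale_expiry(change_time, old_ttl_seconds)


# Only the time of day matters here; strptime fills in a placeholder date.
change_time = datetime.strptime("21:40", "%H:%M")
observed_at = datetime.strptime("22:10", "%H:%M")

print(latest_stale_expiry(change_time, 300).strftime("%H:%M"))  # 21:45
print(explained_by_ttl_cache(observed_at, change_time, 300))    # False
```

This does not rule out the TTL hypothesis as a contributing factor; it only suggests that the surviving 30% of pods need a second explanation.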
Investigation checklist
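- Confirm DNS resolution results
kubectl get pods -n prod -l app=api-gateway -o name | head -1 | xargs -I{} kubectl exec -n prod {} -- nslookup internal-services.example.com 2>/dev/null | head -20
Expected: NXDOMAIN, or IPs belonging to the us-east-1 NLB (not us-west-2 addresses)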
- Check the nginx error logs for timeouts
kubectl logs -n prod -l app=api-gateway --since=30m --tail=-1 | grep -E 'upstream timed out|Connection timed out' | head -20
Expected: Many 'upstream timed out' entries, consistent with the timeouts seen in the error traces
- Verify whether the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names internal-services-nlb 2>&1
Expected: A 'LoadBalancerNotFound' error, confirming there is no NLB in us-west-2
- Check whether the old ALB is still running
aws elbv2 describe-load-balancers --region us-west-2 --names legacy-alb 2>&1 | jq '.LoadBalancers[].State.Code'
Expected: Returns "active", confirming the old ALB is still serving
- Inspect the DNS record as it stands after the change
aws route53 list-resource-record-sets --hosted-zone-id ZONEID --query "ResourceRecordSets[?Name=='internal-services.example.com.']"
Expected: A CNAME pointing at the new NLB's DNS name, with a TTL of 30 seconds
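To turn the log check above into a number, the grep output can be tallied rather than eyeballed. The following is a minimal sketch: the only pattern it knows is the "upstream timed out" message quoted in the incident context, and the bucket names and function names are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern taken from the nginx message quoted in the incident context.
PATTERNS = {
    "upstream_timeout": re.compile(r"upstream timed out"),
}


def tally(lines: Iterable[str]) -> Counter:
    """Count log lines per known error pattern; unmatched lines go to 'other'."""
    counts: Counter = Counter()
    for line in lines:
        label = "other"
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                label = name
                break
        counts[label] += 1
    return counts


def share(counts: Counter, label: str) -> float:
    """Fraction of all tallied lines that fall into the given bucket."""
    total = sum(counts.values())
    return counts[label] / total if total else 0.0
```

One way to use it is to save the `kubectl logs` output to a file and pass the open file object to `tally`, then compare `share(counts, "upstream_timeout")` across pods or time windows.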
Mitigation plan
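Roll back the DNS change immediately: point the CNAME for internal-services.example.com back at the old ALB and restore the TTL to 300 seconds
Risk: Some pods may keep failing while the rollback propagates, but the old ALB is still running, so the risk is low
Rollback: Reapply the current DNS record (pointing at the new NLB again), but only after the missing us-west-2 NLB has been fixed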
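Create the new NLB in us-west-2 and point internal-services.example.com at it (if rollback is not feasible)
Risk: Creating an NLB takes time (roughly 5-10 minutes), during which the outage continues; the new NLB could also be misconfigured
Rollback: Delete the us-west-2 NLB and roll DNS back to the old ALB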
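Force a refresh of the DNS cache on affected pods: restart the api-gateway pods, for example with kubectl delete pod
Risk: Restarting pods may briefly interrupt connections but speeds up recovery. The DNS record must be correct first; otherwise pods that currently work on the cached old ALB IP would lose it
Rollback: None; a restart cannot be undone, though pods can be left to recover on their own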
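For the first mitigation, the rollback can be prepared as a Route 53 change batch ahead of time so that it only needs to be applied. This is a sketch under stated assumptions: the record is a plain CNAME as described in the incident context, and `LEGACY_ALB_DNS_NAME` is a placeholder for the old ALB's DNS name, which is not given in this report.

```python
import json


def rollback_change_batch(record_name: str, target: str, ttl: int) -> dict:
    """Build a Route 53 change batch that upserts a CNAME with the given TTL."""
    return {
        "Comment": "Roll back CNAME to the previous target",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ],
    }


# LEGACY_ALB_DNS_NAME is a placeholder; substitute the real ALB DNS name.
batch = rollback_change_batch(
    record_name="internal-services.example.com.",
    target="LEGACY_ALB_DNS_NAME",
    ttl=300,
)
print(json.dumps(batch, indent=2))
```

The printed JSON can be saved to a file and applied with `aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch file://rollback.json`. If the record turns out to be an alias record rather than a CNAME, the batch would need an AliasTarget instead of TTL and ResourceRecords.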
Customer impact
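About 35% of user requests in us-west-2 fail with a 502 error; affected users see pages that fail to load or time out. p99 latency has risen to 4 s, so some users also experience slow responses. Users in us-east-1 are not affected.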
Postmortem draft
Summary
A DNS misconfiguration caused api-gateway in us-west-2 to return 502 for 35% of requests for approximately [FILL IN] minutes.
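Timeline (UTC)
- 21:40 DNS team changed the TTL for internal-services.example.com from 300s to 30s and repointed the CNAME at a new NLB (deployed only in us-east-1)
- 22:08 Pager fired
- 22:10 502 error rate confirmed at 35%
- 22:11 DNS team confirmed the change
- [FILL IN] Rollback executed
- [FILL IN] Recovered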
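Impact
35% of requests in us-west-2 failed, p99 latency reached 4s, and [FILL IN] users were affected.
Root cause
The DNS CNAME target pointed at an NLB that exists only in us-east-1. With no corresponding resource in us-west-2, DNS resolution there failed or timed out.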
Detection
Detected through error-rate alerting and nginx logs.
Response
[FILL IN] Roll back the DNS change, restart pods.
What went well
[FILL IN]
What went poorly
[FILL IN]
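Action items
- [ ] Add automated validation for DNS changes to confirm the target resource exists in every region
- [ ] Improve alerting by adding an alert on the DNS resolution failure rate
- [ ] Review the change process and require a regional consistency check before DNS changes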
Follow-ups
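- P0: Add automated validation for DNS changes that confirms the CNAME target resource exists in every relevant region (owner: platform team)
- P1: Add an alert on the DNS resolution failure rate, based on nginx logs (owner: on-call SRE)
- P1: Review and improve the change management process, requiring a regional consistency check before DNS changes (owner: platform team)
- P2: Evaluate the nginx DNS caching strategy; consider using the resolver directive with a validity period (owner: service owner)
- P2: Create a runbook for DNS-related incidents, including rollback steps and verification commands (owner: on-call SRE)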
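The P0 follow-up could start from a check as small as the one below. It is a sketch rather than a finished gate: the `inventory` mapping is a hypothetical structure that would be filled from a per-region call such as `aws elbv2 describe-load-balancers`, and the sample entries only mirror what this report states about each region.

```python
from typing import Dict, List, Set


def missing_regions(
    target: str, required_regions: List[str], inventory: Dict[str, Set[str]]
) -> List[str]:
    """Return the required regions in which the target is not deployed."""
    return [
        region
        for region in required_regions
        if target not in inventory.get(region, set())
    ]


def validate_dns_change(
    target: str, required_regions: List[str], inventory: Dict[str, Set[str]]
) -> None:
    """Raise ValueError if the new DNS target is absent from any required region."""
    missing = missing_regions(target, required_regions, inventory)
    if missing:
        raise ValueError(f"{target} is not deployed in: {', '.join(missing)}")


# Illustrative inventory reflecting the state described in this incident.
inventory = {
    "us-east-1": {"internal-services-nlb"},
    "us-west-2": {"legacy-alb"},
}
print(missing_regions("internal-services-nlb", ["us-east-1", "us-west-2"], inventory))
# ['us-west-2']
```

Run before the 21:40 change, a check like this would have flagged us-west-2 and blocked the CNAME update until the NLB existed there.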
Similar past incidents
lexical match (pg_trgm)
- 59%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 57%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 36%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 36%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.