[Eval][v1][zh] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/2/2026, 11:39:44 PM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
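After a DNS change, api-gateway in us-west-2 is returning HTTP 502 for about 35% of requests and p99 latency has risen to 4s. The DNS team lowered the TTL of internal-services.example.com from 300s to 30s and repointed its CNAME at a newly created NLB that exists only in us-east-1, so resolution fails in us-west-2.
Severity reasoning: About 35% of us-west-2 requests fail with 502 and latency has spiked, but us-east-1 is healthy, so this is not a full outage. The incident context reports no data loss or revenue loss. This matches the SEV2 definition.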
deepseek-chat·prompt v1·output: zh·8053ms·1814↑ / 1315↓ tok·$0.00194
Root cause hypotheses
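- high: The DNS change left us-west-2 unable to resolve internal-services.example.com
Evidence: nginx logs show upstream connection timeouts and the error traces are dominated by DNS resolution timeouts; us-west-2 pods that still hold the cached legacy ALB IP are working, and the rest are failing.
- high: The new NLB exists only in us-east-1, with no corresponding us-west-2 record
Evidence: The recent-change notes state that the new NLB was created in us-east-1 only and that no us-west-2 record exists; all us-east-1 pods resolve correctly to the new NLB.
- low: The shorter TTL caused a burst of concurrent lookups after cache expiry that overwhelmed DNS
Evidence: Lowering the TTL from 300s to 30s increases DNS query frequency, but the incident context reports no abnormal load on the DNS servers.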
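To put a rough number on the low-confidence TTL hypothesis, the sketch below estimates an upper bound on steady-state query rate before and after the change. It assumes every client caches the answer for exactly one TTL and then re-queries once; the fleet size `PODS` is hypothetical, since the incident context does not state a pod count.

```python
def max_query_rate(num_caching_clients: int, ttl_seconds: float) -> float:
    """Upper bound on steady-state queries per second for one record.

    Assumes each client re-queries once per TTL, so each client issues
    at most 1 / TTL queries per second.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return num_caching_clients / ttl_seconds


# Hypothetical fleet size; not taken from the incident context.
PODS = 200

before = max_query_rate(PODS, 300)  # TTL before the change
after = max_query_rate(PODS, 30)    # TTL after the change
print(f"before: {before:.2f} qps, after: {after:.2f} qps, ratio: {after / before:.0f}x")
```

Under this model the increase is tenfold regardless of fleet size, which is only a concern if the resolvers were already close to capacity. That is consistent with rating this hypothesis low.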
Investigation checklist
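- Confirm what internal-services.example.com resolves to from us-west-2
kubectl exec -n ingress -it <pod-name> -- nslookup internal-services.example.com
Expected: NXDOMAIN or a resolution error on failing pods; the logs also show some stale IPs for the legacy ALB
- Check the nginx error logs for upstream timeouts
kubectl logs -n ingress -l app=nginx --tail=100 | grep 'upstream timed out'
Expected: many 'upstream timed out' entries for requests proxied to internal-services.example.com
- Confirm that resolution works from us-east-1 (run the same command against the us-east-1 cluster)
kubectl exec -n ingress -it <pod-name> -- nslookup internal-services.example.com
Expected: the IP addresses of the new NLB
- Inspect the DNS record configuration
dig internal-services.example.com @<dns-server> +short
Expected: the new NLB's CNAME or IPs; when queried through the us-west-2 resolver there may be no record at all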
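The resolution checks above can be scripted so that each pod reports which target it currently sees. This is a minimal sketch using only the Python standard library. The IP sets are placeholders from the documentation TEST-NET ranges, because the incident context does not list the real addresses, and load balancer IPs can change over time, so a match against a fixed set is a hint rather than proof.

```python
import socket
from typing import Set

# Placeholder addresses (TEST-NET ranges), not real values from the incident.
LEGACY_ALB_IPS = {"192.0.2.10", "192.0.2.11"}
NEW_NLB_IPS = {"198.51.100.20"}


def resolve_ips(hostname: str) -> Set[str]:
    """Resolve hostname with the system resolver; return an empty set on failure."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Covers NXDOMAIN as well as other resolver failures.
        return set()
    return {info[4][0] for info in infos}


def classify(ips: Set[str],
             legacy: Set[str] = LEGACY_ALB_IPS,
             new: Set[str] = NEW_NLB_IPS) -> str:
    """Label a resolution result as one of four states."""
    if not ips:
        return "unresolvable"
    if ips <= new:
        return "new-nlb"
    if ips <= legacy:
        return "legacy-alb"
    return "mixed-or-unknown"
```

Calling `classify(resolve_ips("internal-services.example.com"))` from a pod in each region would be expected to give `unresolvable` in us-west-2 and `new-nlb` in us-east-1 if the leading hypotheses hold. Note that this reflects the system resolver at the time of the call, not whatever address a long-running nginx worker has cached.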
Mitigation plan
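Roll back the DNS change: point the internal-services.example.com CNAME back at the legacy ALB and restore the TTL to 300s
Risk: Brief DNS propagation delay, during which some requests may still fail; the legacy ALB must keep running after the rollback
Rollback: Re-apply the CNAME record for the new NLB
Temporarily create a us-west-2 DNS record that points at the legacy ALB, or edit the nginx configuration to hard-code the legacy ALB IP
Risk: A hard-coded IP is easy to miss in later changes and ALB IP addresses are not static; the legacy ALB must also still have enough capacity
Rollback: Restore the nginx configuration to use DNS resolution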
Customer impact
About 35% of requests in us-west-2 return HTTP 502, and p99 latency has risen to 4s. us-east-1 users are unaffected. Recovery is expected within 5-10 minutes of the DNS rollback.
Postmortem draft
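Postmortem: api-gateway 502 incident
Summary
A DNS change caused about 35% of requests in us-west-2 to fail.
Timeline
- 21:40 UTC: DNS team changes the TTL and CNAME of internal-services.example.com
- 22:08 UTC: Pager alert received
- 22:10 UTC: Incident confirmed
- 22:11 UTC: DNS team confirms the planned change
- 22:12 UTC: Debate over reverting DNS or hot-patching us-west-2; rollback start time to be confirmed
- 22:20 UTC: Recovery (projected; not confirmed in the incident context)
Impact
- 35% of requests return 502, p99 latency 4s
- us-west-2 only
Root cause
After the DNS change, us-west-2 could not resolve the new NLB, which exists only in us-east-1.
What went well
- The DNS change was quickly identified as the trigger
- The rollback decision was made decisively (to be confirmed against the 22:12 discussion)
What went poorly
- The DNS change did not account for multiple regions
- The new NLB's availability in every consuming region was not verified before the change
Action items
- Add automated validation for DNS changes
- Ensure a new service is ready in all regions before it is rolled out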
Follow-ups
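- P1: Add a multi-region verification step to the DNS change process (owner: platform team)
- P1: Add a DNS resolution health check alert for api-gateway (owner: on-call SRE)
- P2: Audit all DNS records to ensure every CNAME target is available in all regions (owner: service owners)
- P2: Write a DNS change rollback runbook (owner: platform team)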
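The multi-region verification step in the first follow-up could start as a simple pre-change gate: compare the regions where the new target exists with the regions that consume the record. The sketch below is one possible shape, not an existing tool. `DnsChange`, `missing_regions`, and the `new-nlb` label are hypothetical names introduced for illustration; the region sets mirror this incident.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class DnsChange:
    record: str                # DNS name being changed
    new_target: str            # label for the new CNAME target (placeholder)
    target_regions: Set[str]   # regions where the new target actually exists


def missing_regions(change: DnsChange, consumer_regions: Set[str]) -> List[str]:
    """Return consumer regions that have no instance of the new target."""
    return sorted(consumer_regions - change.target_regions)


# Values mirroring this incident: the new NLB exists in us-east-1 only.
change = DnsChange(
    record="internal-services.example.com",
    new_target="new-nlb",
    target_regions={"us-east-1"},
)
gaps = missing_regions(change, {"us-east-1", "us-west-2"})
if gaps:
    print(f"BLOCK: {change.record} has no target in: {', '.join(gaps)}")
```

With the inputs shown, the gate reports us-west-2 as a gap and the change would be blocked before the CNAME is repointed. In practice the region sets would need to come from an inventory or the cloud provider's API rather than being hard-coded.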
Similar past incidents
lexical match (pg_trgm)
- 63%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 54%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 47%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 46%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 20%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts