[Eval][v1][zh] Regional 5xx spike after DNS TTL change

service: api-gateway · created: 6/3/2026, 12:01:48 AM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV2

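In us-west-2, a DNS change caused api-gateway to return HTTP 502 for 35% of requests and pushed p99 latency to 4 s. The DNS team lowered the TTL of internal-services.example.com from 300s to 30s and repointed its CNAME to a new NLB deployed only in us-east-1, so lookups from us-west-2 either fail or return stale IPs.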

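Severity reasoning: 35% of user requests are affected, but only in us-west-2, so this is not a global outage; there is no data loss or corruption; and a DNS rollback or configuration fix should restore service quickly.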

deepseek-chat·prompt v1·output: zh·10399ms·1814↑ / 1312↓ tok·$0.00193

Root cause hypotheses

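  • high: The DNS CNAME points to an NLB that exists only in us-east-1. us-west-2 has no corresponding record, so resolution fails.

    Evidence: nginx logs in us-west-2 show NXDOMAIN or timeouts; us-east-1 is healthy.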

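  • medium: Lowering the TTL from 300s to 30s makes cached DNS entries expire quickly, while the new record has not propagated correctly to the us-west-2 DNS resolvers.

    Evidence: The change was made at 21:40 and the 35% failure rate persists at 22:10; some pods still work because they hold the old ALB IP.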

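  • low: The new NLB's health checks are misconfigured, so us-west-2 traffic is routed to unhealthy endpoints.

    Evidence: No direct evidence, but the 502s coincide with connection timeouts, so the target group may be misconfigured.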
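
The first two hypotheses can be told apart by looking at what each pod actually resolves. The Python sketch below is one way to bucket resolution results into "new NLB", "legacy ALB", and "unresolved". The function names, the IP sets, and the sample pod data are hypothetical placeholders, not values from this incident.

```python
import socket
from collections import Counter


def resolve(hostname):
    """Return the sorted IPv4 addresses for hostname, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # Covers NXDOMAIN and other resolver failures.
        return []
    return sorted({info[4][0] for info in infos})


def classify_resolution(addresses, legacy_ips, new_ips):
    """Bucket one resolution result. legacy_ips/new_ips are assumed known sets."""
    if not addresses:
        return "unresolved"
    addrs = set(addresses)
    if addrs <= set(new_ips):
        return "new-nlb"
    if addrs <= set(legacy_ips):
        return "legacy-alb"
    return "unknown"


def summarize(results_by_pod, legacy_ips, new_ips):
    """Count how many pods fall into each bucket."""
    return Counter(
        classify_resolution(addrs, legacy_ips, new_ips)
        for addrs in results_by_pod.values()
    )


# Hypothetical sample data: addresses each pod reported for the hostname.
LEGACY_ALB_IPS = {"10.0.1.10"}
NEW_NLB_IPS = {"10.0.9.20"}
SAMPLE = {"pod-a": ["10.0.1.10"], "pod-b": [], "pod-c": ["10.0.9.20"]}
```

In practice `resolve("internal-services.example.com")` would be run from inside each pod and the results passed to `summarize`; a large "unresolved" bucket in us-west-2 alone would support the first hypothesis.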

Investigation checklist

  1. Confirm what us-west-2 resolves
    kubectl exec -n ingress -it <nginx-pod> -- nslookup internal-services.example.com 2>/dev/null || dig internal-services.example.com @<dns-server>

    Expected: Should return the IP of a us-west-2 NLB; NXDOMAIN or an old ALB IP confirms the DNS problem.

  2. Check the new NLB's target group configuration in us-west-2
    aws elbv2 describe-target-groups --region us-west-2 --names <new-nlb-tg-name>

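    Expected: The target group should exist and contain healthy instances (target health itself is reported by `aws elbv2 describe-target-health`); if the group does not exist, the NLB was never deployed to us-west-2.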

  3. Verify the DNS change record
    aws route53 list-resource-record-sets --hosted-zone-id <zone-id> --query "ResourceRecordSets[?Name=='internal-services.example.com.']"

    Expected: The CNAME should point to the DNS name of a us-west-2 NLB; if it points to us-east-1, the problem is confirmed.

  4. Check the nginx upstream configuration
    kubectl exec -n ingress -it <nginx-pod> -- cat /etc/nginx/conf.d/upstream.conf | grep internal-services

    Expected: The upstream should reference the hostname rather than a hard-coded IP; if it uses an IP, it must be updated.
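
The Route 53 check in step 3 can be automated by parsing the JSON that `aws route53 list-resource-record-sets` prints. The sketch below assumes the CNAME target is an AWS-generated load balancer DNS name that carries the region as its own label; the sample record, including the load balancer name, is hypothetical.

```python
import json
import re

# A full DNS label that looks like an AWS region, e.g. "us-east-1".
REGION_LABEL = re.compile(r"^[a-z]{2}(?:-[a-z]+)+-\d+$")


def target_region(dns_name):
    """Return the region label found in a load balancer DNS name, or None."""
    labels = dns_name.rstrip(".").split(".")
    # Skip the first label: it is the load balancer's own name.
    for label in labels[1:]:
        if REGION_LABEL.match(label):
            return label
    return None


def check_cname(record_sets, expected_region):
    """List CNAME targets whose region differs from expected_region."""
    findings = []
    for record_set in record_sets:
        if record_set.get("Type") != "CNAME":
            continue
        for record in record_set.get("ResourceRecords", []):
            region = target_region(record["Value"])
            if region != expected_region:
                findings.append(
                    f"{record_set['Name']} -> {record['Value']} "
                    f"(region {region}, expected {expected_region})"
                )
    return findings


# Hypothetical output of the step 3 command (the NLB name is made up).
SAMPLE_JSON = """
[{"Name": "internal-services.example.com.", "Type": "CNAME", "TTL": 30,
  "ResourceRecords": [{"Value": "new-nlb-0123456789abcdef.elb.us-east-1.amazonaws.com"}]}]
"""

for finding in check_cname(json.loads(SAMPLE_JSON), "us-west-2"):
    print(finding)
```

An empty findings list means every CNAME target already sits in the expected region.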

Mitigation plan

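  • Roll back the DNS change immediately: point the internal-services.example.com CNAME back at the old ALB and restore the TTL to 300s.

    Risk: DNS propagation delay during the rollback may briefly worsen the failure; the old ALB is still running, so the risk is low.

    Rollback: Reapply the CNAME record for the new NLB.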

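  • If rollback is not feasible, create an NLB in us-west-2 and update the DNS record to point at it.

    Risk: Creating an NLB takes time and may introduce new configuration errors; security groups and health checks must be verified.

    Rollback: Delete the new NLB and roll DNS back to the old ALB.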
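
If the zone is hosted in Route 53, as the checklist commands suggest, the DNS rollback can be prepared as a change batch. The sketch below builds one in Python; it assumes the record is a plain CNAME (not an alias record), and `<legacy-alb-dns-name>` is a placeholder that must be replaced with the real ALB DNS name.

```python
import json


def build_rollback_batch(record_name, legacy_target, ttl=300):
    """Build a Route 53 change batch that repoints a CNAME and restores its TTL."""
    return {
        "Comment": "Roll back CNAME to legacy ALB and restore TTL",
        "Changes": [
            {
                # UPSERT overwrites the existing record set in place.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": legacy_target}],
                },
            }
        ],
    }


batch = build_rollback_batch(
    "internal-services.example.com.", "<legacy-alb-dns-name>"
)
print(json.dumps(batch, indent=2))
```

The printed JSON can be saved to a file and applied with `aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch file://rollback.json`, where `rollback.json` is whatever file name was chosen.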

Customer impact

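About 35% of user requests in us-west-2 fail with 502 errors, and affected users cannot use the service normally. Other regions are unaffected. Recovery is expected within 5-10 minutes after the DNS rollback takes effect.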

Postmortem draft

Postmortem

Summary

A DNS misconfiguration caused 35% of api-gateway requests in us-west-2 to fail.

Timeline

  • 21:40 DNS team changes the TTL and CNAME
  • 22:08 Alert fires
  • 22:10 Incident confirmed
  • 22:XX DNS rolled back
  • 22:XX Recovered

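Impact

About 35% of requests in us-west-2, for roughly XX minutes.

Root cause

The DNS CNAME points to an NLB that exists only in us-east-1; us-west-2 has no corresponding record.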

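What went well

  • Fast alerting and response
  • Timely communication from the DNS team

What went poorly

  • The change was not validated in us-west-2
  • No pre-release check was in place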

Action items

  • [ ] Add cross-region validation for DNS changes
  • [ ] Add DNS resolution monitoring
  • [ ] Improve the change process
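
As one possible shape for the DNS resolution monitoring action item, the Python sketch below keeps a sliding window of probe outcomes and flags when the failure ratio crosses a threshold. The class name, window size, and threshold are illustrative assumptions, not agreed values.

```python
from collections import deque


class ResolutionMonitor:
    """Track recent DNS probe outcomes and decide when to alert."""

    def __init__(self, window=20, threshold=0.05):
        # Only the most recent `window` probe results are kept.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, succeeded):
        """Record one probe result (True if the name resolved)."""
        self.samples.append(bool(succeeded))

    def failure_ratio(self):
        """Fraction of failed probes in the current window (0.0 if empty)."""
        if not self.samples:
            return 0.0
        failures = sum(1 for ok in self.samples if not ok)
        return failures / len(self.samples)

    def should_alert(self):
        """Alert only once the window is full, to avoid noise at startup."""
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and self.failure_ratio() > self.threshold


# Example: 3 failures in a window of 10 with a 20% threshold.
monitor = ResolutionMonitor(window=10, threshold=0.2)
for outcome in [True] * 7 + [False] * 3:
    monitor.record(outcome)
print(monitor.should_alert())
```

Running one such monitor per region would have surfaced a failure confined to us-west-2 without waiting for the 502 rate to page.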

Follow-ups

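  • P1 · Add a cross-region validation step for DNS changes to the change process · platform team
  • P1 · Add monitoring and alerting for DNS resolution failures · on-call SRE
  • P2 · Review the DNS change process and add a pre-release checklist · service owner
  • P2 · Consider a service mesh or local DNS caching to reduce dependence on DNS · platform team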
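
The cross-region validation follow-up could start as a simple pre-change gate. The Python sketch below compares the regions that consume a hostname with the regions where the new target has a record, and reports the gaps. Both lists are assumed inputs supplied by the change author, and the function names are hypothetical.

```python
def regions_missing_record(consumer_regions, record_regions):
    """Return consumer regions that have no record for the new target."""
    return sorted(set(consumer_regions) - set(record_regions))


def validate_dns_change(hostname, consumer_regions, record_regions):
    """Return a list of blocking problems; an empty list means the gate passes."""
    return [
        f"{hostname}: no record resolvable in {region}"
        for region in regions_missing_record(consumer_regions, record_regions)
    ]


# Example mirroring this incident: consumers in two regions, target in one.
problems = validate_dns_change(
    "internal-services.example.com",
    consumer_regions=["us-east-1", "us-west-2"],
    record_regions=["us-east-1"],
)
for problem in problems:
    print(problem)
```

A change pipeline could refuse to apply the CNAME update while this list is non-empty.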