[Eval][v3][zh] Regional 5xx spike after DNS TTL change

service: api-gateway · created: 6/3/2026, 12:05:12 AM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

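In us-west-2, api-gateway returned 502 for 35% of requests after a DNS change, and p99 latency rose to 4 s. Root cause: the DNS team repointed the CNAME for internal-services.example.com to a new NLB that is deployed only in us-east-1, so lookups in us-west-2 either failed or returned stale IPs.

Severity reasoning: User-visible outage: 35% of requests return 502, which exceeds the 1% threshold and has lasted more than 5 minutes; the 4 s p99 latency seriously degrades the user experience. This meets the SEV1 definition.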
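
The severity reasoning reduces to two threshold checks, which can be sketched as a small shell helper. This is a minimal sketch under stated assumptions: the function name `is_sev1` is hypothetical, and it encodes only the two thresholds quoted above (error rate above 1%, duration above 5 minutes); a real severity policy may weigh other criteria.

```shell
# Hypothetical helper: applies only the two thresholds quoted in the
# severity reasoning (error rate > 1%, duration > 5 minutes).
# Usage: is_sev1 <error_rate_percent> <duration_minutes>
is_sev1() {
  # awk does the comparison so fractional rates such as the 0.05% baseline work.
  # Exit status 0 means "meets SEV1", 1 means "does not".
  awk -v r="$1" -v m="$2" 'BEGIN { exit !(r > 1 && m > 5) }'
}
```

For example, `is_sev1 35 6 && echo SEV1` prints `SEV1`, where 35 is the observed 502 rate and 6 is an illustrative duration in minutes rather than a value from the incident record.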

deepseek-chat·prompt v3·output: zh·11081ms·2446↑ / 1539↓ tok·$0.00235

Root cause hypotheses

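  • high: The DNS change left us-west-2 unable to resolve internal-services.example.com

    Evidence: nginx logs show 'upstream timed out' and DNS resolution timeouts; 35% of requests fail in us-west-2 while us-east-1 is healthy.

  • high: The new NLB is deployed only in us-east-1; no us-west-2 record exists

    Evidence: The change record shows the new NLB was created in us-east-1 only; lookups in us-west-2 return NXDOMAIN or stale IPs.

  • medium: Lowering the DNS TTL from 300 s to 30 s made cached records expire sooner, amplifying the failure

    Evidence: The change record shows the lower TTL, so more pods picked up the invalid DNS record quickly.

  • medium: The old ALB's DNS records were not removed, so some pods keep using stale IPs

    Evidence: About 30% of pods still work, which indicates they have cached the old ALB's IP.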

Investigation checklist

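  1. Check the nginx upstream error logs in us-west-2
    kubectl logs -n prod -l app=api-gateway --since=30m --tail=-1 | grep -E 'upstream timed out|Connection timed out' | head -20

    Expected: many 'upstream timed out' errors, consistent with the DNS resolution problem. (--tail=-1 is needed because kubectl returns only the last 10 lines per pod when a label selector is used.)

  2. Verify what DNS returns
    dig internal-services.example.com @<us-west-2-dns-server> +short

    Expected: NXDOMAIN or a wrong IP rather than the correct NLB IP (with +short, NXDOMAIN shows up as empty output)

  3. Check whether the new NLB exists in us-west-2
    aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name

    Expected: an empty result or an error, confirming there is no NLB in us-west-2

  4. Check the old ALB's status
    aws elbv2 describe-load-balancers --region us-west-2 --names old-alb-name

    Expected: the old ALB's details are returned, confirming it is still running

  5. Check DNS caching across pods
    # kubectl exec targets one pod at a time (it has no -l selector), so loop over the pods
    for p in $(kubectl get pods -n prod -l app=api-gateway -o name | head -10); do kubectl exec -n prod "$p" -- nslookup internal-services.example.com 2>/dev/null; done

    Expected: some pods return the old IP, others return NXDOMAIN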
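
To turn the per-pod DNS check into counts, the lookups can be classified and tallied. The following is a rough sketch, not a vetted runbook script: `classify_answer` and `tally_pods` are hypothetical names, it assumes `dig` is available inside the api-gateway container image, and it takes a single old ALB IP as input even though an ALB normally has several.

```shell
# classify_answer <dig +short output> <old ALB IP>
#   empty output or a ";;" diagnostic line -> "unresolved" (NXDOMAIN prints nothing with +short)
#   a line exactly equal to the old IP      -> "stale"
#   anything else                           -> "other"
classify_answer() {
  local answer="$1" old_ip="$2"
  case "$answer" in
    ""|";;"*) echo "unresolved"; return ;;
  esac
  if printf '%s\n' "$answer" | grep -qxF -- "$old_ip"; then
    echo "stale"
  else
    echo "other"
  fi
}

# tally_pods <old ALB IP>: run the lookup in every api-gateway pod and
# count how many pods fall under each label.
# Assumption: the container image ships dig; swap in another resolver tool if not.
tally_pods() {
  local old_ip="$1" pod answer
  for pod in $(kubectl get pods -n prod -l app=api-gateway -o name); do
    answer="$(kubectl exec -n prod "$pod" -- dig +short internal-services.example.com 2>/dev/null)"
    classify_answer "$answer" "$old_ip"
  done | sort | uniq -c
}
```

Calling `tally_pods <old-alb-ip>` against the us-west-2 cluster prints one count per label. Note that this reflects what each pod's resolver returns at that moment, which may differ from an address nginx is still holding in memory.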

Mitigation plan

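  • Roll back the DNS change immediately: point the internal-services.example.com CNAME back at the old ALB and restore the TTL to 300 s

    Risk: DNS propagation delay during the rollback may briefly affect some requests; the old ALB is still running, so the risk is low

    Rollback: Reapply the current DNS configuration (pointing at the new NLB)

  • Create a new NLB in us-west-2, configure its target groups, then update DNS to point at it

    Risk: Creating the NLB takes time, and the outage continues in the meantime; a misconfiguration could introduce new problems

    Rollback: Delete the new NLB and point DNS back at the old ALB

  • Temporarily change the us-west-2 nginx configuration so the upstream points at the old ALB's IP address (hard-coded)

    Risk: This bypasses DNS, but the old ALB may be decommissioned later; the configuration change requires restarting nginx

    Rollback: Restore the nginx configuration to use DNS resolution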
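
If the DNS rollback is chosen, its effect can be confirmed by polling the record until the CNAME points at the old ALB again. This is a sketch under assumptions: `cname_matches` and `wait_for_rollback` are hypothetical helpers, the resolver address and the old ALB hostname are placeholders to be supplied by the operator, and the 5-second poll interval is arbitrary.

```shell
# cname_matches <answer> <expected>: true when a `dig +short` CNAME answer
# equals the expected target, ignoring the trailing root dot.
cname_matches() {
  local answer="${1%.}" expected="${2%.}"
  [ -n "$answer" ] && [ "$answer" = "$expected" ]
}

# wait_for_rollback <resolver> <expected CNAME target> [tries]
# Polls the given resolver until the CNAME matches, or gives up after
# `tries` attempts (default 20). Returns 0 on success, 1 on timeout.
wait_for_rollback() {
  local resolver="$1" expected="$2" tries="${3:-20}" answer
  while [ "$tries" -gt 0 ]; do
    answer="$(dig @"$resolver" internal-services.example.com CNAME +short)"
    if cname_matches "$answer" "$expected"; then
      return 0
    fi
    sleep 5
    tries=$((tries - 1))
  done
  return 1
}
```

A call such as `wait_for_rollback <us-west-2-dns-server> <old-alb-hostname>` would be run right after the record is reverted; a non-zero exit suggests the rollback has not reached that resolver yet.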

Customer impact

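About 35% of user requests in us-west-2 fail with 502 errors; the remaining requests see significantly higher latency (p99 of 4 s). Affected users cannot use the service normally.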

Postmortem draft

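Summary

[FILL IN] Time: detected at 22:10 UTC, resolved at [FILL IN]. api-gateway in us-west-2 failed 35% of requests because of a DNS misconfiguration.

Timeline (UTC)

  • 21:40 DNS team modifies the internal-services.example.com record
  • 22:08 Alert fires
  • 22:10 Outage confirmed
  • [FILL IN] DNS rolled back
  • [FILL IN] Recovery

Impact

35% of requests in us-west-2 returned 502 with a p99 latency of 4 s, for about [X] minutes.

Root cause

The DNS change pointed the CNAME at a new NLB deployed only in us-east-1, so resolution failed in us-west-2.

Detection

The alert fired at 22:08, 28 minutes after the 21:40 change, so detection was delayed.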
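
The 28-minute detection delay follows directly from the two timeline entries. A small sketch of the arithmetic, assuming both timestamps fall on the same UTC day; `to_minutes` and `minutes_between` are hypothetical helper names.

```shell
# to_minutes HH:MM -> minutes since midnight
to_minutes() {
  h="${1%:*}"; m="${1#*:}"
  # Strip one leading zero so values like "08" are not read as octal.
  h="${h#0}"; m="${m#0}"
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# minutes_between <start HH:MM> <end HH:MM>, same day only
minutes_between() {
  echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}

minutes_between 21:40 22:08   # DNS change -> alert; prints 28
```

The same helper can fill in the [X] duration once the recovery time is known.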

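Response

[FILL IN] Rollback decision and execution details.

What went well

  • The DNS team confirmed the change promptly
  • The alerting system worked

What went poorly

  • The DNS change was not validated across regions
  • The old ALB was not decommissioned promptly, which caused confusion
  • The change was not communicated to the SRE team

Action items

  • [ ] Establish a cross-region validation process for DNS changes
  • [ ] Add an alert for DNS resolution failures
  • [ ] Run a rollback drill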

Follow-ups

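  • P0 · Establish a cross-region validation process for DNS changes so that new resources are available in every region · Platform team
  • P1 · Add an alert for DNS resolution failures that monitors the upstream resolution error rate · on-call SRE
  • P1 · Clean up the old ALB's DNS records to avoid confusion · Infrastructure team
  • P2 · Review the change management process and require that DNS changes be communicated to the SRE team · Platform team
  • P2 · Drill the DNS rollback procedure so it can be executed quickly · on-call SRE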
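
One possible shape for the P0 cross-region validation is a pre-change check that queries a resolver in each region and reports any region where the name does not resolve. This is an illustrative sketch only: `regions_without_answer` and `check_all_regions` are hypothetical names, the `region=resolver` pairs are placeholders, and the actual validation process is for the platform team to define.

```shell
# regions_without_answer: reads "region<TAB>answer" lines on stdin and
# prints every region whose answer field is empty.
regions_without_answer() {
  awk -F '\t' '$2 == "" { print $1 }'
}

# check_all_regions <name> <region=resolver>...
# Queries each region's resolver for <name> and prints the regions that
# returned no answer. Resolver addresses are supplied by the caller.
check_all_regions() {
  name="$1"; shift
  for pair in "$@"; do
    region="${pair%%=*}"; resolver="${pair#*=}"
    # Drop dig's ";;" diagnostic lines so a timeout counts as "no answer".
    answer="$(dig @"$resolver" "$name" +short | grep -v '^;' | head -n 1)"
    printf '%s\t%s\n' "$region" "$answer"
  done | regions_without_answer
}
```

Run before a change as, for example, `check_all_regions internal-services.example.com us-east-1=<resolver-ip> us-west-2=<resolver-ip>`; any region name in the output is a reason to hold the change.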