
[Eval][v2][zh] Regional 5xx spike after DNS TTL change

service: api-gateway
created: 5/25/2026, 10:23:20 PM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

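At 21:40 UTC the DNS team repointed the CNAME for internal-services.example.com to a new NLB that is deployed only in us-east-1. As a result, api-gateway in us-west-2 could not resolve the service address: 35% of requests returned 502 and p99 latency rose to 4s. About 30% of pods kept working because they had cached the legacy ALB IP; the rest failed with NXDOMAIN or timeouts.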

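Severity reasoning: The user-visible 502 rate of 35% is far above the 1% threshold and has lasted more than 5 minutes. This is a user-facing outage and meets the SEV1 definition.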
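The severity rule quoted in the reasoning can be sketched as a small check. This is a minimal sketch, assuming only the two conditions named there (error rate above 1%, duration above 5 minutes); the function name `is_sev1`, its parameters, and the 6-minute duration in the usage line are hypothetical, and the real severity policy may include further conditions.

```python
def is_sev1(error_rate: float, duration_minutes: float,
            rate_threshold: float = 0.01, min_minutes: float = 5.0) -> bool:
    """Return True when the user-visible error rate exceeds the threshold
    for longer than the minimum duration.

    Thresholds default to the values quoted in the severity reasoning
    (1% and 5 minutes); both are assumptions about the actual policy.
    """
    return error_rate > rate_threshold and duration_minutes > min_minutes


# 35% 502 rate in us-west-2; the 6-minute duration is illustrative only.
print(is_sev1(error_rate=0.35, duration_minutes=6))
```

The baseline 502 rate of 0.05% would not trip this check regardless of duration, which matches the intent that only a sustained, user-facing error rate qualifies.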

deepseek-chat·prompt v2·output: zh·11271ms

Root cause hypotheses

  • high: The DNS CNAME record points to an NLB that exists only in us-east-1; us-west-2 has no corresponding record, so resolution fails.

    Evidence: nginx logs show 'upstream timed out (110: Connection timed out) while connecting to upstream', and DNS queries return NXDOMAIN.

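  • medium: The TTL change from 300s to 30s made caches expire quickly, but the new record did not propagate correctly to the DNS resolvers in us-west-2.

    Evidence: The TTL change was made at 21:40 UTC, yet at 22:10 about 30% of pods were still using the old cached answer. This suggests some caches had not expired while the new record had not taken effect.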

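  • low: The api-gateway pods in us-west-2 have a misconfigured DNS resolver and cannot look up the A record for the new NLB.

    Evidence: All us-east-1 pods work, so DNS configuration is correct there; us-west-2 may be using a different upstream DNS server.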

Investigation checklist

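  1. Check DNS resolution from an api-gateway pod in us-west-2
    kubectl get pods -n prod -l app=api-gateway -o name | head -1 | xargs -I{} kubectl exec -n prod {} -- nslookup internal-services.example.com 2>/dev/null | head -20

    Expected: A healthy result is the IP of the new NLB. NXDOMAIN or the legacy ALB IP confirms the problem.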

  2. Verify whether the new NLB exists in us-west-2
    aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name 2>&1

    Expected: A 'LoadBalancerNotFound' error confirms the NLB is not deployed in us-west-2.

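  3. Check the DNS configuration and cache state of an api-gateway pod in us-west-2
    kubectl get pods -n prod -l app=api-gateway -o name | head -1 | xargs -I{} kubectl exec -n prod {} -- cat /etc/resolv.conf

    Expected: Confirms which DNS servers are configured. Also check cache lifetimes (for example, if nscd is in use).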

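  4. Compare DNS resolution between us-east-1 and us-west-2 (run once against each cluster)
    kubectl get pods -n prod -l app=api-gateway -o name | head -1 | xargs -I{} kubectl exec -n prod {} -- nslookup internal-services.example.com 2>/dev/null | grep Address

    Expected: us-east-1 should return the new NLB IP; us-west-2 should return the legacy ALB IP or NXDOMAIN.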
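Once per-pod lookup results have been collected, they can be bucketed to see how many pods fall into each state. The sketch below is illustrative only: the function names, the `None`-means-NXDOMAIN convention, and every IP address are hypothetical placeholders, not values from this incident.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def classify_resolution(answer: Optional[str],
                        legacy_ips: Set[str],
                        new_ips: Set[str]) -> str:
    """Bucket one pod's lookup result. None stands for NXDOMAIN."""
    if answer is None:
        return "nxdomain"
    if answer in legacy_ips:
        return "legacy_alb"
    if answer in new_ips:
        return "new_nlb"
    return "unknown"


def summarize(answers: Iterable[Optional[str]],
              legacy_ips: Set[str],
              new_ips: Set[str]) -> Counter:
    """Count how many pods fall into each resolution bucket."""
    return Counter(classify_resolution(a, legacy_ips, new_ips) for a in answers)


# Hypothetical addresses and sample results, for illustration only.
LEGACY_ALB_IPS = {"10.20.0.15"}
NEW_NLB_IPS = {"10.10.0.7"}
sample = ["10.20.0.15", None, None, "10.20.0.15", None]
print(summarize(sample, LEGACY_ALB_IPS, NEW_NLB_IPS))
```

A split dominated by `nxdomain` and `legacy_alb` in us-west-2, against all `new_nlb` in us-east-1, would be consistent with the high-confidence hypothesis.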

Mitigation plan

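  • Roll back the DNS change immediately: point the CNAME for internal-services.example.com back at the legacy ALB and restore the TTL to 300s.

    Risk: DNS propagation delay during the rollback may leave some users seeing 502s, but the affected scope will shrink steadily. No risk of data loss.

    Rollback: Re-apply the current (faulty) DNS record pointing at the us-east-1 NLB, but only after an NLB has been deployed in us-west-2.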

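  • Create a new NLB in us-west-2 and point the DNS record at it.

    Risk: Creating an NLB takes several minutes, during which 502 errors continue. Security groups and health checks must be configured correctly.

    Rollback: Delete the newly created NLB and roll DNS back to the legacy ALB.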

Customer impact

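About 35% of requests in us-west-2 are failing with HTTP 502. Affected users are estimated at 15-20% of the global user base; all other users are unaffected. Recovery is expected within 15 minutes of the DNS rollback.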

Postmortem draft

Summary

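At 21:40 UTC the DNS team repointed the CNAME for internal-services.example.com to a new NLB that is deployed only in us-east-1. As a result, api-gateway in us-west-2 could not resolve the service address: 35% of requests returned 502 and p99 latency rose to 4s. About 30% of pods kept working because they had cached the legacy ALB IP; the rest failed with NXDOMAIN or timeouts.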

Timeline (UTC)

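  • 21:40 - DNS team changes the CNAME and TTL for internal-services.example.com
  • 22:08 - Pager fires
  • 22:10 - 502 error rate confirmed at 35%
  • 22:11 - DNS team confirms the change
  • 22:12 - Rollback options discussed
  • [FILL IN] - DNS rollback completed
  • [FILL IN] - Error rate recovered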

Impact

  • 35% of requests in us-west-2 failed for about [FILL IN] minutes
  • p99 latency rose from a baseline of [FILL IN] to 4s
  • Estimated number of affected users: [FILL IN]

Root Cause

The DNS CNAME record pointed to an NLB that exists only in us-east-1; us-west-2 had no corresponding record, so resolution failed.

Detection

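  • Monitoring alert: the 502 error rate crossed its threshold and triggered the pager
  • Detection delay: about 28 minutes from the change to the alert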
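The 28-minute detection delay follows directly from the timeline (change at 21:40 UTC, pager at 22:08 UTC). A minimal sketch of that arithmetic, assuming both timestamps fall on the same UTC day; the helper name `minutes_between` is hypothetical.

```python
from datetime import datetime


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return int(delta.total_seconds() // 60)


# DNS change at 21:40 UTC, pager at 22:08 UTC.
detection_delay = minutes_between("21:40", "22:08")
print(detection_delay)  # 28
```

The same helper can fill in the remaining [FILL IN] durations once the rollback and recovery timestamps are known.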

Response

  • 22:08 - Alert received
  • 22:11 - DNS change confirmed as the root cause
  • [FILL IN] - DNS rollback executed

What Went Well

  • Monitoring alerts were timely
  • The DNS team responded quickly

What Went Poorly

  • The DNS change was not validated against the multi-region deployment
  • The change was not communicated to the SRE team

Action Items

  • [FILL IN] Add automated validation steps for DNS changes
  • [FILL IN] Create a multi-region DNS deployment checklist

Follow-ups

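  • P0: Add automated validation steps for DNS changes to ensure the CNAME target exists in every region (Platform team)
  • P1: Create a multi-region DNS deployment checklist that requires all regions to be verified before a change (SRE team)
  • P2: Tune alert thresholds to reduce detection delay (Monitoring team)
  • P1: Review the DNS change process and add an SRE approval step (Platform team)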
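The P0 follow-up (confirm the CNAME target exists in every region before a change is applied) could look roughly like the check below. This is a sketch under stated assumptions, not the platform team's design: the function name, the inventory structure, and the load balancer names `new-nlb` and `legacy-alb` are all hypothetical, and a real check would read the inventory from the cloud provider rather than a literal dict.

```python
from typing import Dict, List, Set


def regions_missing_target(target: str,
                           required_regions: List[str],
                           deployed: Dict[str, Set[str]]) -> List[str]:
    """Return the required regions in which the CNAME target is not deployed."""
    return [region for region in required_regions
            if target not in deployed.get(region, set())]


# Hypothetical inventory: region -> load balancers deployed there.
deployed = {
    "us-east-1": {"new-nlb", "legacy-alb"},
    "us-west-2": {"legacy-alb"},
}

missing = regions_missing_target("new-nlb", ["us-east-1", "us-west-2"], deployed)
if missing:
    print(f"Blocking DNS change: target missing in {missing}")
```

Run as a pre-change gate, a non-empty result would block the CNAME update until the target is deployed in every listed region.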