[Eval][v3][zh] Regional 5xx spike after DNS TTL change
service: api-gateway · created: 6/2/2026, 11:40:45 PM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
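In us-west-2, api-gateway is returning 502 for about 35% of requests and p99 latency has risen to 4 s following a DNS change. The DNS team lowered the TTL of internal-services.example.com from 300 s to 30 s and repointed its CNAME to a new NLB that exists only in us-east-1, so pods in us-west-2 cannot resolve the name to a valid IP.
Severity reasoning: User-visible outage: 35% of requests return 502, far above the 1% threshold. Only about 2 minutes have elapsed between 22:10 and 22:12 (4 minutes since the 22:08 page), so the 5-minute duration criterion is not yet met, but it is expected to be exceeded because no mitigation is in place. A p99 latency of 4 s severely degrades user experience. Treat as SEV1 (user-facing failure, error rate >1% for more than 5 minutes).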
deepseek-chat·prompt v3·output: zh·12599ms·2446↑ / 1795↓ tok·$0.00263
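The severity rule quoted above (error rate above 1% for more than 5 minutes) can be checked mechanically. The following is a minimal sketch; the function and constant names are hypothetical, and the thresholds are taken from the severity reasoning rather than from any documented policy.

```python
from datetime import datetime, timedelta

# Thresholds quoted in the severity reasoning; names are hypothetical.
SEV1_ERROR_RATE = 0.01                      # error rate above 1%
SEV1_MIN_DURATION = timedelta(minutes=5)    # sustained for more than 5 minutes


def sev1_status(error_rate: float, started: datetime, now: datetime) -> str:
    """Return 'met', 'pending' (rate exceeded, duration not yet), or 'not met'."""
    if error_rate <= SEV1_ERROR_RATE:
        return "not met"
    if now - started > SEV1_MIN_DURATION:
        return "met"
    return "pending"


# The calendar date is arbitrary; only the clock times come from the incident.
day = datetime(2026, 6, 2)
at_2212 = sev1_status(0.35, day.replace(hour=22, minute=10), day.replace(hour=22, minute=12))
at_2216 = sev1_status(0.35, day.replace(hour=22, minute=10), day.replace(hour=22, minute=16))
print(at_2212)  # pending
print(at_2216)  # met
```

At 22:12 the rate criterion is satisfied but the duration criterion is still pending, which is why the classification rests on the expectation that the outage continues.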
Root cause hypotheses
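- high: The CNAME target is an NLB that exists only in us-east-1; us-west-2 has no corresponding record, so resolution fails there.
Evidence: Logs show nginx resolving internal-services.example.com to NXDOMAIN or stale IPs; the new NLB was created only in us-east-1.
- high: Lowering the TTL from 300 s to 30 s makes cached answers expire quickly, which widens the blast radius of the resolution failures.
Evidence: The change log shows the TTL was modified at 21:40 and the failure was observed at 22:10; with a 30 s TTL, cached entries expire quickly.
- high: The old ALB is still up and its DNS records were never removed, so pods that cached its IP keep working while pods that re-resolve fail.
Evidence: About 30% of pods work using the old ALB IP and the rest fail, consistent with a mix of unexpired and expired caches.
- medium: nginx in us-west-2 has no fallback resolver or health checks, creating a single point of failure.
Evidence: No direct evidence; the failure affects only us-west-2, which may reflect a configuration difference.
- low: DNS propagation delay means the us-west-2 DNS servers have not yet picked up the updated record.
Evidence: The failure appeared about 30 minutes after the change, whereas DNS propagation is usually faster; and since the new NLB has no us-west-2 record, resolution would fail even after propagation.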
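The cache-expiry reasoning in these hypotheses can be checked with simple arithmetic. The sketch below (helper name hypothetical) computes the latest moment a TTL-respecting resolver could still serve an answer cached just before the 21:40 change under the old 300 s TTL.

```python
from datetime import datetime, timedelta


def latest_cache_expiry(change_time: datetime, old_ttl_s: int) -> datetime:
    """Latest time a record cached just before the change may still be served."""
    return change_time + timedelta(seconds=old_ttl_s)


# The calendar date is arbitrary; the clock times come from the incident context.
change = datetime(2026, 6, 2, 21, 40)
page = datetime(2026, 6, 2, 22, 8)

expiry = latest_cache_expiry(change, 300)
print(expiry.strftime("%H:%M"))  # 21:45
print(page > expiry)             # True
```

Any resolver cache honoring the old TTL would have dropped the old answer by 21:45, well before the 22:08 page. One possible reading is that the roughly 30% of pods still using the old ALB IP hold it outside the DNS TTL, for example in the process itself (nginx resolves hostnames in static upstream configuration when the configuration is loaded). This is an inference, not something the logs confirm.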
Investigation checklist
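- Confirm what DNS returns in us-west-2
dig internal-services.example.com @<us-west-2-dns-server> +short
Expected: Should return the NLB IPs, but may return the old ALB IPs or nothing (with +short, NXDOMAIN shows as empty output)
- Check the nginx upstream error logs
kubectl logs -n prod -l app=api-gateway --since=30m | grep 'upstream timed out' | head -20
Expected: Connection timeouts, with the old ALB IP or an invalid IP as the target
- Verify whether the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names internal-services-nlb
Expected: Empty result or an error, confirming the NLB does not exist there
- Check the DNS record after the change
aws route53 list-resource-record-sets --hosted-zone-id <zone-id> --query "ResourceRecordSets[?Name=='internal-services.example.com.']"
Expected: A CNAME pointing to the DNS name of the us-east-1 NLB
- Check the DNS cache state of the pods (kubectl exec takes a single pod, not a label selector, so loop over the pods)
for p in $(kubectl get pods -n prod -l app=api-gateway -o name); do kubectl exec -n prod "$p" -- nslookup internal-services.example.com 2>&1 | head -5; done
Expected: Some pods return the old IP, others return NXDOMAIN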
Mitigation plan
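Roll back the DNS change immediately: point the internal-services.example.com CNAME back at the old ALB and restore the TTL to 300 s
Risk: The old ALB may be slated for decommissioning, but it is still serving correctly; traffic recovers after the rollback with no risk of data loss.
Rollback: Re-apply the CNAME to the new NLB (that is, repeat the change), but only after an NLB exists in us-west-2.
Create a new NLB in us-west-2 and update the DNS record to point at it (alternative)
Risk: Creating an NLB takes time (roughly 5-10 minutes), during which the outage continues; it may also introduce new configuration errors.
Rollback: Delete the us-west-2 NLB and roll DNS back to the old ALB.
If rolling back DNS, confirm that the old ALB is healthy and can handle the traffic
Risk: The old ALB may have been scaled down or reconfigured, leaving it short of capacity.
Rollback: Scale up the old ALB or switch to another option.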
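If the zone is hosted in Route 53, as the checklist command suggests, the DNS rollback amounts to a single UPSERT of the record. The sketch below only builds the change batch; the helper name and the `<legacy-alb-dns-name>` placeholder are hypothetical, and the real ALB DNS name is not given in the incident context.

```python
import json


def rollback_change_batch(record_name: str, old_target: str, ttl_s: int = 300) -> dict:
    """Build a Route 53 ChangeBatch that restores the CNAME target and TTL."""
    return {
        "Comment": "Roll back CNAME target and TTL",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl_s,
                    "ResourceRecords": [{"Value": old_target}],
                },
            }
        ],
    }


batch = rollback_change_batch(
    "internal-services.example.com.",
    "<legacy-alb-dns-name>",  # placeholder: fill in the real old ALB DNS name
)
print(json.dumps(batch, indent=2))

# Applying it would look roughly like this (requires boto3 and credentials):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<zone-id>", ChangeBatch=batch)
```

Building the batch separately from applying it lets a second person review the exact record before it goes live, which fits a change made under incident pressure.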
Customer impact
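About 35% of user requests in us-west-2 fail with a 502 error, and affected users cannot use the service normally. Regional p99 latency has risen to 4 s. Users in us-east-1 are unaffected. Recovery is expected within 10 minutes of the DNS rollback.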
Postmortem draft
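Postmortem report
Summary
[FILL IN] Brief description of the incident: a DNS change caused 35% of api-gateway requests in us-west-2 to fail.
Timeline (UTC)
- 21:40 DNS team changes the TTL and CNAME of internal-services.example.com
- 22:08 Page fires
- 22:10 502 rate in us-west-2 observed at about 35%
- 22:11 DNS team confirms the planned change
- 22:12 Debate: revert DNS or hot-patch us-west-2 to point at the us-east-1 NLB
- [FILL IN] Rollback time
- [FILL IN] Recovery time
Impact
- 35% of requests in us-west-2 return 502; p99 latency is 4 s
- Number of affected users: [FILL IN]
- Duration: [FILL IN]
Root cause
The DNS CNAME target points to an NLB that exists only in us-east-1; us-west-2 cannot resolve it, which leads to nginx upstream connection timeouts.
Detection
Detected through a monitoring alert on the spike in 502 rate; logs showed upstream timeouts.
Response
- 22:08 Page acknowledged
- 22:11 DNS change confirmed as the root cause
- [FILL IN] Rollback executed
What went well
- Root cause located quickly
- DNS team responded promptly
What went poorly
- The DNS change did not account for the multi-region deployment
- The NLB's presence in every region was not verified before the change
- The risk of lowering the TTL was not assessed
Action items
- [FILL IN] Add a cross-region validation check for DNS changes
- [FILL IN] Add alerting on DNS resolution failures
- [FILL IN] Review the change process to require multi-region deployment validation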
Follow-ups
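- P0: Add a cross-region validation check for DNS changes, ensuring the CNAME target exists in every region that serves traffic. Owner: Platform team
- P1: Add alerting on DNS resolution failures (such as NXDOMAIN or resolution timeouts). Owner: Observability team
- P1: Review the DNS change process to require multi-region deployment validation before a change. Owner: SRE team
- P2: Add a fallback DNS resolver or caching strategy to api-gateway to reduce single points of failure. Owner: Service owner
- P2: Update the runbook with troubleshooting steps for DNS-related failures. Owner: on-call SRE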
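The P0 cross-region check could start as a simple set comparison run before a DNS change is approved. This is a minimal sketch; the function name and the way the region sets are obtained are assumptions, since the source does not describe the change tooling.

```python
def missing_regions(serving_regions: set[str], target_regions: set[str]) -> list[str]:
    """Regions that serve traffic but where the new CNAME target does not exist."""
    return sorted(serving_regions - target_regions)


# Values from this incident: api-gateway serves both regions,
# while the new NLB was created in us-east-1 only.
missing = missing_regions({"us-east-1", "us-west-2"}, {"us-east-1"})
if missing:
    print("BLOCK change: target missing in " + ", ".join(missing))
else:
    print("OK: target present in all serving regions")
```

Run against the 21:40 change, this check would have reported us-west-2 as missing and blocked the CNAME switch.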
Similar past incidents
lexical match (pg_trgm)
- 63%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 55%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 37%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 37%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 16%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts