[Eval][v3][zh] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:05:12 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
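In us-west-2, api-gateway is returning 502 for 35% of requests and p99 latency has risen to 4 seconds following a DNS change. Root cause: the DNS team repointed the CNAME for internal-services.example.com to a newly created NLB that exists only in us-east-1, so resolution in us-west-2 either fails or returns stale IPs.
Severity reasoning: User-visible outage. 35% of requests return 502, which exceeds the 1% threshold and has lasted more than 5 minutes; a p99 latency of 4 seconds severely degrades the user experience. This meets the SEV1 definition.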
deepseek-chat·prompt v3·output: zh·11081ms·2446↑ / 1539↓ tok·$0.00235
Root cause hypotheses
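- high: The DNS change left us-west-2 unable to resolve internal-services.example.com
Evidence: nginx logs show 'upstream timed out' and error traces show DNS resolution timeouts; 35% of requests fail in us-west-2 while us-east-1 is normal.
- high: The new NLB is deployed only in us-east-1, and no us-west-2 record exists
Evidence: The recent-change notes show the new NLB was created in us-east-1 only; resolution in us-west-2 returns NXDOMAIN or stale IPs.
- medium: Lowering the DNS TTL from 300s to 30s made caches expire faster and amplified the failure
Evidence: The change record shows the TTL reduction, which caused more pods to pick up the invalid DNS record sooner.
- medium: The old ALB's DNS records were not removed, so some pods are still using stale IPs
Evidence: About 30% of pods still work, which indicates they have the old ALB's IP cached.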
Investigation checklist
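- Check nginx upstream error logs in us-west-2
kubectl logs -n prod -l app=api-gateway --since=30m --tail=-1 | grep -E 'upstream timed out|Connection timed out' | head -20
Expected: many 'upstream timed out' errors, confirming upstream connection failures consistent with the DNS problem
- Verify the DNS resolution result
dig internal-services.example.com @<us-west-2-dns-server> +short
Expected: empty output (NXDOMAIN) or a wrong IP rather than the correct NLB IP
- Check whether the new NLB exists in the other region
aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name
Expected: empty result or a not-found error, confirming there is no NLB in us-west-2
- Check the old ALB's status
aws elbv2 describe-load-balancers --region us-west-2 --names old-alb-name
Expected: returns the old ALB's details, confirming it is still running
- Check DNS caching per pod (kubectl exec has no label selector, so loop over the pods)
for p in $(kubectl get pods -n prod -l app=api-gateway -o name); do echo "$p"; kubectl exec -n prod "$p" -- nslookup internal-services.example.com 2>/dev/null | tail -n 3; done
Expected: some pods return the old IP, others return NXDOMAIN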
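The per-pod check is easier to read as counts than as raw nslookup output. A minimal sketch of that tally, assuming the results have already been reduced to one `<pod> <answer>` pair per line, where the answer is an IP address or the literal string NXDOMAIN. The function name `tally_dns`, the input format, and the sample pod names are illustrative and not part of the original tooling.

```shell
#!/usr/bin/env bash
# Tally per-pod DNS answers read from stdin.
# Assumed (hypothetical) input format: "<pod-name> <answer>" per line,
# where <answer> is an IP address or the literal string NXDOMAIN.
tally_dns() {
  awk '
    $2 == "NXDOMAIN" { failed++; next }   # pod could not resolve the name
    NF >= 2          { resolved++ }       # pod got some IP (possibly stale)
    END {
      printf "resolved=%d failed=%d total=%d\n", resolved, failed, resolved + failed
    }
  '
}

# Example with made-up pod names and documentation-range IPs:
printf '%s\n' \
  "gw-a 192.0.2.10" \
  "gw-b NXDOMAIN" \
  "gw-c NXDOMAIN" | tally_dns
```

A resolved count near 30% of the total would match the share of pods reported as still holding the old ALB IP.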
Mitigation plan
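Roll back the DNS change immediately: point the CNAME for internal-services.example.com back at the old ALB and restore the TTL to 300s
Risk: DNS propagation delay during the rollback may briefly affect some requests; the old ALB is still running, so the risk is low
Rollback: Re-apply the current DNS configuration (pointing at the new NLB)
Create a new NLB in us-west-2, configure its target group, then update DNS to point at it
Risk: Creating the NLB takes time and the outage continues meanwhile; a misconfiguration could introduce new problems
Rollback: Delete the new NLB and point back at the old ALB
Temporarily change the us-west-2 nginx config to point the upstream at the old ALB's IP addresses (hard-coded)
Risk: Bypasses DNS, but ALB IPs are not static and the old ALB may be decommissioned later; the config change requires reloading or restarting nginx
Rollback: Restore the nginx config so it uses DNS resolution again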
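The first option (DNS rollback) can be prepared ahead of the decision. A minimal sketch, assuming the zone is hosted in Amazon Route 53, which the incident notes do not state. The helper name `rollback_change_batch`, the target `old-alb.example.invalid`, and the `HOSTED_ZONE_ID` variable are placeholders, not values from the incident.

```shell
#!/usr/bin/env bash
# Print a Route 53 change batch that points the record back at the old ALB
# with the original 300s TTL. Using Route 53 is an assumption.
rollback_change_batch() {
  local record_name=$1 old_target=$2 ttl=${3:-300}
  cat <<EOF
{
  "Comment": "Roll back CNAME to legacy ALB",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${old_target}" }]
      }
    }
  ]
}
EOF
}

# old-alb.example.invalid is a placeholder for the legacy ALB's DNS name.
rollback_change_batch internal-services.example.com old-alb.example.invalid

# To apply, save the output to a file and run (HOSTED_ZONE_ID is a placeholder):
# aws route53 change-resource-record-sets \
#   --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch file://rollback.json
```

UPSERT replaces the existing record in one step, so the same helper with the new NLB's name and a 30s TTL would serve as the rollback of the rollback.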
Customer impact
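About 35% of user requests in us-west-2 fail with a 502 error; the remaining requests see significantly higher latency (p99 of 4 seconds). Affected users cannot use the service normally.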
Postmortem draft
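Summary
[FILL IN] Time: detected at 22:10 UTC, resolved at [FILL IN]. In us-west-2, 35% of api-gateway requests failed because of a DNS misconfiguration.
Timeline (UTC)
- 21:40 DNS team modified the internal-services.example.com record
- 22:08 Alert fired
- 22:10 Outage confirmed
- [FILL IN] DNS rolled back
- [FILL IN] Recovered
Impact
35% of requests in us-west-2 returned 502, with p99 latency of 4 seconds, for about [X] minutes.
Root cause
The DNS change pointed the CNAME at a new NLB deployed only in us-east-1, so resolution failed in us-west-2.
Detection
The alert fired at 22:08, 28 minutes after the change, so detection was delayed.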
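The 28-minute detection delay follows from the two timeline entries (change at 21:40, alert at 22:08). A minimal sketch of that arithmetic, assuming both timestamps fall on the same UTC day; the function name `minutes_between` is illustrative.

```shell
#!/usr/bin/env bash
# Minutes between two same-day HH:MM timestamps.
minutes_between() {
  # Split "HH:MM HH:MM" on colons and spaces into four numeric fields.
  echo "$1 $2" | awk -F'[: ]' '{ print ($3 * 60 + $4) - ($1 * 60 + $2) }'
}

minutes_between 21:40 22:08   # prints 28
```

The same helper can fill in the [X] duration once the recovery time is known.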
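Response
[FILL IN] Rollback decision and execution details.
What went well
- The DNS team confirmed the change promptly
- The alerting system worked
What went poorly
- The DNS change was not verified across regions
- The old ALB was not decommissioned promptly, which caused confusion
- The SRE team was not notified of the change
Action items
- [ ] Establish a cross-region verification process for DNS changes
- [ ] Add alerting on DNS resolution failures
- [ ] Run rollback drills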
Follow-ups
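- P0: Establish a cross-region verification process for DNS changes so new resources are available in every region (owner: platform team)
- P1: Add alerting on DNS resolution failures and monitor the upstream resolution error rate (owner: on-call SRE)
- P1: Clean up the old ALB's DNS records to avoid confusion (owner: infrastructure team)
- P2: Review the change-management process and require that DNS changes be announced to the SRE team (owner: platform team)
- P2: Drill the DNS rollback procedure so it can be executed quickly (owner: on-call SRE)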
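The P0 item (cross-region verification) could take the form of a pre-change gate that fails if the name does not resolve in any one region. A minimal sketch under stated assumptions: `resolve_in_region` is a hypothetical hook, stubbed here to reproduce the incident state, and in practice would be replaced with a query against each region's resolver (for example `dig +short "$name" @<resolver>`).

```shell
#!/usr/bin/env bash
# Hypothetical hook, stubbed to mirror the incident:
# us-east-1 resolves, every other region returns nothing (as with NXDOMAIN).
resolve_in_region() {
  case $1 in
    us-east-1) echo "192.0.2.20" ;;   # documentation-range IP, not a real NLB
    *) : ;;                           # no answer
  esac
}

# Return non-zero if the name fails to resolve in any listed region.
verify_all_regions() {
  local name=$1 region failed=0
  shift
  for region in "$@"; do
    if [ -z "$(resolve_in_region "$region" "$name")" ]; then
      echo "FAIL ${region}: ${name} did not resolve"
      failed=1
    else
      echo "OK   ${region}"
    fi
  done
  return "$failed"
}

verify_all_regions internal-services.example.com us-east-1 us-west-2 \
  || echo "change blocked: fix the failing regions first"
```

Running every region rather than stopping at the first failure gives the change owner the full list of gaps in one pass.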
Similar past incidents
lexical match (pg_trgm)
- 77%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 62%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 44%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 43%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 18%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts