[Eval][v2][zh] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:03:22 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
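api-gateway in us-west-2 is returning 502 for 35% of requests with a p99 latency of 4s. At 21:40 UTC the DNS team repointed the CNAME for internal-services.example.com to a new NLB that exists only in us-east-1, so pods in us-west-2 cannot resolve a valid upstream.
Severity reasoning: User-visible outage. The 35% 502 rate exceeds the 1% threshold and has lasted more than 5 minutes (change at 21:40, detected at 22:10), which meets the SEV1 criteria.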
deepseek-chat·prompt v2·output: zh·11629ms·2195↑ / 1726↓ tok·$0.00249
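The severity reasoning above can be sketched as a small rule check. This is a minimal illustration, not the team's actual severity policy: the 1% and 5-minute thresholds come from the reasoning text, while the names Observation and is_sev1 are hypothetical.

```python
from dataclasses import dataclass

# Thresholds quoted in the severity reasoning; treat them as assumptions
# about the SEV1 policy rather than an authoritative definition.
ERROR_RATE_THRESHOLD = 0.01   # 1% of requests
MIN_DURATION_MINUTES = 5


@dataclass
class Observation:
    error_rate: float        # fraction of failing requests, e.g. 0.35
    duration_minutes: float  # how long the elevated rate has lasted
    user_visible: bool       # whether end users see the failures


def is_sev1(obs: Observation) -> bool:
    """Return True when all three SEV1 conditions from the text hold."""
    return (
        obs.user_visible
        and obs.error_rate > ERROR_RATE_THRESHOLD
        and obs.duration_minutes > MIN_DURATION_MINUTES
    )


# 21:40 change to 22:10 detection is 30 minutes at a 35% 502 rate.
incident = Observation(error_rate=0.35, duration_minutes=30, user_visible=True)
print(is_sev1(incident))
```

The baseline 502 rate of 0.05% would not trip the same check, since it sits well under the 1% threshold.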
Root cause hypotheses
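- high: The DNS change left us-west-2 unable to resolve the upstream address
Evidence: nginx logs show 'upstream timed out (110: Connection timed out) while connecting to upstream', and some us-west-2 pods get NXDOMAIN.
- high: The new NLB was not created in us-west-2, so resolution from that region fails
Evidence: The incident context states that the new NLB was created in us-east-1 only and that no us-west-2 record exists.
- medium: The old ALB's DNS records were never cleaned up, so some pods are still serving from stale cache entries
Evidence: About 30% of pods are still working, which indicates they have the old ALB IP cached.
- medium: Cutting the DNS TTL from 300s to 30s made the cache inconsistency worse
Evidence: With the shorter TTL, more pods picked up the wrong DNS record quickly.
- low: The api-gateway configuration does not pin an upstream region, so resolution fails across regions
Evidence: us-east-1 is healthy, which suggests the configuration may depend on region-specific DNS resolution.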
Investigation checklist
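- Confirm what internal-services.example.com resolves to from us-west-2
for p in $(kubectl get pods -n prod -l app=api-gateway -o name); do kubectl exec -n prod "$p" -- nslookup internal-services.example.com; done 2>&1 | head -20
(kubectl exec takes a pod name rather than a label selector, so the loop lists pods by label first.)
Expected: NXDOMAIN or a wrong IP (not a us-west-2 NLB IP)
- Check the nginx upstream error logs in us-west-2
kubectl logs -n prod -l app=api-gateway --since=30m | grep -i 'upstream timed out' | head -20
Expected: a large number of 'Connection timed out' errors
- Verify whether the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name 2>&1
Expected: a 'LoadBalancerNotFound' error
- Check that DNS resolution is healthy in us-east-1 (run against the us-east-1 cluster context)
for p in $(kubectl get pods -n prod -l app=api-gateway -o name); do kubectl exec -n prod "$p" -- nslookup internal-services.example.com; done 2>&1 | head -20
Expected: the correct us-east-1 NLB IP
- Confirm the old ALB is still running
aws elbv2 describe-load-balancers --region us-west-2 --names old-alb-name 2>&1
Expected: details for the old ALB, with state active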
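The resolution checks above could also be scripted from inside a pod. The following is a rough sketch under stated assumptions: the helper names resolve and classify are hypothetical, and the addresses are RFC 5737 documentation IPs standing in for the real old ALB and new NLB addresses, which the incident context does not give.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the sorted IPv4 addresses for hostname, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        # Covers NXDOMAIN and resolver timeouts alike.
        return []
    return sorted({info[4][0] for info in infos})


def classify(ips: list[str], old_alb_ips: set[str], new_nlb_ips: set[str]) -> str:
    """Label a lookup result as unresolved, new-nlb, old-alb, or unknown."""
    if not ips:
        return "unresolved"
    seen = set(ips)
    if seen <= new_nlb_ips:
        return "new-nlb"
    if seen <= old_alb_ips:
        return "old-alb"
    return "unknown"


# Placeholder addresses (assumptions, not values from the incident).
OLD_ALB_IPS = {"192.0.2.10", "192.0.2.11"}
NEW_NLB_IPS = {"198.51.100.20"}

print(classify([], OLD_ALB_IPS, NEW_NLB_IPS))
print(classify(["192.0.2.10"], OLD_ALB_IPS, NEW_NLB_IPS))
```

In use, classify(resolve("internal-services.example.com"), OLD_ALB_IPS, NEW_NLB_IPS) run in each region would be expected to separate pods that fail to resolve from pods still pointed at the old ALB.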
Mitigation plan
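Roll back the DNS change immediately: point the CNAME for internal-services.example.com back at the old ALB and restore the TTL to 300s
Risk: While the rollback propagates, some pods may keep failing, but the rollback itself introduces no new problems.
Rollback: Reapply the current DNS record (pointing at the new NLB), but only after the missing NLB in us-west-2 has been fixed.
Create the new NLB in us-west-2 and point the DNS record at it
Risk: Creating an NLB takes time and may introduce new configuration errors.
Rollback: Delete the us-west-2 NLB and roll DNS back to the old ALB.
Temporarily route us-west-2 api-gateway traffic to the us-east-1 NLB (by changing the upstream configuration or using Global Accelerator)
Risk: Cross-region traffic adds latency, and us-east-1 may be overloaded.
Rollback: Restore the us-west-2 upstream configuration to local resolution.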
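For the first option (roll back the CNAME and restore the 300s TTL), the change itself is small. The sketch below assumes the zone is hosted in Amazon Route 53, which the incident context does not state, and uses a hypothetical target name, legacy-alb.example.com, in place of the old ALB's real DNS name. It only builds the request payload and makes no API call.

```python
def build_rollback_change(record_name: str, target: str, ttl: int = 300) -> dict:
    """Build a Route 53 style change batch that upserts a CNAME record.

    The dictionary follows the ChangeBatch structure accepted by the
    Route 53 ChangeResourceRecordSets API.
    """
    return {
        "Comment": "Roll back CNAME to legacy ALB and restore TTL",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ],
    }


# "legacy-alb.example.com" is a placeholder, not the real ALB DNS name.
change_batch = build_rollback_change(
    "internal-services.example.com", "legacy-alb.example.com"
)
print(change_batch["Changes"][0]["ResourceRecordSet"]["TTL"])
```

If the zone is in Route 53, a payload like this could be passed as the ChangeBatch argument to boto3's change_resource_record_sets together with the hosted zone ID. With another DNS provider, the same two fields (target and TTL) are what need to be reverted.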
Customer impact
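About 35% of user requests in us-west-2 return 502, and page loads are slow (p99 4s). As a rough estimate, about 35% of the region's users are affected, though a request failure rate does not map directly onto a user count.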
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
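- 21:40 - DNS team repoints the CNAME for internal-services.example.com from the old ALB to the new NLB (us-east-1 only) and changes the TTL from 300s to 30s
- 22:08 - Alert received, 502 error rate at 35%
- 22:10 - us-west-2 outage confirmed, us-east-1 healthy
- 22:11 - DNS team confirms the change
- 22:12 - Rollback options discussed
- [FILL IN] - Rollback executed
- [FILL IN] - Service recovered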
Impact
35% of requests in us-west-2 failed, with p99 latency at 4s, for about [FILL IN] minutes.
Root Cause
The DNS change pointed internal-services.example.com at an NLB that was created only in us-east-1, so pods in us-west-2 could not resolve a valid upstream.
Detection
Detected through a monitoring alert on the spike in the 502 error rate.
Response
[FILL IN]
What Went Well
- The DNS change was identified quickly as the trigger
- Teams collaborated smoothly
What Went Poorly
- The DNS change was not validated in us-west-2
- The SRE team was not notified of the change
Action Items
- [FILL IN]
Follow-ups
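- P0: Add a multi-region validation step to the DNS change process (owner: platform team)
- P1: Add upstream health checks and an automatic rollback mechanism to api-gateway (owner: service owner)
- P1: Review the approval and notification process for DNS changes (owner: platform team)
- P2: Add cross-region DNS resolution monitoring (owner: on-call SRE)
- P2: Run a drill of the DNS change rollback procedure (owner: platform team)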
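The P0 follow-up (multi-region validation for DNS changes) could take the form of a simple gate. This is a minimal sketch, assuming per-region lookup results are collected separately, for example by a probe in each region. The function name regions_missing_record is hypothetical and the address is a documentation IP.

```python
def regions_missing_record(
    resolutions: dict[str, list[str]], required_regions: list[str]
) -> list[str]:
    """Return the required regions whose lookup produced no addresses.

    A region absent from `resolutions` is treated the same as one that
    resolved to nothing, so an unprobed region also blocks the change.
    """
    return [region for region in required_regions if not resolutions.get(region)]


# Lookup results shaped like this incident: us-east-1 resolves, us-west-2 does not.
observed = {"us-east-1": ["192.0.2.20"], "us-west-2": []}
missing = regions_missing_record(observed, ["us-east-1", "us-west-2"])
print(missing)  # ['us-west-2']
```

A change pipeline could refuse to proceed whenever the returned list is non-empty, which would have flagged the missing us-west-2 record before the CNAME was switched.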
Similar past incidents
lexical match (pg_trgm)
- 57%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 55%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 51%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 50%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 19%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts