[Eval][v3][zh] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:05:29 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08: pager
- 22:11: DNS team confirms the planned change
- 22:12: debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
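In us-west-2, the API gateway is returning HTTP 502 for 35% of requests and p99 latency is 4 s. At 21:40 UTC the DNS team repointed the CNAME for internal-services.example.com to a new NLB that is deployed only in us-east-1, so pods in us-west-2 cannot resolve a valid IP.
Severity reasoning: User-visible outage: 35% of requests return 502, which is above the 1% threshold, and the condition has lasted more than 5 minutes (the alert fired at 22:08 and the service has not recovered). This matches the SEV1 criteria "user-facing failure" and "error rate >1% for more than 5 minutes".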
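The severity rule quoted above (user-facing, error rate above 1%, sustained for more than 5 minutes) can be written as a small predicate. This is only a sketch: the `Incident` type, the field names, and the 6-minute duration in the example are illustrative assumptions, not part of any existing tooling.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    error_rate: float        # fraction of failed requests, e.g. 0.35 for 35%
    duration_minutes: float  # how long the error rate has been elevated
    user_facing: bool        # True when end users see the failures


def is_sev1(incident, rate_threshold=0.01, min_minutes=5):
    """Return True when all SEV1 criteria quoted in the report are met."""
    return (
        incident.user_facing
        and incident.error_rate > rate_threshold
        and incident.duration_minutes > min_minutes
    )


# Illustrative values: 35% comes from the incident context,
# the 6-minute duration is an assumed example value.
example = Incident(error_rate=0.35, duration_minutes=6, user_facing=True)
sev1 = is_sev1(example)
```

Keeping the thresholds as parameters makes it easy to reuse the check if the SEV definitions change.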
deepseek-chat·prompt v3·output: zh·12096ms·2446↑ / 1547↓ tok·$0.00236
Root cause hypotheses
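- high: The DNS CNAME points at an NLB that exists only in us-east-1, so resolution fails in us-west-2
Evidence: Logs show nginx in us-west-2 getting NXDOMAIN or stale IPs when resolving internal-services.example.com; the new NLB was created in us-east-1 only.
- high: The TTL drop from 300 s to 30 s makes cached answers expire quickly, while the new record does not cover every region
Evidence: The TTL and CNAME were changed at 21:40 UTC and the page fired at 22:08; about 30% of pods still hold the old cached IP and work, the rest fail.
- low: The legacy ALB's DNS records were deleted, so falling back to it fails
Evidence: The context states that the legacy ALB is still up and that its DNS records were not removed, which argues against this hypothesis. It only matters if that statement turns out to be wrong, in which case a rollback to the ALB would not work.
- low: DNS resolvers in us-west-2 cached a bad SOA record
Evidence: No direct evidence, but a region-level DNS problem could cause resolution to fail on some pods.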
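To tell the first two hypotheses apart on a given pod, it helps to label each lookup as failed, stale, or fresh. The sketch below uses Python's standard `socket.gethostbyname_ex`; the function name, the labels, and the set of legacy ALB addresses are hypothetical and would have to be supplied by whoever runs it.

```python
import socket


def classify_resolution(hostname, stale_ips, resolve=socket.gethostbyname_ex):
    """Label one DNS lookup as 'unresolved', 'stale', or 'fresh'.

    stale_ips is a set of addresses known to belong to the legacy ALB
    (hypothetical input). resolve is injectable so the logic can be
    exercised without real DNS.
    """
    try:
        _name, _aliases, addresses = resolve(hostname)
    except (socket.gaierror, socket.herror):
        # Covers NXDOMAIN as well as other resolver failures.
        return "unresolved"
    if any(ip in stale_ips for ip in addresses):
        return "stale"
    return "fresh"
```

A pod reporting `stale` is one of the roughly 30% still reaching the legacy ALB; `unresolved` matches the NXDOMAIN symptom in the logs.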
Investigation checklist
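- Confirm what internal-services.example.com resolves to from us-west-2 pods
for p in $(kubectl get pods -n prod -l app=api-gateway -o name); do echo "$p"; kubectl exec -n prod "$p" -- nslookup internal-services.example.com 2>/dev/null; done
Expected: During the incident most pods return NXDOMAIN or addresses that are stale or unreachable from us-west-2, which confirms the problem. After mitigation every pod should resolve to a target reachable from us-west-2. (kubectl exec does not accept a label selector, so the pods are listed first.)
- Check the nginx upstream error logs
kubectl logs -n prod -l app=api-gateway --since=30m | grep 'upstream timed out' | head -20
Expected: A large number of 'upstream timed out' errors, confirming that nginx is timing out while connecting to the upstream.
- Verify whether the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names internal-services-nlb 2>&1
Expected: A 'LoadBalancerNotFound' error, showing that there is no NLB in us-west-2.
- Check that the legacy ALB is still running in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names legacy-alb 2>&1
Expected: The ALB details are returned, confirming that the legacy ALB is available.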
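Beyond eyeballing the first 20 matches, the log check can be reduced to a single number. The following sketch computes the share of non-empty log lines that contain the nginx timeout message; the function name is hypothetical, and the result is a share of log lines, not of requests.

```python
def upstream_timeout_share(log_lines, marker="upstream timed out"):
    """Return the fraction of non-empty log lines that contain the marker."""
    lines = [line for line in log_lines if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if marker in line)
    return hits / len(lines)


# Sample lines; the first message is quoted from the incident context,
# the second is an invented placeholder for unrelated log output.
sample = [
    "upstream timed out (110: Connection timed out) while connecting to upstream",
    "GET /healthz 200",
    "upstream timed out (110: Connection timed out) while connecting to upstream",
    "",
]
share = upstream_timeout_share(sample)
```

In practice the lines would come from the `kubectl logs` output, for example by reading it from standard input.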
Mitigation plan
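Roll back the DNS change immediately: point the CNAME for internal-services.example.com back at the legacy ALB and restore the TTL to 300 s
Risk: DNS propagation delay during the rollback may leave some users seeing 502s for a while, but the impact will shrink steadily.
Rollback: Re-apply the current DNS change (pointing at the new NLB), but only after the missing us-west-2 NLB has been fixed.
Create a new NLB in us-west-2 and update the DNS record to point at it
Risk: Creating the NLB takes time (roughly 5-10 minutes), and the outage continues in the meantime.
Rollback: Delete the new NLB and roll DNS back to the legacy ALB.
Temporarily configure api-gateway in us-west-2 to use the legacy ALB's IP addresses directly (bypassing DNS)
Risk: Manually configured IPs can stop working if the ALB's IPs change, so this needs monitoring.
Rollback: Restore the DNS-based resolution configuration.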
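If the zone happens to be hosted in Amazon Route 53 (an assumption; the incident context does not say which DNS provider is used), the first option corresponds to one UPSERT of the CNAME record. The sketch below only builds the change-batch document; the legacy ALB hostname is a placeholder, not a value from the report.

```python
import json


def build_rollback_change_batch(record_name, legacy_target, ttl=300):
    """Build a Route 53 change batch that repoints a CNAME and restores its TTL."""
    return {
        "Comment": "Roll back CNAME to the legacy ALB and restore the TTL",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": legacy_target}],
                },
            }
        ],
    }


batch = build_rollback_change_batch(
    "internal-services.example.com",
    "legacy-alb.example.internal",  # placeholder for the real ALB DNS name
)
print(json.dumps(batch, indent=2))
```

The printed JSON could then be saved to a file and passed to `aws route53 change-resource-record-sets` through `--change-batch`, together with the hosted zone ID.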
Customer impact
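About 35% of user requests in us-west-2 are failing with HTTP 502, and affected users cannot use the service normally. p99 latency has risen to 4 s and some users are hitting timeouts. us-east-1 is operating normally. Recovery is expected within 30 minutes of the DNS rollback.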
Postmortem draft
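Summary
[FILL IN] 2025-xx-xx us-west-2 API gateway 502 outage
Timeline (UTC)
- 21:40 DNS team changes the CNAME and TTL for internal-services.example.com
- 22:08 Alert fires; 502 error rate at 35%
- 22:11 DNS team confirms the change
- [FILL IN] Rollback completion time
Impact
35% of requests in us-west-2 failed, with p99 latency at 4 s, for about [X] minutes.
Root cause
The DNS CNAME pointed at an NLB that exists only in us-east-1, so resolution failed in us-west-2.
Detection
The alert fired at 22:08 UTC, 28 minutes after the DNS change at 21:40 UTC. The context does not record when the error rate first started to rise: [FILL IN].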
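The 28-minute gap between the DNS change (21:40) and the page (22:08) follows directly from the timeline timestamps. A minimal sketch of that arithmetic, assuming both times fall on the same UTC day (the helper name is hypothetical):

```python
from datetime import datetime


def minutes_between(start_hhmm, end_hhmm):
    """Return the minutes from start to end for two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return delta.total_seconds() / 60


change_to_page = minutes_between("21:40", "22:08")   # DNS change to pager
page_to_confirm = minutes_between("22:08", "22:11")  # pager to DNS team confirmation
```

The same helper can fill in the remaining timeline gaps once the rollback completion time is known.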
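Response
[FILL IN] Rollback decision and execution details.
What went well
- The DNS team confirmed the change promptly
- The alert fired accurately
What went poorly
- The DNS change was not validated for cross-region availability
- No automated check exists to prevent this kind of misconfiguration
Action items
- [ ] Add a regional validation check to DNS changes (P0)
- [ ] Add alerting on DNS resolution failures (P1)
- [ ] Document the rollback procedure (P2)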
Follow-ups
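- P0: Add a regional coverage verification step to the DNS change process (platform team)
- P1: Add alerting on DNS resolution failures, based on nginx upstream errors (on-call SRE)
- P1: Review the change management process for DNS TTL changes (platform team)
- P2: Write a standard operating procedure for DNS rollbacks (on-call SRE)
- P2: Check whether other regions have similar DNS configuration problems (platform team)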
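The P0 regional coverage check can be as simple as a set difference between the regions that serve traffic and the regions where the new DNS target exists. This is a sketch only: the function name is hypothetical, and how the two region lists are obtained (inventory, cloud API, change ticket) is left open.

```python
def uncovered_regions(serving_regions, target_regions):
    """Return the serving regions that have no instance of the new DNS target."""
    return sorted(set(serving_regions) - set(target_regions))


# Regions from this incident: traffic is served in both regions,
# but the new NLB was created in us-east-1 only.
missing = uncovered_regions(["us-east-1", "us-west-2"], ["us-east-1"])
if missing:
    print("Block the DNS change; no target in:", ", ".join(missing))
```

Run as a pre-change gate, a non-empty result would have blocked the 21:40 CNAME change before it reached us-west-2.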
Similar past incidents
lexical match (pg_trgm)
- 70%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 70%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 46%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 46%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 18%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts