[Eval][v1][zh] Regional 5xx spike after DNS TTL change
service: api-gatewaycreated: 5/25/2026, 10:22:47 PM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected. Symptoms (us-west-2 only): - 502 rate: 35% (baseline 0.05%) - p99 latency: 4s (mostly DNS resolution timeouts in error trace) - nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream" Recent changes: - 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration - Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB) - Old ALB still up and answering, but its DNS records were not removed - New NLB was created in us-east-1 only — no us-west-2 record exists Logs: - nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs - ~30% of pods have cached the old ALB IP and are working; rest are failing - us-east-1 pods all resolve correctly to the new NLB On-call: - 22:08 — pager - 22:11 — DNS team confirms the planned change - 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
api-gateway在us-west-2区域因DNS变更导致35%请求返回502错误,p99延迟升至4秒。DNS团队将internal-services.example.com的CNAME指向了仅存在于us-east-1的新NLB,us-west-2无对应记录,导致部分pod DNS解析失败。
Severity reasoning: 用户可见的35%请求失败,p99延迟4秒,影响生产流量,符合SEV1标准(用户面故障)。
deepseek-chat·prompt v1·output: zh·9420ms
Root cause hypotheses
- highDNS CNAME目标指向了仅存在于us-east-1的NLB,us-west-2无对应A记录,导致解析失败。
Evidence: nginx日志显示DNS解析超时,us-west-2 pod解析internal-services.example.com返回NXDOMAIN或旧IP。
- highDNS TTL从300s改为30s后,旧缓存快速过期,但新记录未正确传播到us-west-2的DNS解析器。
Evidence: 约30% pod仍使用旧ALB IP正常工作,其余失败,符合缓存过期时间差异。
- low新NLB的健康检查配置错误,导致us-west-2流量被拒绝。
Evidence: us-east-1正常,us-west-2无NLB记录,健康检查不适用。
Investigation checklist
- 确认us-west-2的DNS解析结果
kubectl exec -n ingress -c nginx -- nslookup internal-services.example.com 2>&1 | head -20Expected: 应返回us-west-2的NLB IP或旧ALB IP;若返回NXDOMAIN或超时,则确认DNS问题。
- 检查us-west-2的NLB是否存在
aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name 2>&1Expected: 若返回LoadBalancerNotFound,则确认us-west-2无NLB。
- 检查nginx upstream配置
kubectl exec -n ingress -c nginx -- cat /etc/nginx/conf.d/default.conf | grep internal-servicesExpected: 应显示upstream指向internal-services.example.com:443。
- 验证us-east-1的DNS解析
kubectl exec -n ingress -c nginx -- nslookup internal-services.example.com 2>&1 | head -20Expected: 应返回us-east-1的NLB IP。
Mitigation plan
立即回滚DNS变更:将internal-services.example.com的CNAME改回旧ALB,并将TTL恢复为300s。
Risk: 回滚期间DNS传播延迟,可能短暂影响部分pod;但旧ALB仍在运行,风险低。
Rollback: 若回滚后问题未解决,可重新应用新DNS记录。
在us-west-2创建新的NLB,并将DNS记录指向该NLB。
Risk: 创建NLB需要时间,期间故障持续;新NLB配置可能错误。
Rollback: 删除新NLB,继续使用旧ALB。
Customer impact
us-west-2区域约35%用户请求失败,返回502错误,p99延迟升至4秒。受影响用户无法正常使用服务,预计在DNS回滚后15分钟内恢复。
Postmortem draft
事后复盘
摘要
us-west-2区域api-gateway因DNS变更导致35%请求失败。
时间线
- 21:40 UTC: DNS团队将internal-services.example.com的CNAME指向仅存在于us-east-1的新NLB。
- 22:10 UTC: 告警触发,502错误率上升。
- 22:12 UTC: 开始调查。
- 22:XX UTC: 回滚DNS变更,服务恢复。
影响
- 35%请求失败,p99延迟4s,持续约XX分钟。
根因
DNS CNAME目标指向了仅存在于us-east-1的NLB,us-west-2无对应记录,导致pod DNS解析失败。
做得好的
- 快速识别DNS变更为根因。
- 回滚操作迅速。
做得不好的
- DNS变更未跨区域验证。
- 变更未经过变更管理流程。
行动项
- [ ] 增加DNS变更的跨区域验证检查。
- [ ] 实施变更管理审批流程。
- [ ] 添加DNS解析监控。
Follow-ups
- P0增加DNS变更的跨区域验证检查,确保新记录在所有区域存在。— DNS团队
- P1实施变更管理审批流程,DNS变更需SRE团队审核。— 平台团队
- P1添加DNS解析失败监控,对internal-services.example.com进行主动探测。— on-call SRE
- P2更新runbook,包含DNS回滚步骤。— 服务所有者
Similar past incidents
lexical match (pg_trgm)
- 64%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 45%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 45%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 18%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 18%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts