[Eval][v2][zh] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 6/3/2026, 12:03:36 AM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC - DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08 - pager
- 22:11 - DNS team confirms the planned change
- 22:12 - debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
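api-gateway in us-west-2 is returning HTTP 502 for about 35% of requests, with p99 latency at 4s. At 21:40 UTC the DNS team repointed the CNAME for internal-services.example.com to a new NLB that was created only in us-east-1, so pods in us-west-2 cannot resolve the name to a valid IP.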
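Severity reasoning: The failure is user-visible, the 35% error rate is far above the 1% threshold, it has lasted more than 5 minutes, and it affects the core API path, so it meets the SEV1 criteria.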
deepseek-chat·prompt v2·output: zh·9976ms·2195↑ / 1309↓ tok·$0.00203
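The severity reasoning above reads as a simple rule, and it can be sketched in code. This is a minimal illustration, assuming only the criteria named in that reasoning (user-visible, error rate above 1%, lasting more than 5 minutes, core API path affected). The names `IncidentSignal` and `meets_sev1` are hypothetical, and the 6-minute duration in the usage lines is an illustrative value, since the report only says "more than 5 minutes".

```python
from dataclasses import dataclass

# Thresholds taken from the severity reasoning; the constant names are hypothetical.
ERROR_RATE_THRESHOLD = 0.01   # 1% of requests failing
MIN_DURATION_MINUTES = 5      # must persist longer than this


@dataclass
class IncidentSignal:
    error_rate: float          # fraction of failed requests, e.g. 0.35 for 35%
    duration_minutes: float    # how long the elevated error rate has lasted
    user_visible: bool         # do end users see the failure?
    core_path_affected: bool   # is the core API path impacted?


def meets_sev1(signal: IncidentSignal) -> bool:
    """Return True only when every SEV1 criterion from the reasoning holds."""
    return (
        signal.user_visible
        and signal.core_path_affected
        and signal.error_rate > ERROR_RATE_THRESHOLD
        and signal.duration_minutes > MIN_DURATION_MINUTES
    )


# This incident: 35% errors on the core path (duration value is illustrative).
incident = IncidentSignal(error_rate=0.35, duration_minutes=6,
                          user_visible=True, core_path_affected=True)
# Baseline traffic: 0.05% errors stays below the threshold.
baseline = IncidentSignal(error_rate=0.0005, duration_minutes=6,
                          user_visible=True, core_path_affected=True)
sev1_now = meets_sev1(incident)
sev1_baseline = meets_sev1(baseline)
```

Requiring all criteria at once keeps a short blip or a non-core path from being paged as SEV1; a real policy may weigh these differently.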
Root cause hypotheses
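- high: The DNS CNAME now targets a new NLB that has no us-west-2 record, so resolution fails in us-west-2
Evidence: nginx logs show 'upstream timed out' and DNS resolution timeouts; no new NLB record exists in us-west-2
- medium: Cutting the DNS TTL from 300s to 30s expired cached entries sooner, while the new record did not propagate correctly
Evidence: about 30% of pods still use the old ALB IP and work normally, the rest fail; us-east-1 is normal
- low: Health checks on the new NLB are misconfigured, so us-west-2 traffic is rejected
Evidence: no direct evidence, but it should be ruled out; since no NLB exists in us-west-2, this hypothesis does not hold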
Investigation checklist
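- Confirm the DNS resolution result in us-west-2
kubectl exec -n prod deploy/api-gateway -- nslookup internal-services.example.com 2>/dev/null | head -20
Expected: the new NLB's IPs; NXDOMAIN or the old ALB's IP confirms the DNS problem
Note: kubectl exec takes a pod or TYPE/NAME, not a label selector (-l); deploy/api-gateway assumes the Deployment is named api-gateway and runs the command in one of its pods
- Check whether the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name 2>&1
Expected: the NLB's details; a 'LoadBalancerNotFound' error confirms that us-west-2 has no NLB
- Check the share of timeouts in the us-west-2 nginx error logs
kubectl logs -n prod -l app=api-gateway --since=30m --tail=-1 | grep -c 'Connection timed out'
Expected: a count close to 35% of the total requests in the same window, confirming that timeouts are the main cause of the 502s (--tail=-1 is needed because kubectl logs returns only the last 10 lines per pod when a selector is used)
- Verify that DNS resolution in us-east-1 is normal
kubectl exec -n prod deploy/api-gateway --context=us-east-1 -- nslookup internal-services.example.com 2>/dev/null | head -20
Expected: the new NLB's IPs, with no timeouts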
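The checklist above treats each lookup as one of three outcomes: the new NLB, the old ALB, or no answer. The sketch below tallies those outcomes across pods, which would show how close the fleet is to the "~30% of pods still working" figure from the logs. It is an illustration only: the function names, pod names, and IP addresses (taken from the RFC 5737 documentation ranges) are hypothetical, and collecting the per-pod answers is left to the commands above.

```python
from collections import Counter


def classify_answer(answer, new_nlb_ips, legacy_alb_ips):
    """Classify one pod's lookup result.

    answer: list of IP strings the pod resolved; an empty list stands for
    NXDOMAIN or a resolution timeout.
    """
    if not answer:
        return "no_answer"
    ips = set(answer)
    if ips & new_nlb_ips:
        return "new_nlb"
    if ips & legacy_alb_ips:
        return "legacy_alb"
    return "unknown"


def summarize(answers_by_pod, new_nlb_ips, legacy_alb_ips):
    """Count how many pods fall into each category."""
    return Counter(
        classify_answer(answer, new_nlb_ips, legacy_alb_ips)
        for answer in answers_by_pod.values()
    )


# Hypothetical placeholder addresses, not values from the incident.
NEW_NLB_IPS = {"192.0.2.10"}
LEGACY_ALB_IPS = {"198.51.100.20"}

# Hypothetical sample of what three us-west-2 pods might report.
sample = {
    "pod-a": ["198.51.100.20"],  # still cached the old ALB, working
    "pod-b": [],                 # NXDOMAIN or timeout, failing
    "pod-c": [],                 # NXDOMAIN or timeout, failing
}
summary = summarize(sample, NEW_NLB_IPS, LEGACY_ALB_IPS)
```

A result dominated by `no_answer` in us-west-2 and by `new_nlb` in us-east-1 would be consistent with the high-confidence hypothesis.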
Mitigation plan
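Roll back the DNS change immediately: point the CNAME for internal-services.example.com back at the old ALB and restore the TTL to 300s
Risk: DNS propagation delay during the rollback may leave some users seeing errors for a while longer, but the risk is low
Rollback: re-apply the current DNS change (pointing at the new NLB)
Create a new NLB in us-west-2 and point the DNS record at it
Risk: creating the NLB takes several minutes, during which the outage continues; security groups and health checks must be configured correctly
Rollback: delete the us-west-2 NLB and roll DNS back to the old ALB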
Customer impact
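About 35% of API requests in us-west-2 are failing with 502 errors, and latency has risen to 4s. Every service in the region that depends on the API gateway is affected. Recovery is expected within 10-15 minutes of the DNS rollback.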
Postmortem draft
Summary
[FILL IN]
Timeline (UTC)
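- 21:40 - DNS team changes the CNAME and TTL for internal-services.example.com
- 22:08 - Alert received
- 22:10 - 35% 502 error rate confirmed
- 22:11 - DNS team confirms the change
- 22:12 - Rollback options discussed
- [FILL IN] - Rollback executed
- [FILL IN] - Error rate recovers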
Impact
35% of API requests in us-west-2 failed, with p99 latency at 4s, for about [X] minutes.
Root Cause
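The DNS CNAME was pointed at a new NLB that was created only in us-east-1, so pods in us-west-2 could not resolve it to a valid IP, which produced a large volume of 502 errors.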
Detection
A monitoring alert fired when the error rate exceeded its threshold.
Response
[FILL IN]
What Went Well
[FILL IN]
What Went Poorly
[FILL IN]
Action Items
[FILL IN]
Follow-ups
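- P0: Add a cross-region validation step to the DNS change process so that new records are confirmed to exist in every region (owner: platform team)
- P1: Add alerting on DNS resolution failures, covering NXDOMAIN and timeouts (owner: on-call SRE)
- P1: Review the approval process for DNS TTL changes to prevent fast changes that skip validation (owner: service owner)
- P2: Add fallback DNS resolution logic to the API gateway, using a local cache or static IPs (owner: platform team)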
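The P0 follow-up above, cross-region validation before a DNS change ships, could take a shape like the following. It is a minimal sketch rather than the platform team's actual tooling: the function name `regions_missing_target`, the input format (a map from region to the records planned for that region), and the hostname in the example are all hypothetical.

```python
def regions_missing_target(targets_by_region, required_regions):
    """Return the required regions that have no record for the new target.

    targets_by_region: dict mapping a region name to the list of records
    that will exist for the new target in that region.
    required_regions: every region that serves traffic for the name.
    """
    return sorted(
        region
        for region in required_regions
        if not targets_by_region.get(region)
    )


# Hypothetical plan mirroring this incident: the new NLB exists in us-east-1 only.
planned = {
    "us-east-1": ["new-nlb.us-east-1.example.internal"],  # placeholder hostname
    "us-west-2": [],
}
missing = regions_missing_target(planned, ["us-east-1", "us-west-2"])

# A change gate would block the rollout whenever the list is non-empty.
change_allowed = not missing
```

Passing the required regions explicitly means a region that is absent from the plan altogether is also reported, instead of being silently skipped.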
Similar past incidents
lexical match (pg_trgm)
- 65%
[Eval][v2][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 64%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 47%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 45%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 18%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts