[Eval][v2][zh] Regional 5xx spike after DNS TTL change
service: api-gateway
created: 5/25/2026, 10:23:20 PM
Raw incident context
Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.
Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"
Recent changes:
- 21:40 UTC: DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only; no us-west-2 record exists
Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB
On-call:
- 22:08 - pager
- 22:11 - DNS team confirms the planned change
- 22:12 - debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB
Summary
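At 21:40 UTC the DNS team repointed the CNAME for internal-services.example.com to a new NLB that is deployed only in us-east-1. As a result, api-gateway in us-west-2 can no longer resolve the service address: 35% of requests return 502 and p99 latency has risen to 4s. About 30% of pods still hold the old ALB IP in cache and are working; the rest fail with NXDOMAIN or timeouts.
Severity reasoning: A user-visible 502 rate of 35% is far above the 1% threshold and has lasted more than 5 minutes. This is a user-facing outage and meets the SEV1 definition.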
deepseek-chat·prompt v2·output: zh·11271ms
Root cause hypotheses
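- high: The DNS CNAME points to an NLB that exists only in us-east-1; us-west-2 has no corresponding record, so resolution fails.
Evidence: nginx logs show 'upstream timed out (110: Connection timed out) while connecting to upstream', and DNS queries return NXDOMAIN.
- medium: The TTL change from 300s to 30s made caches expire quickly, but the new record has not propagated correctly to the DNS resolvers in us-west-2.
Evidence: The TTL change was made at 21:40 UTC, and at 22:10 about 30% of pods are still using the old cached IP. Some caches have therefore not expired, while the new record has not taken effect.
- low: The api-gateway pods in us-west-2 have a misconfigured DNS resolver and cannot look up the A record of the new NLB.
Evidence: All us-east-1 pods work normally, so DNS is configured correctly there, but us-west-2 may be using a different upstream DNS server.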
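The cache-expiry reasoning behind the medium hypothesis can be sanity-checked with simple arithmetic. The sketch below counts how many full TTL windows fit between the 21:40 change and the 22:10 observation; the function name `ttl_periods_elapsed` is illustrative and not part of any tool mentioned in this report.

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): number of full TTL windows that
# have elapsed between two HH:MM timestamps on the same UTC day.
ttl_periods_elapsed() {
  local start="$1" end="$2" ttl="$3"
  # 10# forces base 10 so values such as "08" are not parsed as octal.
  local s=$(( (10#${start%%:*} * 60 + 10#${start##*:}) * 60 ))
  local e=$(( (10#${end%%:*} * 60 + 10#${end##*:}) * 60 ))
  echo $(( (e - s) / ttl ))
}

ttl_periods_elapsed 21:40 22:10 300   # old TTL: prints 6
ttl_periods_elapsed 21:40 22:10 30    # new TTL: prints 60
```

Even the old 300s TTL has lapsed six times by 22:10, so a cache that honors TTLs should no longer hold the old answer. The pods still reaching the old ALB are therefore more likely holding the address outside TTL-based caching. One possibility worth checking is nginx itself, which resolves hostnames in its upstream configuration when the configuration is loaded and keeps them until a reload.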
Investigation checklist
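- Check what us-west-2 api-gateway pods resolve
kubectl exec -n prod "$(kubectl get pods -n prod -l app=api-gateway -o name | head -n 1)" -- nslookup internal-services.example.com 2>/dev/null | head -20
Expected: A healthy pod returns the IP address of a new NLB in us-west-2. NXDOMAIN or the old ALB IP confirms the problem.
- Verify whether the new NLB exists in us-west-2
aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name 2>&1
Expected: A 'LoadBalancerNotFound' error confirms that the NLB is not deployed in us-west-2.
- Check the resolver configuration and DNS caching on us-west-2 api-gateway pods
kubectl exec -n prod "$(kubectl get pods -n prod -l app=api-gateway -o name | head -n 1)" -- cat /etc/resolv.conf
Expected: Confirms which DNS servers the pod uses. Also check the cache lifetime if a caching daemon such as nscd is in use.
- Compare DNS resolution between us-east-1 and us-west-2 (run once against each region's cluster)
kubectl exec -n prod "$(kubectl get pods -n prod -l app=api-gateway -o name | head -n 1)" -- nslookup internal-services.example.com 2>/dev/null | grep Address
Expected: us-east-1 returns the new NLB IP. us-west-2 returns the old ALB IP or, on NXDOMAIN, no address beyond the resolver's own first Address line.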
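Because only some pods are failing, checking a single pod can mislead. The sketch below runs the same lookup on every api-gateway pod in a cluster and labels each result. It assumes bind-utils style `nslookup` output inside the pods and one kubectl context per region; the function names `classify_lookup` and `survey_region` and any context names are hypothetical.

```shell
#!/usr/bin/env bash
# Classify nslookup output read from stdin as
# NXDOMAIN, TIMEOUT, RESOLVED <ip>, or UNKNOWN.
classify_lookup() {
  awk '
    /NXDOMAIN/                              { print "NXDOMAIN"; found = 1; exit }
    /timed out|no servers could be reached/ { print "TIMEOUT"; found = 1; exit }
    /^Name:/                                { in_answer = 1; next }
    in_answer && /^Address/                 { print "RESOLVED " $NF; found = 1; exit }
    END                                     { if (!found) print "UNKNOWN" }
  '
}

# Run the lookup on every api-gateway pod reachable through one kubectl context.
# Usage: survey_region <kubectl-context> [namespace] [hostname]
survey_region() {
  local ctx="$1" ns="${2:-prod}" host="${3:-internal-services.example.com}"
  local pod result
  kubectl --context "$ctx" -n "$ns" get pods -l app=api-gateway -o name |
    while read -r pod; do
      # </dev/null keeps kubectl exec from consuming the pod list on stdin.
      result=$(kubectl --context "$ctx" -n "$ns" exec "$pod" -- \
        nslookup "$host" </dev/null 2>&1 | classify_lookup)
      printf '%s\t%s\t%s\n' "$ctx" "$pod" "$result"
    done
}
```

Calling `survey_region` once per region and piping the output through `cut -f3 | sort | uniq -c` gives a per-region count of resolving versus failing pods. The resolver's own `Address` line is skipped because only addresses that follow a `Name:` line are reported.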
Mitigation plan
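Roll back the DNS change immediately: point the CNAME for internal-services.example.com back at the legacy ALB and restore the TTL to 300s.
Risk: While the rollback propagates, some users may continue to see 502s, but the affected share shrinks steadily. No risk of data loss.
Rollback: Re-apply the current (faulty) DNS record pointing at the us-east-1 NLB, but only after an NLB has been deployed in us-west-2.
Create a new NLB in us-west-2 and point the DNS record at it.
Risk: Creating the NLB takes several minutes, during which the 502s continue. Security groups and health checks must be configured correctly.
Rollback: Delete the newly created NLB and roll DNS back to the legacy ALB.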
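For the first option, the DNS rollback could look roughly like the sketch below. It assumes the zone is hosted in Amazon Route 53 and that the record is a plain CNAME, neither of which the incident context confirms. The helper name `rollback_change_batch` and the `HOSTED_ZONE_ID` and `LEGACY_ALB_DNS` variables are placeholders to fill in.

```shell
#!/usr/bin/env bash
# Build a Route 53 change batch that restores the CNAME and the 300s TTL.
# The target argument is a placeholder for the legacy ALB's real DNS name.
rollback_change_batch() {
  local target="$1" ttl="${2:-300}"
  cat <<EOF
{
  "Comment": "Roll back internal-services.example.com to the legacy ALB",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "internal-services.example.com",
      "Type": "CNAME",
      "TTL": ${ttl},
      "ResourceRecords": [{ "Value": "${target}" }]
    }
  }]
}
EOF
}

# Usage (assumption: Route 53; both variables must be set by the operator):
# aws route53 change-resource-record-sets \
#   --hosted-zone-id "$HOSTED_ZONE_ID" \
#   --change-batch "$(rollback_change_batch "$LEGACY_ALB_DNS")"
```

If the zone uses region-specific or alias record sets, the batch would need the matching routing fields as well, so review the generated JSON against the existing record before applying it.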
Customer impact
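About 35% of user requests in us-west-2 fail with HTTP 502. Affected users are estimated at 15-20% of the global user base; all other users are unaffected. Recovery is expected within 15 minutes of the DNS rollback.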
Postmortem draft
Summary
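At 21:40 UTC the DNS team repointed the CNAME for internal-services.example.com to a new NLB that is deployed only in us-east-1. As a result, api-gateway in us-west-2 could no longer resolve the service address: 35% of requests returned 502 and p99 latency rose to 4s. About 30% of pods still held the old ALB IP in cache and kept working; the rest failed with NXDOMAIN or timeouts.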
Timeline (UTC)
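- 21:40 - DNS team changes the CNAME and TTL for internal-services.example.com
- 22:08 - Pager fires
- 22:10 - 502 error rate confirmed at 35%
- 22:11 - DNS team confirms the change
- 22:12 - Rollback options discussed
- [FILL IN] - DNS rollback completed
- [FILL IN] - Error rate recovered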
Impact
- 35% of requests in us-west-2 failed for about [FILL IN] minutes
- p99 latency rose from a baseline of [FILL IN] to 4s
- Estimated number of affected users: [FILL IN]
Root Cause
The DNS CNAME pointed to an NLB that exists only in us-east-1; us-west-2 had no corresponding record, so resolution failed.
Detection
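- Monitoring alert: the 502 error rate exceeded its threshold and triggered the pager
- Detection delay: about 28 minutes from the change to the alert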
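The 28-minute figure follows directly from the timeline (change at 21:40, pager at 22:08). A minimal sketch of the calculation, using an illustrative helper name `minutes_between`, is below; it can be reused for the [FILL IN] durations once the rollback and recovery times are known.

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): minutes between two HH:MM
# timestamps on the same UTC day.
minutes_between() {
  local start="$1" end="$2"
  # 10# forces base 10 so "08" is not rejected as an invalid octal number.
  local s=$(( 10#${start%%:*} * 60 + 10#${start##*:} ))
  local e=$(( 10#${end%%:*} * 60 + 10#${end##*:} ))
  echo $(( e - s ))
}

minutes_between 21:40 22:08   # change to pager: prints 28
```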
Response
- 22:08 - Alert received
- 22:11 - DNS change confirmed as the root cause
- [FILL IN] - DNS rollback executed
What Went Well
- Monitoring alerts were timely
- The DNS team responded quickly
What Went Poorly
- The DNS change was not validated against the multi-region deployment
- The change was not communicated to the SRE team
Action Items
- [FILL IN] Add automated validation steps for DNS changes
- [FILL IN] Create a multi-region DNS deployment checklist
Follow-ups
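- P0: Add automated validation steps for DNS changes so that the CNAME target is confirmed to exist in every region (Platform team)
- P1: Create a multi-region DNS deployment checklist that requires every region to be verified before a change (SRE team)
- P2: Tune alert thresholds to reduce detection delay (Monitoring team)
- P1: Review the DNS change process and add an SRE approval step (Platform team)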
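The P0 follow-up could start as a pre-change gate along the lines of the sketch below. It is only a starting point under stated assumptions: the function names `missing_regions` and `regions_with_lb` are hypothetical, the load balancer is assumed to carry the same name in every region, and `new-nlb-name` is the same placeholder used in the investigation checklist.

```shell
#!/usr/bin/env bash
# Print every required region that is absent from the list of regions
# where the target exists. Both arguments are whitespace-separated lists.
missing_regions() {
  local required="$1" region
  # Normalize the second list to " a b c " so whole-word matching works
  # even when the input is newline-separated.
  local present=" $(printf '%s ' $2)"
  for region in $required; do
    case "$present" in
      *" $region "*) ;;
      *) echo "$region" ;;
    esac
  done
}

# List the regions in which a load balancer with the given name exists.
# Usage: regions_with_lb <lb-name> <region>...
regions_with_lb() {
  local name="$1" region
  shift
  for region in "$@"; do
    if aws elbv2 describe-load-balancers --region "$region" \
         --names "$name" >/dev/null 2>&1; then
      echo "$region"
    fi
  done
}

# Example gate (not executed here):
# gaps=$(missing_regions "us-east-1 us-west-2" \
#          "$(regions_with_lb new-nlb-name us-east-1 us-west-2)")
# [ -z "$gaps" ] || { echo "CNAME target missing in: $gaps" >&2; exit 1; }
```

Run against the state described in this incident, such a gate would report us-west-2 as missing and block the CNAME change before it reached production.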
Similar past incidents
lexical match (pg_trgm)
- 64%
[Eval][v1][zh] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 44%
[Eval][v1][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 43%
[Eval][v2][en] Regional 5xx spike after DNS TTL change
us-west-2 region: 35% 502 errors, p99 4s. us-east-1: normal. New DNS record deployed 30min before incident.
- 17%
[Eval][v1][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts
- 17%
[Eval][v2][en] Payment service connection pool exhaustion after batch job deploy
p99 latency 4.8s (up from 120ms), 12% 500 error rate, customers report failed checkouts