
[Eval][v2][zh] Regional 5xx spike after DNS TTL change

service: api-gateway
created: 6/3/2026, 12:03:36 AM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

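api-gateway in us-west-2 is returning HTTP 502 for 35% of requests, with p99 latency at 4 s. At 21:40 UTC the DNS team repointed the CNAME for internal-services.example.com to a new NLB that was created only in us-east-1, so pods in us-west-2 cannot resolve the name to a valid IP.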

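Severity reasoning: User-visible outage. The 35% error rate is far above the 1% threshold, has lasted more than 5 minutes, and affects the core API path, which meets the SEV1 criteria.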
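
The criteria quoted in the severity reasoning can be written down as a small check. This is only a sketch: `is_sev1` is a hypothetical helper, and the 1% error-rate and 5-minute thresholds are taken from the reasoning above, not from a documented severity policy.

```shell
#!/bin/sh
# Hypothetical helper (not part of any existing tooling).
# Arguments: error rate in percent, duration in minutes, "yes"/"no" for core API path.
# Returns 0 when all three SEV1 criteria quoted above are met, 1 otherwise.
is_sev1() {
    awk -v rate="$1" -v mins="$2" -v core="$3" 'BEGIN {
        met = (rate > 1 && mins > 5 && core == "yes")
        exit (met ? 0 : 1)
    }'
}

# Usage: is_sev1 35 6 yes && echo "SEV1"
```

With a 35% error rate on a core path, the check passes once the duration exceeds 5 minutes; the 0.05% baseline would not.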

deepseek-chat·prompt v2·output: zh·9976ms·2195↑ / 1309↓ tok·$0.00203

Root cause hypotheses

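  • high: The DNS CNAME points to an NLB that does not exist in us-west-2, so resolution fails

    Evidence: nginx logs show 'upstream timed out' and DNS resolution timeouts; no record for the new NLB exists in us-west-2

  • medium: The TTL change from 300s to 30s expired cached entries sooner, while the new record did not propagate correctly

    Evidence: about 30% of pods still use the cached legacy ALB IP and work normally, the rest fail; us-east-1 is healthy

  • low: Misconfigured health checks on the new NLB cause us-west-2 traffic to be rejected

    Evidence: none directly; listed so it can be ruled out. No NLB exists in us-west-2, so this hypothesis does not hold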

Investigation checklist

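  1. Confirm what internal-services.example.com resolves to from us-west-2
    # kubectl exec takes a pod name, not a label selector, so pick one pod by label first
    kubectl exec -n prod "$(kubectl get pods -n prod -l app=api-gateway -o name | head -1)" -- nslookup internal-services.example.com 2>/dev/null | head -20

    Expected: the new NLB's IPs. NXDOMAIN or the legacy ALB IP confirms the DNS problem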

  2. Check whether the new NLB exists in us-west-2
    aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name 2>&1

    Expected: the NLB's details. A 'LoadBalancerNotFound' error confirms that no NLB exists in us-west-2

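  3. Check the share of timeouts in the us-west-2 nginx error logs
    # --tail=-1 is needed because kubectl logs returns only the last 10 lines per pod when a selector is used
    kubectl logs -n prod -l app=api-gateway --since=30m --tail=-1 | grep -c 'Connection timed out'

    Expected: the count, divided by the number of requests in the same 30-minute window, is close to 35%, confirming that these timeouts are the main cause of the 502s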

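  4. Verify that DNS resolution is healthy in us-east-1
    # kubectl exec takes a pod name, not a label selector, so pick one pod by label first
    kubectl exec -n prod --context=us-east-1 "$(kubectl get pods -n prod --context=us-east-1 -l app=api-gateway -o name | head -1)" -- nslookup internal-services.example.com 2>/dev/null | head -20

    Expected: the new NLB's IPs, with no timeouts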
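
Because only part of the us-west-2 fleet is failing, a single-pod lookup can be misleading; tallying the outcome across all pods gives a clearer picture. The sketch below is a minimal example under stated assumptions: `classify_lookup` and `tally_outcomes` are hypothetical helper names, and the match patterns rely on common `nslookup` output strings (`NXDOMAIN`, `timed out`, a `Name:` line followed by an `Address` line), which differ between implementations such as BusyBox and bind-utils and may need adjusting. A `resolved` result does not distinguish a fresh answer from a stale legacy ALB IP.

```shell
#!/bin/sh
# Hypothetical helpers; names and match patterns are illustrative assumptions.

# Read nslookup output on stdin and print one label:
#   nxdomain | timeout | resolved | unknown
classify_lookup() {
    out=$(cat)
    case "$out" in
        *NXDOMAIN*)    echo nxdomain ;;
        *"timed out"*) echo timeout ;;
        *)
            # An "Address" line that follows a "Name:" line indicates an answer;
            # the resolver's own "Address" line appears before "Name:" and is skipped.
            if printf '%s\n' "$out" | awk '
                /^Name:/ { seen = 1; next }
                seen && /^Address/ { found = 1 }
                END { exit (found ? 0 : 1) }'; then
                echo resolved
            else
                echo unknown
            fi ;;
    esac
}

# Read one label per line on stdin; print "label count" lines sorted by label.
tally_outcomes() {
    awk 'NF { n[$1]++ } END { for (k in n) print k, n[k] }' | sort
}

# Usage (not executed here; requires cluster access):
#   for p in $(kubectl get pods -n prod -l app=api-gateway -o name); do
#       kubectl exec -n prod "$p" -- nslookup internal-services.example.com 2>&1 | classify_lookup
#   done | tally_outcomes
```

If the incident notes are accurate, the tally in us-west-2 should show a minority of `resolved` pods and a majority of `nxdomain` or `timeout` pods.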

Mitigation plan

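  • Roll back the DNS change immediately: point the internal-services.example.com CNAME back at the legacy ALB and restore the TTL to 300s

    Risk: low. While the rollback propagates, some clients may keep seeing errors until their cached answers expire

    Rollback: re-apply the current DNS change (CNAME to the new NLB)

  • Create a new NLB in us-west-2 and point the DNS record at it

    Risk: creating the NLB takes several minutes, during which the outage continues; security groups and health checks must be configured correctly

    Rollback: delete the us-west-2 NLB and roll DNS back to the legacy ALB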
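
If the zone is hosted in Route 53 (an assumption; the incident notes do not name the DNS provider), the first option could be expressed as an UPSERT change batch. This is a minimal sketch: `rollback_change_batch` is a hypothetical helper, and the legacy ALB DNS name and hosted zone ID are placeholders that must be supplied by whoever runs it.

```shell
#!/bin/sh
# Hypothetical helper: print a Route 53 change batch that points a CNAME
# at the given target with the given TTL. Assumes the zone lives in Route 53.
rollback_change_batch() {
    record_name=$1   # e.g. internal-services.example.com
    target=$2        # DNS name of the legacy ALB (placeholder)
    ttl=$3           # e.g. 300
    cat <<EOF
{
  "Comment": "Roll back CNAME to legacy ALB",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${target}" }]
      }
    }
  ]
}
EOF
}

# Usage (not executed here; LEGACY_ALB_DNS_NAME and HOSTED_ZONE_ID are placeholders):
#   rollback_change_batch internal-services.example.com "$LEGACY_ALB_DNS_NAME" 300 > rollback.json
#   aws route53 change-resource-record-sets \
#       --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch file://rollback.json
```

Generating the batch as a file keeps the exact change reviewable before it is applied, and the same helper can produce the re-apply batch by passing the new NLB name and a 30s TTL.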

Customer impact

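About 35% of API requests in us-west-2 are failing: users receive 502 errors and p99 latency has risen to 4 s. Every service in the region that depends on the API gateway is affected. Recovery is expected within 10-15 minutes of the DNS rollback.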

Postmortem draft

Summary

[FILL IN]

Timeline (UTC)

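  • 21:40 - DNS team changes the CNAME and TTL for internal-services.example.com
  • 22:08 - Alert received
  • 22:10 - 35% 502 error rate confirmed
  • 22:11 - DNS team confirms the change
  • 22:12 - Rollback options discussed
  • [FILL IN] - Rollback executed
  • [FILL IN] - Error rate recovers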

Impact

35% of API requests in us-west-2 failed, with p99 latency of 4 s, for about [X] minutes.

Root Cause

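The DNS CNAME was pointed at a new NLB that had been created only in us-east-1, so pods in us-west-2 could not resolve the name to a valid IP, which produced a large volume of 502 errors.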

Detection

A monitoring alert fired when the error rate exceeded its threshold.

Response

[FILL IN]

What Went Well

[FILL IN]

What Went Poorly

[FILL IN]

Action Items

[FILL IN]

Follow-ups

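  • P0: Add a cross-region verification step to the DNS change process so that new records exist in every region (owner: platform team)
  • P1: Add alerting on DNS resolution failures, covering NXDOMAIN and timeouts (owner: on-call SRE)
  • P1: Review the approval process for DNS TTL changes so that fast changes cannot ship unverified (owner: service owner)
  • P2: Add fallback DNS resolution to the API gateway, using a local cache or static IPs (owner: platform team)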
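
The P0 cross-region verification step could take the form of a pre-change gate that refuses to proceed unless the new target exists in every region that serves traffic. The sketch below is one possible shape, not an existing tool: `verify_all_regions` and `nlb_exists` are hypothetical names, `NLB_NAME` is a placeholder, and the region list is assumed to be the two regions named in this incident.

```shell
#!/bin/sh
# Hypothetical pre-change gate; names are illustrative assumptions.
# $1 is a command or function that takes one region and returns 0 if the
# target exists there; the remaining arguments are the regions to check.
verify_all_regions() {
    check_cmd=$1
    shift
    missing=""
    for region in "$@"; do
        if ! "$check_cmd" "$region" >/dev/null 2>&1; then
            missing="$missing $region"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing in:$missing"
        return 1
    fi
    echo "ok"
}

# Example check using the AWS CLI; NLB_NAME is a placeholder.
nlb_exists() {
    aws elbv2 describe-load-balancers --region "$1" --names "$NLB_NAME"
}

# Usage (not executed here; requires AWS credentials):
#   verify_all_regions nlb_exists us-east-1 us-west-2 || echo "abort DNS change"
```

Passing the check as an argument keeps the gate independent of the resource type, so the same loop could later verify DNS records or target groups instead of load balancers.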