← All incidents

[Eval][v1][en] Regional 5xx spike after DNS TTL change

service: api-gatewaycreated: 6/3/2026, 12:00:11 AM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

A DNS change for internal-services.example.com at 21:40 UTC caused a partial outage in us-west-2: the new CNAME target (NLB) only exists in us-east-1, so ~65% of api-gateway pods in us-west-2 fail to resolve the new target and return 502 errors. The remaining 35% still have cached old ALB IPs and work. p99 latency spikes to 4s due to DNS timeouts.

Severity reasoning: User-facing outage: 35% of requests to api-gateway in us-west-2 return HTTP 502, impacting a significant subset of users. Revenue impact likely. Full outage of a critical system component (DNS resolution for internal service).

deepseek-chat·prompt v1·output: en·9893ms·1403↑ / 1433↓ tok·$0.00196

Root cause hypotheses

  • highDNS CNAME target points to an NLB that does not exist in us-west-2, causing NXDOMAIN for pods that flush their cache.

    Evidence: DNS team changed CNAME to new NLB at 21:40; new NLB only exists in us-east-1; nginx logs show NXDOMAIN or stale IPs for internal-services.example.com in us-west-2.

  • highTTL reduction from 300s to 30s caused rapid cache expiration, exposing the missing record in us-west-2.

    Evidence: TTL changed at same time as CNAME; 35% of pods still have old cached IP (working), rest fail; consistent with TTL-driven cache expiry.

  • lowDNS propagation delay: the new NLB record hasn't propagated to us-west-2 DNS resolvers yet.

    Evidence: us-west-2 pods get NXDOMAIN; us-east-1 works; but propagation should be fast for a CNAME change; less likely given 30 min elapsed.

Investigation checklist

  1. Verify DNS resolution from a failing pod in us-west-2.
    kubectl exec -n api-gateway <pod-name> -- nslookup internal-services.example.com

    Expected: NXDOMAIN or IP of new NLB (which is in us-east-1, not us-west-2).

  2. Check if the new NLB exists in us-west-2.
    aws elbv2 describe-load-balancers --region us-west-2 --names <new-nlb-name>

    Expected: Empty result or error indicating NLB not found in us-west-2.

  3. Check DNS records for internal-services.example.com.
    dig internal-services.example.com @<us-west-2-dns-resolver>

    Expected: CNAME pointing to new NLB DNS name (which resolves to us-east-1 IPs).

  4. Confirm old ALB still exists and is healthy.
    aws elbv2 describe-load-balancers --region us-west-2 --names <old-alb-name>

    Expected: ALB exists and is active.

Mitigation plan

  • Revert the DNS change: point internal-services.example.com back to the old ALB in us-west-2.

    Risk: Low risk; reverts to known working state. May cause brief propagation delay.

    Rollback: Re-apply the new CNAME if needed after fixing the NLB deployment.

  • Alternatively, create a new NLB in us-west-2 and update the CNAME to point to it (if NLB is required).

    Risk: Medium risk; requires creating NLB and updating DNS; may take longer.

    Rollback: Revert DNS to old ALB.

Customer impact

Approximately 35% of requests to api-gateway in us-west-2 are failing with HTTP 502 errors. Users may experience timeouts or errors when accessing services that route through api-gateway. us-east-1 users are unaffected. ETA: immediate mitigation via DNS revert expected within 5 minutes.

Postmortem draft

Postmortem: api-gateway 502 errors in us-west-2

Summary

Brief description of incident.

Timeline

  • 21:40 UTC: DNS team changes TTL and CNAME for internal-services.example.com
  • 22:08 UTC: Pager alert for 502 errors
  • 22:10 UTC: Incident declared
  • 22:12 UTC: Root cause identified
  • [Mitigation time]: DNS reverted

Impact

  • 35% of requests to api-gateway in us-west-2 returned 502
  • p99 latency 4s
  • Duration: [X minutes]

Root Cause

DNS CNAME pointed to an NLB that did not exist in us-west-2.

What Went Well

  • Monitoring detected the issue quickly
  • DNS team available to confirm change

What Went Poorly

  • No pre-deployment validation of DNS target existence in all regions
  • TTL reduction exacerbated impact

Action Items

  • [ ] Add pre-change validation for DNS targets (P0)
  • [ ] Implement canary DNS changes (P1)
  • [ ] Review TTL change process (P2)

Follow-ups

  • P0Add pre-change validation script to check DNS target existence in all regions.platform team
  • P1Implement canary DNS changes with gradual rollout.platform team
  • P1Review TTL change process and add approval for reductions below 60s.DNS team
  • P2Update runbook for DNS-related incidents with specific rollback steps.on-call SRE
  • P2Add monitoring for DNS resolution failures per region.platform team