← All incidents

[Eval][v3][en] Regional 5xx spike after DNS TTL change

service: api-gatewaycreated: 6/3/2026, 12:04:37 AM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

A DNS change at 21:40 UTC for internal-services.example.com (TTL reduced from 300s to 30s, CNAME pointed to a new NLB created only in us-east-1) caused 35% of requests to api-gateway in us-west-2 to fail with HTTP 502 due to DNS resolution failures. The new NLB has no record in us-west-2, so pods that resolved the new target get NXDOMAIN or stale IPs, while ~30% of pods with cached old ALB IPs continue to work. us-east-1 is unaffected.

Severity reasoning: User-facing outage: 35% error rate (502) for >5 minutes (baseline 0.05%), exceeding the SEV1 threshold of >1% error rate for >5 min. Revenue path is broken for affected users in us-west-2.

deepseek-chat·prompt v3·output: en·15382ms·2035↑ / 2048↓ tok·$0.00280

Root cause hypotheses

  • highDNS change misconfiguration: new NLB created only in us-east-1, causing us-west-2 pods to fail DNS resolution for internal-services.example.com.

    Evidence: Logs show nginx in us-west-2 getting NXDOMAIN or stale IPs when resolving internal-services.example.com. DNS team confirmed the change at 21:40 UTC. New NLB exists only in us-east-1.

  • highStale DNS cache in some pods: pods with cached old ALB IPs continue to work, while others fail due to TTL reduction causing rapid cache expiration.

    Evidence: ~30% of pods have cached the old ALB IP and are working; rest are failing. TTL was reduced from 300s to 30s, accelerating cache turnover.

  • lowDNS propagation delay: the new NLB's DNS record hasn't propagated to us-west-2 DNS resolvers yet.

    Evidence: us-west-2 pods get NXDOMAIN for internal-services.example.com. However, the new NLB was never created in us-west-2, so propagation is not the issue.

  • lowNLB health check failure: new NLB in us-east-1 is unhealthy, causing DNS to return no records.

    Evidence: us-east-1 pods resolve correctly to the new NLB and work fine, so the NLB is healthy.

Investigation checklist

  1. Verify DNS resolution for internal-services.example.com from us-west-2 pods.
    kubectl exec -n prod -l app=api-gateway -- dig +short internal-services.example.com @<us-west-2-dns-resolver>

    Expected: Should return an IP address. Currently likely returns NXDOMAIN or an old IP.

  2. Check if the new NLB exists in us-west-2.
    aws elbv2 describe-load-balancers --region us-west-2 --names internal-services-nlb

    Expected: Should return load balancer details. Currently likely returns empty or error indicating no NLB in us-west-2.

  3. Check nginx upstream error logs for DNS resolution failures.
    kubectl logs -n prod -l app=api-gateway --since=30m | grep -E 'upstream timed out|connection timed out' | tail -20

    Expected: Should show 'upstream timed out (110: Connection timed out) while connecting to upstream' with hostname internal-services.example.com.

  4. Verify that the old ALB is still operational in us-west-2.
    aws elbv2 describe-target-health --region us-west-2 --target-group-arn <old-alb-tg-arn>

    Expected: Should show healthy targets. If old ALB is still up, it can be used as a fallback.

  5. Check DNS TTL for internal-services.example.com.
    dig +nocmd +noall +answer internal-services.example.com | awk '{print $2}'

    Expected: Should show TTL=30 (current) or TTL=300 (if reverted).

Mitigation plan

  • Revert DNS change: point internal-services.example.com back to the old ALB in us-west-2 and restore TTL to 300s.

    Risk: Low risk; old ALB is still operational. May cause brief DNS propagation delay.

    Rollback: Re-apply the new DNS record (CNAME to new NLB) if needed.

  • Create a new NLB in us-west-2 and update DNS to point to it (if revert is not desired).

    Risk: Medium risk: requires creating NLB and updating DNS; may take several minutes. If misconfigured, could worsen outage.

    Rollback: Revert DNS to old ALB.

  • Hot-patch us-west-2 pods to use a static IP or a working DNS resolver that returns the old ALB IP.

    Risk: Medium risk: manual patching may cause inconsistency; pods may need restart. Rollback by reverting the patch.

    Rollback: Revert the patch and restart pods.

Customer impact

Approximately 35% of users in the US West (Oregon) region are receiving HTTP 502 errors when accessing the API gateway. Affected users see 'Bad Gateway' errors and experience timeouts. Users in other regions are unaffected. Estimated 35% of traffic in us-west-2 is failing.

Postmortem draft

Postmortem: API Gateway 502 Errors in us-west-2

Summary

On [DATE], a DNS change for internal-services.example.com caused 35% of requests to api-gateway in us-west-2 to fail with HTTP 502. The new DNS target (a new NLB) was only created in us-east-1, leaving us-west-2 pods unable to resolve the hostname. The incident lasted [DURATION] and affected [NUMBER] users.

Timeline (UTC)

  • 21:40 - DNS team changes TTL from 300s to 30s and updates CNAME to new NLB (us-east-1 only).
  • 22:08 - Pager alerts on elevated 502 errors.
  • 22:10 - On-call engineer identifies DNS as root cause.
  • 22:12 - Debate on mitigation strategy.
  • [TIME] - DNS reverted to old ALB.
  • [TIME] - Error rate returns to baseline.

Impact

  • 35% error rate (502) for api-gateway in us-west-2.
  • p99 latency increased to 4s.
  • [NUMBER] users affected.

Root Cause

DNS change misconfiguration: new NLB was created only in us-east-1, causing us-west-2 pods to fail DNS resolution. TTL reduction from 300s to 30s accelerated cache expiration, causing more pods to fail quickly.

Detection

Pager alert on error rate threshold. On-call engineer correlated with recent DNS change.

Response

  • [TIME] - Incident declared.
  • [TIME] - DNS team contacted.
  • [TIME] - DNS reverted.
  • [TIME] - All clear.

What Went Well

  • Quick detection and correlation with DNS change.
  • DNS team responsive.

What Went Poorly

  • DNS change not validated in all regions before deployment.
  • No pre-change review process for DNS changes.

Action Items

  • [ ] Implement pre-change validation for DNS changes (e.g., ensure new targets exist in all regions).
  • [ ] Add monitoring for DNS resolution failures per region.
  • [ ] Review change management process for DNS modifications.
  • [ ] Consider using a service mesh or internal load balancer with regional awareness.

Follow-ups

  • P0Implement pre-change validation for DNS changes to ensure new targets exist in all required regions.platform team
  • P1Add monitoring for DNS resolution failures per region and per service.observability team
  • P1Review change management process for DNS modifications; require approval from SRE.SRE team
  • P2Evaluate using a service mesh (e.g., Istio) to handle internal traffic with regional awareness.platform team
  • P2Create a runbook for DNS-related incidents with specific rollback steps.on-call SRE