← All incidents

[Eval][v3][en] Regional 5xx spike after DNS TTL change

service: api-gatewaycreated: 6/2/2026, 11:40:28 PM

Raw incident context

Time: 22:10 UTC. api-gateway in us-west-2 returning HTTP 502 for ~35% of requests. us-east-1 unaffected.

Symptoms (us-west-2 only):
- 502 rate: 35% (baseline 0.05%)
- p99 latency: 4s (mostly DNS resolution timeouts in error trace)
- nginx upstream errors: "upstream timed out (110: Connection timed out) while connecting to upstream"

Recent changes:
- 21:40 UTC — DNS team changed TTL for internal-services.example.com from 300s → 30s as part of a planned migration
- Also changed CNAME target: internal-services.example.com now points to a NEW NLB (was pointing to a legacy ALB)
- Old ALB still up and answering, but its DNS records were not removed
- New NLB was created in us-east-1 only — no us-west-2 record exists

Logs:
- nginx in us-west-2 trying to resolve internal-services.example.com → getting NXDOMAIN or stale IPs
- ~30% of pods have cached the old ALB IP and are working; rest are failing
- us-east-1 pods all resolve correctly to the new NLB

On-call:
- 22:08 — pager
- 22:11 — DNS team confirms the planned change
- 22:12 — debate: revert DNS or hot-patch us-west-2 to point at us-east-1 NLB

Summary

SEV1

A DNS change at 21:40 UTC for internal-services.example.com (TTL reduced from 300s to 30s, CNAME target changed to a new NLB) caused a regional outage in us-west-2. The new NLB was only created in us-east-1, so us-west-2 pods fail to resolve the new target, resulting in 35% 502 errors and elevated latency. The old ALB still exists but its DNS records were removed, causing NXDOMAIN for pods that don't have cached old IPs.

Severity reasoning: User-facing outage: 35% error rate (exceeds 1% threshold) for >5 minutes, affecting a major region (us-west-2). Revenue path likely broken due to high error rate. This meets SEV1 criteria.

deepseek-chat·prompt v3·output: en·10540ms·2035↑ / 1585↓ tok·$0.00229

Root cause hypotheses

  • highNew NLB not deployed in us-west-2, causing DNS resolution failures for pods that flush cache

    Evidence: Logs show nginx in us-west-2 getting NXDOMAIN or stale IPs for internal-services.example.com. us-east-1 works because NLB exists there.

  • highDNS TTL reduction from 300s to 30s caused rapid cache expiration, exposing the missing NLB in us-west-2

    Evidence: Change occurred at 21:40 UTC; incident started around 22:10 UTC, consistent with TTL expiration window. 30% of pods still working (cached old ALB IP).

  • mediumOld ALB DNS records were removed, preventing fallback resolution

    Evidence: Context states 'its DNS records were not removed' but later says 'NXDOMAIN' which implies removal. Contradiction; likely records were removed or not published.

  • lowDNS propagation delay causing inconsistent resolution across pods

    Evidence: Some pods have cached old IPs and work; others fail. This is consistent with gradual cache expiration, not propagation delay.

Investigation checklist

  1. Verify DNS resolution for internal-services.example.com from us-west-2 pods
    kubectl exec -n prod -l app=api-gateway -- nslookup internal-services.example.com 2>&1 | head -20

    Expected: Should return NXDOMAIN or an IP not in us-west-2 (e.g., us-east-1 NLB IP). If old ALB IP appears, cache is stale.

  2. Check if new NLB exists in us-west-2
    aws elbv2 describe-load-balancers --region us-west-2 --names new-nlb-name 2>&1

    Expected: Should return 'LoadBalancerNotFound' if not deployed.

  3. Check nginx error logs for upstream connection failures
    kubectl logs -n prod -l app=api-gateway --since=30m | grep -i 'upstream timed out' | tail -20

    Expected: Multiple entries showing connection timeouts to upstream.

  4. Check DNS TTL configuration for internal-services.example.com
    dig internal-services.example.com +short +ttlid 2>&1

    Expected: TTL should be 30s (new value) or 300s (old).

  5. Verify old ALB is still operational
    aws elbv2 describe-target-health --region us-west-2 --target-group-arn <old-alb-tg-arn> 2>&1

    Expected: Targets should be healthy if ALB is still up.

Mitigation plan

  • Revert DNS change: point internal-services.example.com back to old ALB CNAME with original TTL (300s)

    Risk: Low risk; old ALB is still up. May cause brief propagation delay.

    Rollback: Re-apply the new DNS change after fixing NLB deployment.

  • Deploy new NLB in us-west-2 and update DNS to point to it

    Risk: Medium risk; takes time to create NLB and update DNS. During creation, outage continues.

    Rollback: Revert DNS to old ALB if new NLB fails.

  • Hot-patch us-west-2 pods to use us-east-1 NLB IP directly via /etc/hosts or environment variable

    Risk: Medium risk; cross-region latency increase, potential for asymmetric routing. Requires pod restart.

    Rollback: Remove the patch and revert to DNS resolution.

Customer impact

Approximately 35% of requests to us-west-2 are failing with HTTP 502 errors. Users experience timeouts or errors when accessing services. us-east-1 users are unaffected. No ETA yet.

Postmortem draft

Summary

A planned DNS migration caused a regional outage in us-west-2 due to missing NLB deployment in that region.

Timeline (UTC)

  • 21:40 - DNS team changes TTL and CNAME for internal-services.example.com
  • 22:08 - Pager alerts for elevated 502 errors
  • 22:10 - Incident declared
  • [FILL IN] - Mitigation applied

Impact

  • 35% error rate in us-west-2 for ~[FILL IN] minutes
  • p99 latency 4s
  • [FILL IN] users affected

Root Cause

New NLB was only created in us-east-1. DNS change caused us-west-2 pods to fail resolution after cache expired.

Detection

Monitoring alerted on 502 error rate spike.

Response

  • [FILL IN] actions taken

What Went Well

  • Monitoring detected quickly
  • DNS team available for context

What Went Poorly

  • No pre-deployment verification of NLB in all regions
  • DNS change not coordinated with infrastructure team

Action Items

  • [FILL IN]

Follow-ups

  • P0Ensure new NLB is deployed in all regions before DNS cutoverinfrastructure team
  • P1Add pre-change validation checklist for DNS migrationsplatform team
  • P1Improve monitoring to detect regional DNS resolution failuresobservability team
  • P2Review DNS change process to require multi-region verificationSRE team