
AWS US-EAST-1 Regional Control Plane Outage

Incident Date: October 19-20, 2025
Duration: ~15 hours 12 minutes
Root Cause: DNS resolution failure for DynamoDB endpoints
Services Affected: 142 AWS services globally

Overview

Between October 19, 2025 at 23:49 PDT and October 20, 2025 at 15:01 PDT, AWS experienced a major regional outage in US-EAST-1 (N. Virginia), resulting in elevated error rates and latencies across 142 services worldwide.
The outage demonstrated that many AWS “global” services still rely on US-EAST-1 control APIs, creating a hidden single point of dependency that affected customers globally, including major SaaS platforms like Snapchat, Roblox, Zoom, Lloyds Bank, and HMRC.
The root cause was a DNS resolution failure for dynamodb.us-east-1.amazonaws.com, which cascaded into the EC2 control plane and Network Load Balancer health subsystems, affecting dependent services globally. AWS confirmed full recovery by 15:01 PDT, with residual backlogs cleared over subsequent hours.

Root Cause Analysis

1. DNS Resolution Failure

Internal DNS systems in US-EAST-1 failed to resolve dynamodb.us-east-1.amazonaws.com, making the DynamoDB control plane unreachable.

2. EC2 Control Plane Impact

EC2’s internal control subsystems relied on DynamoDB tables for instance-launch metadata. Failed DNS lookups caused instance launch throttling and autoscaling failures.

3. NLB Health Check Failures

Network Load Balancer health check subsystems became impaired, breaking Lambda invocation paths and CloudWatch connectivity.

4. Global Service Degradation

Global services (IAM, STS, S3 control plane) experienced degraded API operations due to their architectural dependencies on US-EAST-1 control plane operations.

Primary Trigger

  • DNS resolution failure for dynamodb.us-east-1.amazonaws.com
  • DynamoDB API endpoints became unreachable within the region
  • No redundancy in DNS resolution for critical control plane services
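The trigger can be reproduced conceptually with a client-side resolution probe: when getaddrinfo fails for a service endpoint, that endpoint's control plane is unreachable regardless of whether its servers are healthy. A minimal sketch (the probe is illustrative, not an AWS tool):

```python
import socket

def endpoint_resolves(hostname: str, port: int = 443) -> bool:
    """Return True if DNS resolution succeeds for the endpoint."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        return False

# During the incident, a probe like this would have returned False for
# "dynamodb.us-east-1.amazonaws.com" while other endpoints still resolved.
```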

Cascading Impact

Compute (EC2, autoscaling, containers):
  • Instance launches failed or were severely throttled
  • Autoscaling groups couldn’t provision new capacity
  • Container orchestration platforms couldn’t schedule new tasks
  • Existing workloads continued running but couldn’t scale

Serverless (Lambda, SQS, Step Functions, API Gateway):
  • Lambda invocation errors due to NLB health check failures
  • SQS event processing delays
  • Step Functions executions stalled
  • API Gateway intermittent failures

Storage and databases (DynamoDB, RDS, S3):
  • DynamoDB API calls returning errors
  • RDS instance launches throttled
  • S3 control plane operations degraded (bucket creation, policy updates)
  • Data plane operations generally continued

Networking (NLB, VPC, Route 53):
  • Network Load Balancer health checks failed
  • Transient connectivity losses
  • VPC endpoint provisioning blocked
  • Route 53 private hosted zones impacted

Monitoring (CloudWatch):
  • Metric ingestion delays
  • Event backlog accumulation
  • Alarm evaluation delays
  • Log ingestion latency

Identity (IAM, STS):
  • Temporary credential propagation delays
  • Some authentication requests failed
  • Policy updates not propagating
  • Cross-account role assumptions intermittent

Timeline

All times are in PDT (UTC-7)
| Time | Event Summary |
| --- | --- |
| Oct 19, 23:49 | Start of increased error rates for multiple AWS services in US-EAST-1 |
| Oct 20, 00:26 | AWS identified DNS resolution failures for DynamoDB endpoints |
| Oct 20, 02:24 | DNS issue resolved; early signs of recovery observed |
| Oct 20, 03:35 | Most services functional; EC2 instance launches still throttled |
| Oct 20, 08:43 | Identified impaired internal NLB health subsystem; mitigation in progress |
| Oct 20, 09:38 | NLB health checks restored; connectivity recovery begins |
| Oct 20, 10:00–15:00 | Gradual unthrottling of EC2, Lambda, and SQS operations |
| Oct 20, 15:01 | All AWS services confirmed fully operational; minor backlogs remain |
Notice the long recovery tail: roughly 12 hours elapsed between the initial DNS fix (02:24) and confirmed full recovery (15:01). This highlights how cascading failures create complex recovery scenarios requiring careful, phased restoration.

Impact Assessment

Scope

  • Primary Region: US-EAST-1 (N. Virginia)
  • Duration: ~15 hours 12 minutes
  • Services Affected: 142 AWS services
  • Global Impact: Yes (via global service dependencies)

Severity by Service Category

| Service Category | Impact Level | Details |
| --- | --- | --- |
| Compute | 🔴 Critical | Instance launches failed/throttled; autoscaling impaired |
| Serverless | 🔴 Critical | Invocation errors; delayed SQS event processing |
| Storage | 🟡 High | DynamoDB API failures; S3 control plane degraded |
| Networking | 🔴 Critical | Health checks failed; transient connectivity loss |
| Monitoring | 🟡 High | Metric delays and event backlog |
| Identity | 🟡 High | Temporary propagation delays; some auth failures |

External Organizations Impacted

Consumer Apps

  • Snapchat
  • Roblox
  • Various mobile games

Business Services

  • Zoom video conferencing
  • Numerous SaaS platforms
  • B2B API services

Financial Services

  • Lloyds Bank (UK)
  • Various fintech platforms
  • Payment processors

Government

  • HMRC (UK tax authority)
  • Various gov.uk services
  • Public sector applications

Contributing Factors

Single-Region Control Plane Dependency

Many AWS global services anchor their control plane operations in US-EAST-1. This architectural decision creates a logical single point of failure for worldwide operations.
Services affected by this pattern:
  • IAM (Identity and Access Management)
  • STS (Security Token Service)
  • Organizations
  • S3 control plane
  • DynamoDB Global Tables metadata

Tight Service Coupling

Hidden DNS Dependency

The internal DNS layer was not isolated per-service, creating a shared dependency that amplified cross-impact:
  • Single DNS infrastructure serving all services
  • No circuit breakers for DNS failures
  • Services assumed DNS was always available
  • No graceful degradation for DNS lookup failures
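One mitigation for the last two points is a resolver that degrades gracefully: serve the last known-good answer when a fresh lookup fails, rather than failing hard. A sketch, assuming stale addresses are acceptable for a bounded window:

```python
import socket

class CachingResolver:
    """DNS lookups with a last-known-good fallback (graceful degradation)."""

    def __init__(self):
        self._last_good = {}  # hostname -> addresses from the last success

    def resolve(self, hostname, port=443, lookup=socket.getaddrinfo):
        try:
            addresses = lookup(hostname, port)
            self._last_good[hostname] = addresses
            return addresses
        except socket.gaierror:
            if hostname in self._last_good:
                # Serve stale-but-recent addresses instead of failing hard.
                return self._last_good[hostname]
            raise
```

In production the cache would need an age bound and metrics on fallback use, but even this shape removes the "DNS is always available" assumption.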

Assumed Control Plane Availability

Both AWS engineers and customers operated under the assumption that control plane APIs were always reachable. This incident proved that assumption false and highlighted the need for architectures that don’t require control plane access during recovery.

Architecture Diagrams

Simplified Impact Flow

| Color | Meaning |
| --- | --- |
| 🔴 Red/Pink | Root cause (DNS failure) |
| 🟡 Yellow | Regional impact (service degradation) |
| 🟢 Green | Global ripple effects (IAM, STS, S3) |

Full Impact - Fault Isolation View

Layered Impact Interpretation

| Layer | Summary | Key Insight |
| --- | --- | --- |
| DNS / Resolver | Root cause: internal DNS failure in us-east-1 | Even internal AWS DNS failures can cripple region-wide operations |
| Regional Control Plane | DynamoDB, EC2 launch systems, IAM token propagation impacted | Inter-service dependency on DynamoDB for control metadata |
| Regional Data Plane | NLB, Lambda, CloudWatch, ECS/EKS, RDS throttling or lag | Control plane degradation causes autoscaling and observability failures |
| Global Services | IAM, STS, S3, DynamoDB Global Tables affected globally | “Global” services still rely on US-EAST-1 control APIs |

Lessons Learned

For DevOps Teams

1. Separate Control and Data Planes

Avoid architectures that require control-plane API calls during recovery or scaling. Pre-provision resources before they’re needed.

Action Items:
  • Pre-create load balancers, buckets, and IAM roles
  • Use infrastructure as code to version resource definitions
  • Maintain “warm” standby resources in failover regions
2. Adopt Multi-Region Redundancy

Treat US-EAST-1 as a dependency risk, not a default region. Design for active-active or active-passive multi-region deployments.

Action Items:
  • Deploy critical workloads across multiple regions
  • Use Route 53 health checks for automatic failover
  • Test regional failover procedures quarterly
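The failover mechanism behind these items can be sketched client-side: Route 53 failover routing answers with the primary endpoint while its health check passes, and with the secondary otherwise. A minimal illustration (the endpoint names are hypothetical):

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint in priority order, mirroring
    Route 53 failover routing driven by health checks."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoints available")

# Hypothetical primary/secondary regional endpoints:
REGIONS = ["api.us-east-1.example.com", "api.eu-west-1.example.com"]
```

The same priority-order logic is worth replicating in clients as a belt-and-braces measure, since DNS-based failover is itself subject to TTLs and resolver caching.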
3. Use Regional Endpoints

Prefer regional STS endpoints over the global endpoint to avoid us-east-1 reliance.

Example:
# Bad: the global endpoint, which is served out of us-east-1
aws sts assume-role --endpoint-url https://sts.amazonaws.com \
    --role-arn <ROLE_ARN> --role-session-name <SESSION_NAME>

# Good: a regional endpoint
aws sts assume-role --endpoint-url https://sts.eu-west-1.amazonaws.com \
    --role-arn <ROLE_ARN> --role-session-name <SESSION_NAME>

# SDKs and the CLI can also default to regional STS endpoints:
export AWS_STS_REGIONAL_ENDPOINTS=regional
4. Test Control Plane Unavailability

Simulate “cannot create resources” scenarios in DR drills. Verify your application can survive without calling AWS APIs.

Chaos Engineering Tests:
  • Block all AWS API calls at the network level
  • Test application behavior when autoscaling fails
  • Verify manual recovery procedures work
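One way to run the “cannot create resources” drill in code is to wrap the client used for control-plane calls so the drill can force failures, then assert that the application degrades instead of crashing. A minimal sketch (names are illustrative, not an AWS API):

```python
class ControlPlaneBlocked(Exception):
    """Raised by the chaos wrapper to simulate an unreachable control plane."""

def chaos_wrap(api_call, blocked):
    """Wrap a control-plane call; fail it whenever the drill flag is set."""
    def wrapper(*args, **kwargs):
        if blocked:
            raise ControlPlaneBlocked("simulated control-plane outage")
        return api_call(*args, **kwargs)
    return wrapper

def scale_out(launch_instance, desired):
    """Try to add capacity; degrade gracefully if launches are blocked."""
    launched = 0
    for _ in range(desired):
        try:
            launch_instance()
            launched += 1
        except ControlPlaneBlocked:
            break  # keep serving at current capacity instead of crashing
    return launched
```

The drill passes when `scale_out` returns 0 without raising: existing capacity keeps serving, which is exactly the behavior AWS workloads exhibited during this incident.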

For DevSecOps Teams

Map Hidden Dependencies

Understand and document which services have true regional isolation vs. hidden dependencies on US-EAST-1.

Design for Graceful Degradation

Design services to operate in reduced-functionality mode when control plane APIs are unavailable.

Patterns:
  • Cache IAM policy decisions locally
  • Use long-lived credentials for emergency access
  • Implement circuit breakers for AWS API calls
  • Maintain local copies of critical configuration
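The circuit-breaker pattern from the list above can be sketched as follows; the thresholds and fallback are illustrative, and in practice `fn` would wrap an AWS SDK call while the fallback serves cached data:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency after `threshold` consecutive errors."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()      # open: fast-fail to cached/default data
            self.opened_at = None      # half-open: allow one real attempt
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

An open circuit stops a degraded AWS API from consuming retry budgets and request threads, which is what turned slow calls into cascading stalls during this outage.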
Detect Correlated Failures

Correlate CloudWatch and SNS events to detect multi-service failures early.

Implementation:
import logging

# Monitor for correlated service failures across event feeds
def detect_regional_outage(events, threshold=5):
    failing_services = [e for e in events if e['status'] == 'degraded']
    if len(failing_services) > threshold:  # many services degraded at once
        logging.critical('Possible regional outage detected')

For SRE Teams

Pre-Provision Failover Resources

Resources should already exist before an outage:
  • DNS records and hosted zones
  • S3 buckets for static content
  • Load balancers in multiple regions
  • IAM roles with trust relationships
  • VPC endpoints for critical services
Don’t rely on the AWS control plane to be available when you need it most.
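A lightweight drill for this checklist is a scheduled audit that every failover resource already exists, so a gap surfaces before an outage rather than during one. A sketch with an injected existence probe (in practice the probe would wrap boto3 describe/head calls against the failover region; the checklist entries are hypothetical):

```python
def audit_failover_resources(resources, exists):
    """Return the pre-provisioned failover resources that are missing.

    `exists` is an injected probe; in practice it would wrap boto3
    describe/head calls against the failover region.
    """
    return [r for r in resources if not exists(r)]

# Hypothetical failover readiness checklist:
CHECKLIST = [
    "route53:failover-record",
    "s3:static-content-bucket",
    "elb:standby-load-balancer",
    "iam:cross-account-role",
]
```

A non-empty result should page the owning team: an empty audit is the precondition for the "no control plane needed during recovery" posture described above.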

Key Takeaways

Control Plane Risk

AWS control plane APIs are not always available. Design for this reality.

US-EAST-1 Dependency

Many “global” services actually depend on US-EAST-1. This is a known architecture constraint.

Multi-Region Required

Single-region deployments are inherently fragile. Multi-region is not optional for critical workloads.

Pre-Provision Resources

Create failover infrastructure before you need it. Recovery is not the time to provision new resources.


Disclaimer: This case study is for educational and analytical purposes. All information is based on publicly available sources and official Amazon communications.
Author: Zepher Ashe
License: CC BY-NC-SA 4.0
Last Updated: November 21, 2025
