AWS US-EAST-1 Regional Control Plane Outage
Incident Date: October 19-20, 2025
Duration: ~15 hours 12 minutes
Root Cause: DNS resolution failure for DynamoDB endpoints
Services Affected: 142 AWS services globally
Overview
Between October 19, 2025 at 23:49 PDT and October 20, 2025 at 15:01 PDT, AWS experienced a major regional outage in US-EAST-1 (N. Virginia), resulting in elevated error rates and latencies across 142 services worldwide. The root cause was a DNS resolution failure for dynamodb.us-east-1.amazonaws.com, which cascaded into the EC2 control plane and Network Load Balancer health subsystems, affecting dependent services globally.
AWS confirmed full recovery by 15:01 PDT, with residual backlogs cleared over subsequent hours.
Root Cause Analysis
DNS Resolution Failure
Internal DNS systems in US-EAST-1 failed to resolve dynamodb.us-east-1.amazonaws.com, making the DynamoDB control plane unreachable.
EC2 Control Plane Impact
EC2’s internal control subsystems relied on DynamoDB tables for instance-launch metadata. Failed DNS lookups caused instance launch throttling and autoscaling failures.
NLB Health Check Failures
Network Load Balancer health check subsystems became impaired, breaking Lambda invocation paths and CloudWatch connectivity.
Primary Trigger
- DNS resolution failure for dynamodb.us-east-1.amazonaws.com
- DynamoDB API endpoints became unreachable within the region
- No redundancy in DNS resolution for critical control plane services
Cascading Impact
Compute Layer (EC2, ECS, EKS)
- Instance launches failed or were severely throttled
- Autoscaling groups couldn’t provision new capacity
- Container orchestration platforms couldn’t schedule new tasks
- Existing workloads continued running but couldn’t scale
Serverless Layer (Lambda, SQS)
- Lambda invocation errors due to NLB health check failures
- SQS event processing delays
- Step Functions executions stalled
- API Gateway intermittent failures
Storage Layer (DynamoDB, RDS, S3)
- DynamoDB API calls returning errors
- RDS instance launches throttled
- S3 control plane operations degraded (bucket creation, policy updates)
- Data plane operations generally continued
Networking Layer (NLB, VPC)
- Network Load Balancer health checks failed
- Transient connectivity losses
- VPC endpoint provisioning blocked
- Route 53 private hosted zones impacted
Observability Layer (CloudWatch, EventBridge)
- Metric ingestion delays
- Event backlog accumulation
- Alarm evaluation delays
- Log ingestion latency
Identity Layer (IAM, STS, Organizations)
- Temporary credential propagation delays
- Some authentication requests failed
- Policy updates not propagating
- Cross-account role assumptions intermittent
Timeline
All times are in PDT (UTC-7)
| Time | Event Summary |
|---|---|
| Oct 19, 23:49 | Start of increased error rates for multiple AWS services in US-EAST-1 |
| Oct 20, 00:26 | AWS identified DNS resolution failures for DynamoDB endpoints |
| Oct 20, 02:24 | DNS issue resolved; early signs of recovery observed |
| Oct 20, 03:35 | Most services functional; EC2 instance launches still throttled |
| Oct 20, 08:43 | Identified impaired internal NLB health subsystem; mitigation in progress |
| Oct 20, 09:38 | NLB health checks restored; connectivity recovery begins |
| Oct 20, 10:00-15:00 | Gradual unthrottling of EC2, Lambda, and SQS operations |
| Oct 20, 15:01 | All AWS services confirmed fully operational; minor backlogs remain |
Impact Assessment
Scope
- Primary Region: US-EAST-1 (N. Virginia)
- Duration: ~15 hours 12 minutes
- Services Affected: 142 AWS services
- Global Impact: Yes (via global service dependencies)
Severity by Service Category
| Service Category | Impact Level | Details |
|---|---|---|
| Compute | 🔴 Critical | Instance launches failed/throttled; autoscaling impaired |
| Serverless | 🔴 Critical | Invocation errors; delayed SQS event processing |
| Storage | 🟡 High | DynamoDB API failures; S3 control plane degraded |
| Networking | 🔴 Critical | Health checks failed; transient connectivity loss |
| Monitoring | 🟡 High | Metric delays and event backlog |
| Identity | 🟡 High | Temporary propagation delays; some auth failures |
External Organizations Impacted
Consumer Apps
- Snapchat
- Roblox
- Various mobile games
Business Services
- Zoom video conferencing
- Numerous SaaS platforms
- B2B API services
Financial Services
- Lloyds Bank (UK)
- Various fintech platforms
- Payment processors
Government
- HMRC (UK tax authority)
- Various gov.uk services
- Public sector applications
Contributing Factors
Single-Region Control Plane Dependency
Services affected by this pattern:
- IAM (Identity and Access Management)
- STS (Security Token Service)
- Organizations
- S3 control plane
- DynamoDB Global Tables metadata
Tight Service Coupling
Hidden DNS Dependency
The internal DNS layer was not isolated per-service, creating a shared dependency that amplified cross-impact:
- Single DNS infrastructure serving all services
- No circuit breakers for DNS failures
- Services assumed DNS was always available
- No graceful degradation for DNS lookup failures
Assumed Control Plane Availability
Both AWS engineers and customers operated under the assumption that control plane APIs were always reachable. This incident proved that assumption false and highlighted the need for architectures that don’t require control plane access during recovery.
Architecture Diagrams
Simplified Impact Flow
Diagram Legend
| Color | Meaning |
|---|---|
| 🔴 Red/Pink | Root cause (DNS failure) |
| 🟡 Yellow | Regional impact (service degradation) |
| 🟢 Green | Global ripple effects (IAM, STS, S3) |
Full Impact - Fault Isolation View
Layered Impact Interpretation
| Layer | Summary | Key Insight |
|---|---|---|
| DNS / Resolver | Root cause – internal DNS failure in us-east-1 | Even internal AWS DNS failures can cripple region-wide operations |
| Regional Control Plane | DynamoDB, EC2 launch systems, IAM token propagation impacted | Inter-service dependency on DynamoDB for control metadata |
| Regional Data Plane | NLB, Lambda, CloudWatch, ECS/EKS, RDS throttling or lag | Control plane degradation causes autoscaling and observability failures |
| Global Services | IAM, STS, S3, DynamoDB Global Tables affected globally | “Global” services still rely on US-EAST-1 control APIs |
Lessons Learned
For DevOps Teams
Separate Control and Data Planes
Avoid architectures that require control-plane API calls during recovery or scaling. Pre-provision resources before they’re needed.
Action Items:
- Pre-create load balancers, buckets, and IAM roles
- Use infrastructure as code to version resource definitions
- Maintain “warm” standby resources in failover regions
Adopt Multi-Region Redundancy
Treat US-EAST-1 as a dependency risk, not a default region. Design for active-active or active-passive multi-region deployments.
Action Items:
- Deploy critical workloads across multiple regions
- Use Route 53 health checks for automatic failover
- Test regional failover procedures quarterly
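The failover decision itself can be sketched in a few lines, assuming health signals fed by Route 53 health checks or your own synthetic probes (function and argument names are illustrative):

```python
def pick_active_region(health, preference):
    """Return the first healthy region in preference order.

    health: region name -> bool, e.g. derived from Route 53 health
    checks or synthetic probes against each region's endpoints.
    preference: regions in failover order, primary first.
    """
    for region in preference:
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")
```

During this incident, a preference of ["us-east-1", "us-west-2"] with us-east-1 marked unhealthy would have routed new work to us-west-2.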
Use Regional Endpoints
Prefer regional STS endpoints over the global endpoint to avoid us-east-1 reliance.
Test Control Plane Unavailability
Simulate “cannot create resources” scenarios in DR drills. Verify your application can survive without calling AWS APIs.
Chaos Engineering Tests:
- Block all AWS API calls at the network level
- Test application behavior when autoscaling fails
- Verify manual recovery procedures work
For DevSecOps Teams
Document Fault Isolation Boundaries
Understand and document which services have true regional isolation vs. hidden dependencies on US-EAST-1.
Resources:
- AWS Fault Isolation Boundaries Whitepaper
- Create internal documentation mapping your critical paths
- Review AWS service quotas and limits across all regions
Implement Graceful Degradation
Design services to operate in reduced-functionality mode when control plane APIs are unavailable.
Patterns:
- Cache IAM policy decisions locally
- Use long-lived credentials for emergency access
- Implement circuit breakers for AWS API calls
- Maintain local copies of critical configuration
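The circuit-breaker pattern from the list above, in a minimal sketch (thresholds and naming are illustrative):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated errors instead of hammering a dead API.

    After max_failures consecutive errors the breaker opens; calls are
    rejected immediately until reset_seconds have elapsed, after which
    one trial call is allowed through (half-open state).
    """

    def __init__(self, max_failures=5, reset_seconds=30.0):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Wrapping AWS SDK calls this way keeps a regional outage from consuming retry budgets and thread pools across the whole fleet.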
Improve Observability
Correlate CloudWatch and SNS events to detect multi-service failures early.
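One possible implementation: bucket failure events into time windows and flag windows where several distinct services fail at once, which distinguishes a shared-dependency outage (like this DNS failure) from an isolated service issue. Thresholds are illustrative:

```python
from collections import defaultdict

def detect_multi_service_failure(events, window_seconds=300, min_services=3):
    """Return window start times where >= min_services distinct services
    reported failures within the same window.

    events: iterable of (timestamp_seconds, service_name) pairs, e.g.
    drawn from CloudWatch alarm transitions or SNS notifications.
    """
    windows = defaultdict(set)
    for ts, service in events:
        windows[int(ts // window_seconds)].add(service)
    return [w * window_seconds
            for w, services in sorted(windows.items())
            if len(services) >= min_services]
```

An alert on any flagged window would have fired within minutes of the 23:49 error-rate increase, well before the 00:26 DNS diagnosis.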
For SRE Teams
Pre-Provision Failover Resources
Resources should already exist before an outage:
- DNS records and hosted zones
- S3 buckets for static content
- Load balancers in multiple regions
- IAM roles with trust relationships
- VPC endpoints for critical services
Key Takeaways
Control Plane Risk
AWS control plane APIs are not always available. Design for this reality.
US-EAST-1 Dependency
Many “global” services actually depend on US-EAST-1. This is a known architecture constraint.
Multi-Region Required
Single-region deployments are inherently fragile. Multi-region is not optional for critical workloads.
Pre-Provision Resources
Create failover infrastructure before you need it. Recovery is not the time to provision new resources.
References
- AWS Health Dashboard - US-EAST-1 Incident
- AWS Fault Isolation Boundaries Whitepaper
- BBC News Coverage
- Official AWS Event ARN
Disclaimer: This case study is for educational and analytical purposes. All information is based on publicly available sources and official Amazon communications.
Author: Zepher Ashe
License: CC BY-NC-SA 4.0
Last Updated: November 21, 2025