Cloudflare Global Outage
Incident Date: November 18, 2025
Duration: ~6 hours
Root Cause: Malformed bot-management configuration file
Scope: Global, affecting millions of websites
Overview
On November 18, 2025, Cloudflare experienced its most severe global outage since 2019. A malformed, auto-generated configuration file caused Cloudflare’s new Rust-based FL2 proxy to panic, resulting in global 5xx Internal Server Errors across millions of domains. This incident highlights the catastrophic impact that configuration management failures can have on global internet infrastructure, and demonstrates how tightly coupled systems can amplify single points of failure.
Root Cause Analysis
Primary Technical Cause
A ClickHouse permissions update unintentionally allowed duplicate rows to appear in a query used to generate Cloudflare’s Bot Management feature file.
Permission Change
A ClickHouse database permissions update was made, likely for routine access control purposes. This change was classified as low-risk internal work.
Duplicate Data Introduced
The permission change inadvertently allowed a query to return duplicate rows. The query was used to generate the bot-management feature file.
File Size Exceeded Limit
The bot-management feature file normally contained fewer than 200 features. Duplicated entries doubled the file size, exceeding a hard-coded limit in the FL2 proxy.
Proxy Panic
When FL2 attempted to load the oversized file:
- The proxy encountered an unexpected state
- A `Result::unwrap()` call on an `Err` variant caused a panic
- Proxies entered a restart loop, unable to serve traffic
The Technical Details
Why Result::unwrap() Failed
In Rust, `Result<T, E>` represents either success (`Ok(T)`) or failure (`Err(E)`). The `.unwrap()` method extracts the success value but panics if called on an error.
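A minimal sketch of the failure mode (the limit, identifiers, and function names here are illustrative assumptions, not Cloudflare’s actual FL2 code): explicitly matching on the `Result` lets the process survive a bad file, whereas `.unwrap()` kills it.

```rust
// Hypothetical sketch; MAX_FEATURES and load_features are illustrative,
// not Cloudflare's actual FL2 code.
const MAX_FEATURES: usize = 200;

fn load_features(entries: &[&str]) -> Result<Vec<String>, String> {
    if entries.len() > MAX_FEATURES {
        return Err(format!(
            "feature count {} exceeds hard-coded limit {}",
            entries.len(),
            MAX_FEATURES
        ));
    }
    Ok(entries.iter().map(|s| s.to_string()).collect())
}

fn main() {
    // Duplicated rows doubled the entry count past the limit.
    let oversized = vec!["feature"; 400];

    // Explicit handling keeps the process alive on a bad file.
    match load_features(&oversized) {
        Ok(f) => println!("loaded {} features", f.len()),
        Err(e) => eprintln!("rejected config, keeping last good one: {}", e),
    }

    // What the incident code effectively did instead:
    // load_features(&oversized).unwrap(); // panics on Err, killing the proxy
}
```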
Why Duplicates Weren't Detected
The bot-management pipeline lacked several critical safeguards:
- No schema validation on the generated file
- No duplicate detection in the query results
- No file size monitoring or alerting
- No checksum verification before propagation
- No pre-deployment validation gates
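A hedged sketch of what a pre-propagation gate covering the first two missing safeguards could look like (function names and limits are assumptions, not Cloudflare’s pipeline):

```rust
use std::collections::HashSet;

// Illustrative pre-propagation gate; names and limits are assumptions.
fn validate_feature_file(features: &[String], max_count: usize) -> Result<(), String> {
    // Size guard: catch unexpected growth before it reaches any proxy.
    if features.len() > max_count {
        return Err(format!("feature count {} exceeds {}", features.len(), max_count));
    }
    let mut seen = HashSet::new();
    for f in features {
        // Minimal schema guard: reject empty names.
        if f.is_empty() {
            return Err("empty feature name".to_string());
        }
        // Duplicate guard: the incident's duplicated query rows would fail here.
        if !seen.insert(f.as_str()) {
            return Err(format!("duplicate feature: {}", f));
        }
    }
    Ok(())
}

fn main() {
    let good = vec!["f1".to_string(), "f2".to_string()];
    let dup = vec!["f1".to_string(), "f1".to_string()];
    assert!(validate_feature_file(&good, 200).is_ok());
    assert!(validate_feature_file(&dup, 200).is_err());
    println!("validation gate behaves as expected");
}
```

A gate like this runs where the file is generated, so a bad query result is rejected before propagation ever starts.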
Propagation Mechanism
Configuration files propagated to edge nodes on a fixed schedule:
- Interval: Every ~5 minutes
- No validation: Files pushed without runtime verification
- No canary: All nodes received updates simultaneously
- Mixed state: Some nodes had good config, others had bad
Why It Cascaded Globally
The outage cascaded because:
- Over-coupled components: Bot-management → proxy → Workers → Dashboard formed a single failure domain
- Automated propagation: Bad config spread before detection
- Internal tool dependency: Engineers couldn’t access Dashboard (needed Turnstile, which was behind failing proxies)
- No graceful degradation: Proxy panic caused complete failure, not reduced functionality
Timeline
All times are in UTC (Coordinated Universal Time)
| Time | Event |
|---|---|
| 11:05 | ClickHouse permissions changed → duplicates introduced in query |
| 11:28 | Initial global 5xx errors detected across edge network |
| 11:32 | Error rate escalates; monitoring systems trigger alerts |
| 12:00 | Outage severity increases; bot-management file continues propagation |
| 13:05 | Cloudflare applies initial mitigations (bypass Workers KV & Access) |
| 14:24 | Configuration propagation halted; investigation continues |
| 14:30 | Known-good configuration identified and deployment begins |
| 17:06 | Full global restoration declared |
Impact Assessment
Global Internet Impact
High-Profile Services Affected
AI & Social Media
ChatGPT (OpenAI)
- Chat interface unavailable
- API requests failing
- Timeline loading errors
- Image/video upload failures
Cryptocurrency
Multiple Exchanges
- Trading interfaces down
- Price data unavailable
- Deposit/withdrawal blocked
SaaS Platforms
Business Applications
- CRM systems
- Collaboration tools
- Project management
- Customer support portals
Mobile Applications
API-Dependent Apps
- News aggregators
- Gaming platforms
- Social media clients
- Content delivery
Cloudflare Internal Impact
Dashboard Login Failed
The Cloudflare Dashboard authentication relies on Turnstile (Cloudflare’s CAPTCHA replacement), which sits behind the same FL2 proxy layer that was failing.
Impact:
- Engineers couldn’t log in to management interface
- Configuration changes had to be made through alternate paths
- Internal tooling access significantly delayed
- Coordination difficulties during incident response
Workers KV Unavailable
Workers KV requests were served through the same failing proxy layer, so dependent services could not read or write KV data until it was bypassed during mitigation.
Mixed Configuration States
Different edge nodes had different configuration versions:
- Some nodes had the “good” file
- Others had the faulty one
- User experience varied by edge location
- Debugging complicated by inconsistent state
- Rollback required careful sequencing
Contributing Factors
Hard-Coded Feature Limit
The FL2 proxy enforced a hard-coded limit on the number of bot-management features; when the duplicated file exceeded it, the proxy took an unrecoverable error path instead of falling back to a safe default.
Lack of Configuration Validation
The feature file generation pipeline had no:
- Size limits: No monitoring of file growth
- Duplicate detection: Duplicate entries not flagged
- Schema validation: No enforcement of expected structure
- Checksum verification: No integrity checks before deployment
- Pre-deployment testing: No staging environment validation
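For the checksum step specifically, one possible shape of an integrity check (a sketch only; a real pipeline would use a cryptographic hash such as SHA-256 rather than std’s `DefaultHasher`, and the function names here are assumptions):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch only: real pipelines should use a cryptographic hash (e.g. SHA-256),
// not DefaultHasher, which is neither stable across builds nor collision-resistant.
fn checksum(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn verify_before_deploy(config: &[u8], expected: u64) -> Result<(), String> {
    let actual = checksum(config);
    if actual != expected {
        return Err(format!("checksum mismatch: {:x} != {:x}", actual, expected));
    }
    Ok(())
}

fn main() {
    let config = b"bot_features_v1";
    let expected = checksum(config);
    assert!(verify_before_deploy(config, expected).is_ok());
    assert!(verify_before_deploy(b"corrupted_file", expected).is_err());
    println!("integrity check passes for the good file only");
}
```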
Automated, Timed Propagation
Fixed-Schedule Propagation (the current, problematic approach):
- Every 5 minutes, automatically push new config
- No validation before propagation
- All nodes receive update simultaneously
- No rollback mechanism
Over-Coupled System Components
Bot-management → proxy → Workers → Dashboard formed a single blast radius. Notice the circular dependency: Dashboard → Turnstile → Proxy. When the proxy failed, engineers couldn’t access the Dashboard to fix the issue.
Recommendations:
- Provide alternate access path to Dashboard (direct origin, VPN)
- Don’t put internal tooling behind the same infrastructure as customer traffic
- Implement circuit breakers to prevent cascading failures
- Design for graceful degradation (disable bot-management if config fails)
Internal Change Misclassified as Low-Risk
The ClickHouse permissions change was likely:
- Classified as low-risk routine maintenance
- Not subject to full change control process
- Not tested in staging environment
- Approved quickly without deep review
Mitigation & Recovery Actions
Cloudflare’s Response
Halt Propagation
First priority: stop the bad configuration from spreading further
- Disabled automatic config propagation
- Prevented new updates from reaching edge nodes
- Bought time for investigation
Identify Good Configuration
Located the last known good feature file
- Reviewed configuration history
- Identified version before duplicates
- Validated file integrity
Deploy Known-Good Configuration
Pushed corrected configuration globally
- Used emergency deployment process
- Bypassed normal propagation schedule
- Prioritized critical edge locations
Bypass Failing Components
Temporarily disabled dependencies to restore access
- Bypassed Workers KV for internal tools
- Created alternate Dashboard access path
- Enabled emergency authentication bypass
Restart Proxy Fleet
Restarted FL2 proxies with verified configuration
- Sequenced restarts to prevent traffic spikes
- Monitored error rates during restart
- Verified health checks passing
Residual Behavior
- Latency fluctuations persisted briefly as caches warmed up
- Dependent services recovered at different speeds based on node propagation order
- Some DNS caches held stale failure information temporarily
- Customer applications needed to retry failed requests
Lessons Learned
1. Treat All Config Surfaces as High-Risk
Requirements for Configuration Systems:
Schema Validation
- Define strict schemas for all configs
- Validate before generation
- Enforce data types and constraints
- Reject invalid configurations
Size Limits
- Monitor file sizes
- Alert on unexpected growth
- Enforce maximum limits
- Track historical trends
Duplication Detection
- Check for duplicate entries
- Validate query result uniqueness
- Alert on data anomalies
- Log generation statistics
Checksums
- Generate checksums for all configs
- Verify integrity before deployment
- Compare against expected values
- Track checksum history
2. Fallback Paths Prevent Catastrophic Failure
A bot-management failure should degrade service, not collapse core traffic.
Graceful Degradation Pattern
- Never panic on config errors
- Always have safe defaults
- Log errors for investigation
- Degrade functionality, don’t fail completely
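The pattern above could be sketched as follows (types and names are assumptions, not Cloudflare’s code): on a bad file, log the error and keep serving with the last known-good configuration instead of panicking.

```rust
// Illustrative graceful-degradation pattern; types and names are assumptions.
struct BotConfig {
    features: Vec<String>,
}

fn parse_config(raw: &str) -> Result<BotConfig, String> {
    if raw.is_empty() {
        return Err("empty config".to_string());
    }
    Ok(BotConfig {
        features: raw.split(',').map(|s| s.to_string()).collect(),
    })
}

// Never unwrap: on a bad file, log the error and keep serving with the
// last known-good configuration instead of crashing the proxy.
fn apply_config(raw: &str, last_good: BotConfig) -> BotConfig {
    match parse_config(raw) {
        Ok(cfg) => cfg,
        Err(e) => {
            eprintln!("config error, degrading to last good config: {}", e);
            last_good
        }
    }
}

fn main() {
    let last_good = BotConfig { features: vec!["f1".to_string()] };
    let active = apply_config("", last_good); // bad input: old config retained
    println!("serving with {} features", active.features.len());
}
```

The key design choice is that `apply_config` always returns a usable configuration: the worst outcome of a bad file is stale bot-management data, not a dead proxy.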
3. Propagation Must Be Controlled and Observable
Canary Nodes
Deploy to small subset first:
- 1-5% of fleet initially
- Geographically diverse
- Mix of high and low traffic nodes
- Isolated from production critical paths
Validation Gates
Automatic quality checks:
- Error rate within baseline ±10%
- Latency P95 not increased >20%
- No proxy panics or restarts
- Health checks passing
Automatic Rollback
Trigger rollback if:
- Error rate exceeds threshold
- Proxy restart loop detected
- Health checks fail
- Manual emergency stop
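The canary gates and rollback triggers above can be combined into one check, sketched here with the thresholds from the text (struct and field names are assumptions):

```rust
// Illustrative canary gate using the thresholds above; names are assumptions.
#[derive(Clone, Copy)]
struct CanaryMetrics {
    error_rate: f64,          // fraction of 5xx responses on canary nodes
    baseline_error_rate: f64,
    p95_latency_ms: f64,
    baseline_p95_ms: f64,
    panics_detected: bool,    // any proxy panic or restart loop
}

fn should_rollback(m: &CanaryMetrics) -> bool {
    // Error rate more than 10% above baseline.
    let errors_bad = m.error_rate > m.baseline_error_rate * 1.10;
    // P95 latency increased by more than 20%.
    let latency_bad = m.p95_latency_ms > m.baseline_p95_ms * 1.20;
    // A panic or restart loop is an immediate trigger.
    errors_bad || latency_bad || m.panics_detected
}

fn main() {
    let healthy = CanaryMetrics {
        error_rate: 0.010,
        baseline_error_rate: 0.010,
        p95_latency_ms: 100.0,
        baseline_p95_ms: 100.0,
        panics_detected: false,
    };
    assert!(!should_rollback(&healthy));

    // A restart loop like the FL2 panic would trip the gate immediately.
    let panicking = CanaryMetrics { panics_detected: true, ..healthy };
    assert!(should_rollback(&panicking));
    println!("rollback gate triggers on panics");
}
```

Had a gate like this guarded a 1–5% canary, the faulty feature file would have tripped the panic trigger on the first batch of nodes rather than the whole fleet.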
4. Monitor Assumptions—Not Just Metrics
Alert On:
- File Size Anomalies
- Duplicate Feature Count
- Proxy Panic Loops
- Blast Radius Changes
5. Internal Changes Need Staging and Risk Assessment
Security and metadata changes can create unintended side effects. The ClickHouse permissions change was “internal” but had global customer impact.
| Change Type | Risk Level | Required Actions |
|---|---|---|
| Database permissions | 🟡 Medium | Staging test, query impact analysis |
| Schema changes | 🔴 High | Full regression testing, rollback plan |
| Config generation logic | 🔴 High | Canary deployment, monitoring |
| Proxy code changes | 🔴 Critical | Multi-stage rollout, feature flags |
6. Access Paths Must Survive Outages
Solutions:
- Alternate Access Paths
- Emergency Bypass
- Tooling Independence
- Direct origin access (bypass CDN)
- VPN to internal management network
- Out-of-band management interfaces
- SSH/console access to core systems
Key Takeaways
Configuration is Code
Treat all configuration with the same rigor as application code: version control, testing, validation, and gradual rollout.
Graceful Degradation
Systems should degrade gracefully, not fail catastrophically. A bot-management error shouldn’t bring down the entire proxy.
Decoupling is Critical
Avoid circular dependencies between infrastructure components. Engineers must be able to access tools during outages.
Internal Changes are Risky
“Internal” changes can have external impact. Database permissions, schema changes, and metadata updates need full change control.
Disclaimer: This case study is for educational and analytical purposes. All information is based on publicly available sources and official Cloudflare communications.
Author: Zepher Ashe
License: CC BY-NC-SA 4.0
Last Updated: November 21, 2025