The Cloudflare Outage of November 18, 2025: A Deep Dive into What Went Wrong and Why
On November 18, 2025, Cloudflare experienced its worst outage since 2019, disrupting core network traffic for millions of websites. This comprehensive analysis explores the root cause, technical breakdown, impact, and critical lessons learned from this infrastructure failure.
Codence Studio Team
Infrastructure & DevOps Experts
Published
November 20, 2025
Quick Summary
- Duration: 5 hours 46 minutes (11:20 UTC - 17:06 UTC)
- Impact: Millions of websites globally, including X, ChatGPT, Discord, and major gaming platforms
- Root Cause: Database permission change caused duplicate entries in Bot Management feature file, exceeding 200-feature limit
- Key Lesson: Even internally generated configuration files need validation, and error handling must be graceful, not catastrophic
On November 18, 2025, at 11:20 UTC, the internet experienced one of its most significant infrastructure failures in recent years. Cloudflare's network—which protects and accelerates approximately 20% of the internet—began experiencing catastrophic failures that disrupted core traffic delivery across millions of websites worldwide. This wasn't a cyber attack, a DDoS assault, or a malicious breach. It was something far more insidious: a cascading failure triggered by a seemingly minor database permission change that exposed critical vulnerabilities in Cloudflare's configuration management system.
Within minutes, major platforms like X (Twitter), ChatGPT, Discord, Spotify, and countless gaming services became inaccessible. Users worldwide saw HTTP 500 error pages instead of their favorite websites. Even Downdetector—the site that tracks outages—went offline because it, too, relies on Cloudflare.
This incident serves as a stark reminder of how complex distributed systems can fail in unexpected ways, and why robust error handling, input validation, and graceful degradation are essential for modern infrastructure. In this comprehensive analysis, we'll explore exactly what happened, why it happened, and what we can all learn from this critical infrastructure failure.
The Incident: What Happened?
The outage began when Cloudflare's network started returning HTTP 5xx error codes to users attempting to access websites protected by their CDN and security services. The error page displayed to users indicated a failure within Cloudflare's network itself, not the origin servers. This was Cloudflare's worst outage since 2019, affecting core traffic routing—something that hadn't happened in over six years.
The Scale of Impact: By the Numbers
- Millions: websites affected globally
- ~20%: of internet traffic handled by Cloudflare
- 5h 46m: total outage duration
- 6+ years: since the last major core traffic outage
Key Timeline
- 11:20 UTC: First HTTP 5xx errors as the bad feature file propagates across the network
- 11:32-13:05 UTC: Team investigates a suspected large-scale DDoS attack
- 13:04 UTC: Workers KV patched to bypass the core proxy
- 13:37 UTC: Bot Management configuration file identified as the trigger
- 14:24 UTC: Creation and propagation of new bad configuration files stopped
- 14:30 UTC: Known-good file deployed and the core proxy restarted; core traffic largely restored
- 17:06 UTC: All services fully operational
Root Cause Analysis: The Technical Breakdown
The Cascade of Failures: A Simple Explanation
Think of Cloudflare's network like a massive security checkpoint at an airport. Every request (like a passenger) needs to be checked before it can reach its destination (the website). The Bot Management system is like the security scanner that determines if a request is legitimate or potentially malicious.
The outage was triggered by a seemingly innocent change: Cloudflare updated database permissions in their ClickHouse database cluster. This was part of routine maintenance to improve security, but it had an unintended side effect. The database query that generates the Bot Management "feature file" (think of it as the security scanner's rulebook) began outputting duplicate entries.
What is a Feature File?
The Bot Management system uses machine learning to detect bots. A "feature file" is like a rulebook that tells the ML model what to look for—things like request patterns, timing, browser fingerprints, etc. This file is refreshed every five minutes and distributed to all of Cloudflare's servers worldwide to keep up with evolving bot threats.
When the database query started generating duplicates, the feature file doubled in size. Imagine if your security scanner's rulebook suddenly had every rule written twice—it would become too large to process efficiently.
The Size Limit That Broke Everything
Here's where the problem became critical. The software running on Cloudflare's proxy servers (called "FL" for Frontline) had a hard limit: it could only handle 200 features maximum. This limit existed for a good reason—memory is preallocated for features to optimize performance. Think of it like a parking lot with exactly 200 spaces. The system normally used about 60 features, so there was plenty of room.
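To make the limit concrete, here is a minimal Rust sketch (not Cloudflare's actual FL2 code; the names, struct layout, and 200-entry cap are illustrative assumptions) of a preallocated feature table that rejects any configuration larger than the reserved capacity:

```rust
// Hypothetical sketch of a preallocated feature table; names and layout are
// assumptions, not Cloudflare's FL2 implementation.
const FEATURE_LIMIT: usize = 200;

#[allow(dead_code)]
#[derive(Debug)]
struct Feature {
    name: String,
    weight: f64,
}

#[allow(dead_code)]
struct FeatureTable {
    // Capacity is reserved up front so the request hot path never reallocates.
    features: Vec<Feature>,
}

impl FeatureTable {
    fn load(incoming: Vec<Feature>) -> Result<Self, String> {
        // Reject configurations that exceed the preallocated capacity instead
        // of overrunning it (or crashing) at request time.
        if incoming.len() > FEATURE_LIMIT {
            return Err(format!(
                "feature file has {} entries, limit is {}",
                incoming.len(),
                FEATURE_LIMIT
            ));
        }
        let mut features = Vec::with_capacity(FEATURE_LIMIT);
        features.extend(incoming);
        Ok(FeatureTable { features })
    }
}

fn main() {
    // ~60 features is normal operation and fits comfortably.
    let normal: Vec<Feature> = (0..60)
        .map(|i| Feature { name: format!("feature_{i}"), weight: 1.0 })
        .collect();
    assert!(FeatureTable::load(normal).is_ok());

    // Duplicate entries can push the count past the hard limit.
    let oversized: Vec<Feature> = (0..250)
        .map(|i| Feature { name: format!("feature_{}", i % 125), weight: 1.0 })
        .collect();
    assert!(FeatureTable::load(oversized).is_err());
}
```

The limit itself is reasonable; the problem, as the next section shows, is what happens when hitting it is treated as an unrecoverable error.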
The Numbers That Matter
- Normal operation: ~60 features (well below the 200 limit)
- After database change: File doubled in size, exceeding 200 features
- Result: System couldn't process the oversized file
When the oversized file with more than 200 features was distributed to servers, the limit was exceeded. The Rust code in FL2 tried to load the file, hit the limit, and panicked (crashed) with this error:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

This panic (crash) meant the proxy couldn't process any requests. Instead of handling the error gracefully (for example, by logging a warning and falling back to a backup configuration), the entire system failed, returning HTTP 5xx errors to users.
Why This Was So Bad
The code used unwrap(), a Rust method that panics (crashes the program) when called on an error value. In production infrastructure code this is considered an anti-pattern: the error should have been handled gracefully, for example by falling back to the previous configuration or temporarily disabling the feature, rather than crashing the entire proxy.
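To make the contrast concrete, here is a hedged Rust sketch (with made-up names like parse_feature_file and FeatureConfig, not Cloudflare's actual FL2 code) comparing the panicking pattern with a fallback to the last known-good configuration:

```rust
// Made-up names used only to contrast panicking and graceful error handling.
#[derive(Clone, Debug)]
struct FeatureConfig {
    features: Vec<String>,
}

fn parse_feature_file(raw: &str) -> Result<FeatureConfig, String> {
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > 200 {
        return Err(format!("too many features: {}", features.len()));
    }
    Ok(FeatureConfig { features })
}

// Anti-pattern: any parse error takes down the whole worker.
#[allow(dead_code)]
fn reload_panicking(raw: &str) -> FeatureConfig {
    parse_feature_file(raw).unwrap() // panics on Err, as happened in FL2
}

// Safer: keep serving with the previous configuration and report the error.
fn reload_graceful(raw: &str, last_good: &FeatureConfig) -> FeatureConfig {
    match parse_feature_file(raw) {
        Ok(config) => config,
        Err(e) => {
            eprintln!("feature file rejected, keeping last good config: {e}");
            last_good.clone()
        }
    }
}

fn main() {
    let last_good = FeatureConfig { features: vec!["f0".to_string(); 60] };
    let oversized = "f\n".repeat(250);
    // reload_panicking(&oversized) would abort the worker thread; the graceful
    // version degrades by reusing the previous configuration instead.
    let config = reload_graceful(&oversized, &last_good);
    assert_eq!(config.features.len(), 60);
}
```

The graceful version still surfaces the problem (via logging and metrics) but keeps serving traffic while operators investigate.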
Why the Fluctuating Behavior? The Mystery That Delayed Diagnosis
One of the most unusual and frustrating aspects of this incident was the fluctuating behavior. The system would appear to recover, only to fail again minutes later. This made diagnosis extremely difficult and initially led Cloudflare's team to suspect a DDoS attack.
Here's why this happened: The feature file is regenerated every five minutes by a query running on Cloudflare's ClickHouse database cluster. At the time of the incident, the database cluster was being gradually updated with new permissions—not all servers had the new permissions yet.
The Five-Minute Cycle of Chaos
- Every 5 minutes, a query runs to generate the feature file
- If the query runs on a database server with the new permissions → bad file (duplicates) → system fails
- If the query runs on a database server with old permissions → good file → system recovers
- This created a pattern: fail → recover → fail → recover, every 5 minutes
- Eventually, all database servers got the new permissions → system stayed in failed state
This fluctuating pattern made it look like an attack—the system would recover briefly, then fail again. It wasn't until Cloudflare's team realized that the failures correlated with the feature file generation cycle that they identified the root cause.
Real-World Impact: Major Services That Went Down
The outage had a cascading effect that reached far beyond Cloudflare's own services. Because Cloudflare protects and accelerates millions of websites, when their network failed, it took down some of the internet's most popular platforms. Here's what users experienced:
Major Platforms Affected
Social Media & Communication:
- X (formerly Twitter)
- Facebook/Meta services
- Discord
AI, Productivity & Media:
- ChatGPT (OpenAI)
- Claude AI (Anthropic)
- Canva
- Spotify
Gaming Platforms:
- League of Legends
- Fortnite
- Valorant
- Other multiplayer games
Other Services:
- Shopify
- bet365
- Vinted
- Banking apps
- Crypto platforms
- Public transit apps (NJ Transit, SNCF)
Even Downdetector, the website that tracks outages, went offline because it relies on Cloudflare!
What Users Saw: The Error Experience
When users tried to access affected websites, they encountered HTTP 500 Internal Server Error pages. These errors indicated that Cloudflare's network itself was failing, not the origin servers behind it. For millions of users worldwide, this meant:
- Unable to load websites or web applications
- Login failures and authentication errors
- Gaming servers becoming unreachable
- API failures affecting mobile apps
- E-commerce sites unable to process transactions
- Social media platforms becoming inaccessible
Impact Assessment: Cloudflare Services Breakdown
Beyond the customer-facing websites, Cloudflare's own services were also severely impacted:
Core CDN and Security Services
HTTP 5xx status codes were returned to end users, making websites inaccessible. The error page shown at the top of this post was the typical experience for users during the incident.
Cloudflare Turnstile
Turnstile, Cloudflare's CAPTCHA alternative, failed to load, preventing users from completing verification challenges.
Workers KV
Workers KV returned significantly elevated HTTP 5xx errors as requests to KV's front-end gateway failed due to the core proxy failure. A patch at 13:04 allowed Workers KV to bypass the core proxy, reducing downstream impact.
Cloudflare Dashboard
While the dashboard was mostly operational, most users were unable to log in due to Turnstile being unavailable on the login page. This created a compounding effect where users couldn't access their dashboards to check status or make configuration changes.
Cloudflare Access
Authentication failures were widespread, beginning at the start of the incident and continuing until the rollback was initiated. Every failed authentication attempt resulted in an error page, so affected users never reached their target applications.
Email Security
While email processing and delivery were unaffected, there was a temporary loss of access to an IP reputation source, which reduced spam-detection accuracy and prevented some new-domain-age detections from triggering.
Understanding the Technical Chain of Events
To understand why this outage was so severe, let's break down the technical chain of events in simple terms:
The Failure Chain (Step by Step)
Database Permission Change
Cloudflare updates permissions on ClickHouse database cluster (routine maintenance)
Query Behavior Changes
Database query starts generating duplicate entries in feature file
File Size Doubles
Feature file grows from ~60 features to 200+ features (exceeds limit)
System Panic
Proxy software hits 200-feature limit, panics, and crashes
Global Failure
Millions of websites return HTTP 500 errors to users worldwide
Why This Happened: The Deeper Technical Issues
Lack of Input Validation
One of the critical failures was that Cloudflare's systems didn't validate internally generated configuration files with the same rigor as user-generated input. The feature file was treated as trusted data, even though it was generated dynamically from database queries that could change behavior.
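One way to treat an internally generated file like untrusted input is to run it through explicit checks before it is ever swapped in. A minimal sketch of that kind of validation, assuming a hypothetical one-feature-per-line format and the 200-entry limit described above:

```rust
use std::collections::HashSet;

// Hypothetical validation for an internally generated feature file; the
// one-name-per-line format and the limit of 200 are assumptions.
const MAX_FEATURES: usize = 200;

fn validate_feature_file(raw: &str) -> Result<Vec<String>, String> {
    let mut seen = HashSet::new();
    let mut features = Vec::new();

    for (idx, line) in raw.lines().enumerate() {
        let name = line.trim();
        if name.is_empty() {
            return Err(format!("line {}: empty feature name", idx + 1));
        }
        // Duplicate entries are exactly what inflated the real file.
        if !seen.insert(name.to_owned()) {
            return Err(format!("line {}: duplicate feature '{name}'", idx + 1));
        }
        features.push(name.to_owned());
    }

    if features.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds the limit of {}",
            features.len(),
            MAX_FEATURES
        ));
    }
    Ok(features)
}

fn main() {
    // A clean file passes; a file with duplicates never reaches the proxy.
    assert!(validate_feature_file("ua_score\nreq_rate\n").is_ok());
    assert!(validate_feature_file("ua_score\nua_score\n").is_err());
}
```

Running checks like these at generation time (before distribution) would have stopped the bad file at the source rather than on every proxy server.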
Unhandled Error Conditions
The Rust code in FL2 used unwrap() on a Result type, which panics when encountering an error. This is a common Rust anti-pattern that should be avoided in production code, especially in critical infrastructure. The code should have handled the error gracefully, perhaps by falling back to a previous configuration or disabling the feature rather than crashing the entire proxy.
Insufficient Kill Switches
Cloudflare lacked sufficient global kill switches to quickly disable features when they began causing problems. While they were able to stop the propagation of the bad file, having more granular controls could have allowed them to disable Bot Management entirely while fixing the underlying issue, rather than having to wait for the bad file to be replaced.
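A minimal sketch of such a switch, assuming a hypothetical AtomicBool flag flipped from a control plane (not Cloudflare's actual mechanism):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical global kill switch for a single feature; the names are
// illustrative, not Cloudflare's control plane.
static BOT_MANAGEMENT_ENABLED: AtomicBool = AtomicBool::new(true);

/// Called from a control-plane endpoint or config push; no redeploy needed.
fn set_bot_management(enabled: bool) {
    BOT_MANAGEMENT_ENABLED.store(enabled, Ordering::SeqCst);
}

/// Request path: if the feature is switched off, skip scoring entirely and
/// fall back to a neutral default instead of failing the request.
fn bot_score(request_id: u64) -> Option<u8> {
    if !BOT_MANAGEMENT_ENABLED.load(Ordering::SeqCst) {
        return None; // downstream rules treat "no score" as pass-through
    }
    // ... real scoring using the feature file would happen here ...
    Some((request_id % 100) as u8)
}

fn main() {
    assert!(bot_score(42).is_some());

    // Operator flips the kill switch while the bad feature file is fixed.
    set_bot_management(false);
    assert!(bot_score(42).is_none());
}
```

The key property is that the switch lives outside the code path that is failing, so it can be flipped even when the feature itself is unusable.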
Resource Exhaustion from Error Handling
During the incident, Cloudflare observed significant increases in response latency. This was caused by their debugging and observability systems automatically enhancing uncaught errors with additional debugging information. The large volume of errors overwhelmed system resources, creating a secondary performance impact beyond the initial failures.
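There are many ways to bound this kind of overhead. One minimal sketch (with hypothetical names and limits, not Cloudflare's telemetry pipeline) caps how many expensive, fully enriched error reports are produced per time window and merely counts the rest:

```rust
use std::time::{Duration, Instant};

// Hypothetical budget for expensive error enrichment: at most `max_per_window`
// detailed reports per window; the rest are only counted.
struct ErrorReportBudget {
    window: Duration,
    max_per_window: u32,
    window_start: Instant,
    used: u32,
    dropped: u64,
}

impl ErrorReportBudget {
    fn new(window: Duration, max_per_window: u32) -> Self {
        Self {
            window,
            max_per_window,
            window_start: Instant::now(),
            used: 0,
            dropped: 0,
        }
    }

    /// Returns true if a full (expensive) debug report may be generated now.
    fn allow_detailed_report(&mut self) -> bool {
        if self.window_start.elapsed() >= self.window {
            self.window_start = Instant::now();
            self.used = 0;
        }
        if self.used < self.max_per_window {
            self.used += 1;
            true
        } else {
            self.dropped += 1; // the error is still counted, just not enriched
            false
        }
    }
}

fn main() {
    let mut budget = ErrorReportBudget::new(Duration::from_secs(1), 5);
    // Simulate an error storm: only a handful of errors per window get the
    // expensive debugging treatment.
    let detailed = (0..10_000).filter(|_| budget.allow_detailed_report()).count();
    println!("detailed reports: {detailed}, suppressed: {}", budget.dropped);
}
```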
The Resolution: How Cloudflare Fixed It
The resolution process involved several key steps:
Initial Misdiagnosis (11:32-13:05 UTC)
The team initially suspected a large-scale DDoS attack due to the unusual fluctuating behavior and elevated error rates. This delayed proper diagnosis as they investigated attack patterns rather than internal configuration issues.
Workers KV Bypass (13:04 UTC)
A patch was implemented to allow Workers KV to bypass the core proxy, reducing downstream impact on services dependent on KV. This was a critical mitigation that helped restore some services.
Root Cause Identification (13:37 UTC)
The team correctly identified that the Bot Management configuration file was the trigger for the incident. This breakthrough came after hours of investigation and analysis of the fluctuating error patterns.
Stopping Bad File Generation (14:24 UTC)
Cloudflare stopped the automatic creation and propagation of new Bot Management configuration files. This prevented further bad files from being distributed across the network.
Rollback and Restart (14:24-14:30 UTC)
A known good version of the configuration file was manually inserted into the feature file distribution queue, and the core proxy was forced to restart. This was the critical fix that restored core functionality.
Full Recovery (14:30-17:06 UTC)
Core traffic began flowing normally by 14:30 UTC. The remaining time was spent restarting downstream services and ensuring all systems were fully operational. All services were restored by 17:06 UTC.
Lessons Learned: What We Can All Take Away
1. Trust No Input, Even Internal Input
Configuration files, even when generated internally, should be validated with the same rigor as user input. Size limits, format validation, and sanity checks should be applied to all configuration data, regardless of its source.
2. Handle Errors Gracefully
Using unwrap() or similar panic-inducing patterns in production code is dangerous. Errors should be handled gracefully with fallback mechanisms, circuit breakers, or feature flags that allow systems to degrade gracefully rather than fail catastrophically.
3. Implement Global Kill Switches
Critical features should have global kill switches that can be activated instantly to disable problematic functionality without requiring code deployments or configuration changes. This allows rapid mitigation while root cause analysis continues.
4. Monitor Error Handling Overhead
Observability and debugging systems should have resource limits to prevent them from overwhelming systems during incidents. Error handling should be designed to fail fast and fail safe, not fail loudly.
5. Gradual Rollouts Need Gradual Monitoring
When rolling out changes gradually (like the database permissions update), ensure that monitoring and alerting can detect issues that only affect a subset of the infrastructure. The fluctuating nature of this incident made it harder to diagnose because the system appeared to recover periodically.
6. Test Failure Modes, Not Just Happy Paths
Comprehensive testing should include failure mode analysis. What happens when configuration files are too large? What happens when database queries return unexpected results? These edge cases should be tested and handled explicitly.
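As a sketch of what such tests might look like, here is a simplified, hypothetical loader with unit tests that pin down the oversized-file behavior as an explicit expectation (an error, not a panic). Placed in a Rust library crate, these run with cargo test:

```rust
// Hypothetical loader plus tests for its failure modes; the one-feature-per-line
// format and the 200-feature limit are assumptions made for illustration.
fn load_features(raw: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > 200 {
        return Err(format!("{} features exceeds limit of 200", features.len()));
    }
    Ok(features)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn accepts_normal_sized_file() {
        let raw = "feature\n".repeat(60);
        assert_eq!(load_features(&raw).unwrap().len(), 60);
    }

    #[test]
    fn rejects_oversized_file_instead_of_panicking() {
        // Simulates the duplicated file: well over the 200-feature limit.
        let raw = "feature\n".repeat(250);
        assert!(load_features(&raw).is_err());
    }

    #[test]
    fn handles_empty_input() {
        assert_eq!(load_features("").unwrap().len(), 0);
    }
}
```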
Cloudflare's Commitment to Improvement
Cloudflare has committed to several improvements based on this incident:
- Hardening configuration file ingestion: Treating Cloudflare-generated configuration files with the same validation rigor as user-generated input
- Enabling more global kill switches: Implementing comprehensive feature disable mechanisms for rapid incident response
- Eliminating resource exhaustion: Ensuring that error reports and core dumps cannot overwhelm system resources
- Reviewing failure modes: Conducting thorough reviews of error conditions across all core proxy modules
The Bigger Picture: What This Teaches Us About Internet Infrastructure
This outage highlights a critical reality of the modern internet: we're more interconnected and dependent on centralized services than ever before. When Cloudflare (which protects approximately 20% of the internet) goes down, the impact is felt globally.
Key Statistics About Cloudflare's Reach
- Network size: 300+ cities in 100+ countries
- Traffic volume: Handles trillions of requests per month
- Customer base: Millions of websites and applications
- Market share: One of the largest CDN and security providers
The Centralization Problem
This incident serves as a reminder of the risks of centralization. When a single provider handles such a large portion of internet traffic, their failures become everyone's problem. However, it also demonstrates why companies choose Cloudflare—their scale and reliability are generally exceptional, which is why outages like this are so rare and newsworthy.
What Businesses Can Learn
For businesses relying on third-party infrastructure services, this outage reinforces the importance of:
- Diversification: Consider using multiple CDN providers or having fallback options
- Monitoring: Implement comprehensive monitoring to detect issues quickly
- Communication: Have a plan for communicating with users during outages
- Incident Response: Prepare for scenarios where critical third-party services fail
Key Takeaways: What Every Developer and Business Should Know
Essential Takeaways
Validate All Input
Even internally generated configuration files need validation. Trust no input, regardless of source.
Handle Errors Gracefully
Never use panic-inducing patterns like unwrap() in production. Always have fallback mechanisms.
Implement Kill Switches
Global kill switches allow instant feature disabling without code deployments during incidents.
Monitor Error Overhead
Observability systems need resource limits to prevent overwhelming systems during failures.
Test Failure Modes
Don't just test happy paths. Test edge cases, oversized inputs, and unexpected failures.
Diversify Infrastructure
Avoid single points of failure. Use multiple providers for critical services when possible.
Conclusion: The Importance of Resilience
The Cloudflare outage of November 18, 2025, serves as a powerful reminder that even the most sophisticated infrastructure can fail in unexpected ways. It highlights the importance of:
- Defense in depth: Multiple layers of validation and error handling
- Graceful degradation: Systems that fail safely rather than catastrophically
- Rapid response mechanisms: Kill switches and rollback procedures that can be activated instantly
- Comprehensive testing: Including failure modes and edge cases, not just happy paths
- Transparent communication: Detailed post-mortems that help the entire industry learn and improve
For developers and infrastructure engineers, this incident reinforces the critical importance of robust error handling, input validation, and system design that assumes things will go wrong. The best systems aren't those that never fail—they're those that fail gracefully and recover quickly.
Cloudflare's transparency in sharing this detailed post-mortem is commendable and provides valuable lessons for the entire technology industry. As we build increasingly complex distributed systems, learning from incidents like this helps us all build more resilient infrastructure.
Final Thought
The November 18, 2025 Cloudflare outage wasn't just a technical failure—it was a masterclass in how small changes can have massive consequences in distributed systems. It reminds us that in our interconnected digital world, the reliability of one service can impact millions. As developers, engineers, and business leaders, we must design systems that assume failure is inevitable and build resilience into every layer of our infrastructure.
Frequently Asked Questions
What caused the Cloudflare outage on November 18, 2025?
The outage was caused by a database permission change that led to duplicate entries in Cloudflare's Bot Management feature file. This doubled the file size, exceeding the 200-feature limit in the proxy software, causing it to crash and return HTTP 500 errors.
How long did the Cloudflare outage last?
The outage began at 11:20 UTC on November 18, 2025, and core traffic was largely restored by 14:30 UTC. All systems were fully operational by 17:06 UTC, meaning the total duration was approximately 5 hours and 46 minutes.
Which websites were affected by the Cloudflare outage?
Major platforms affected included X (Twitter), ChatGPT, Facebook, Discord, Spotify, Canva, League of Legends, Fortnite, Shopify, and many other websites using Cloudflare's CDN and security services. The outage impacted millions of websites globally.
Was the Cloudflare outage a cyber attack?
No, Cloudflare confirmed that the outage was not caused by a cyber attack or malicious activity. It was an internal configuration error triggered by a database permission change that had unintended consequences.
What is Cloudflare's Bot Management system?
Bot Management is a machine learning-based system that Cloudflare uses to distinguish between legitimate users and automated bots. It uses a feature file containing rules and patterns that help identify bot behavior, which is updated every five minutes to keep up with evolving threats.
How can businesses protect themselves from similar outages?
Businesses should implement redundancy by using multiple CDN providers, have comprehensive monitoring systems, maintain incident response plans, and ensure their infrastructure can handle failures gracefully. Diversification of critical services is key to resilience.
References: This analysis is based on Cloudflare's official post-mortem published on their blog. We recommend reading the full technical details for a complete understanding of the incident.
Related Topics: Infrastructure & Reliability · Backend Development Services · CI/CD Setup · Cloud Deployment · Monitoring & Logging
Last Updated: November 20, 2025 | Sources: Cloudflare Official Post-Mortem
Need reliable infrastructure for your applications?
Codence Studio specializes in building robust, scalable infrastructure solutions that ensure high availability and performance. Let's discuss how we can help you build resilient systems.
Contact Us