GlobalNews.one
Technology

Cloudflare Outage Highlights Internet's Fragile Backbone: Detailed Postmortem Reveals Root Cause and Lessons Learned

November 20, 2025

On a recent Tuesday, a significant portion of the internet was disrupted when a six-hour outage at Cloudflare, one of the world's leading content delivery networks (CDNs), impacted thousands of websites. The incident was a stark reminder of how centralized much of the internet's infrastructure has become, and of the potential for cascading failures when core services falter. The affected sites included prominent platforms, among them X, formerly Twitter, underscoring how heavily even the largest tech companies depend on CDN providers.

Within hours of resolving the outage, Cloudflare CEO Matthew Prince published a comprehensive postmortem, detailing the root cause and the steps taken to mitigate the issue. The root cause was traced back to a database permissions change in ClickHouse, Cloudflare's data warehouse, intended to improve system security by migrating from a shared system account to individual user accounts. This seemingly innocuous change inadvertently altered the behavior of queries used by Cloudflare's Bot Management module, causing it to fetch a significantly larger set of features than it was designed to handle.
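
To see how a permissions change can inflate a query's result set, consider a hypothetical sketch in Rust (illustrative only, not Cloudflare's actual code or schema): a feature list is built from column metadata rows, and because nothing restricts the rows to a single database, a newly visible duplicate database doubles the output.

```rust
// Hypothetical illustration (not Cloudflare's actual code or schema).
// A feature list is generated from metadata rows of the form
// (database, table, column). Nothing limits the rows to one database,
// so when the querying account gains visibility into a second database
// holding the same tables, every feature appears twice.
fn build_feature_list(rows: &[(&str, &str, &str)]) -> Vec<String> {
    rows.iter()
        // Missing safeguard: e.g. `.filter(|(db, _, _)| *db == "default")`
        // or deduplication by (table, column).
        .map(|(_db, table, column)| format!("{table}.{column}"))
        .collect()
}

fn main() {
    let rows = [
        ("default", "http_requests", "bot_score"), // original metadata
        ("replica", "http_requests", "bot_score"), // newly visible duplicate
    ];
    let features = build_feature_list(&rows);
    println!("{} features: {:?}", features.len(), features); // 2 instead of 1
}
```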

The Bot Management module, responsible for identifying and mitigating malicious bot traffic, had a hard-coded limit of 200 features for performance reasons. When the module attempted to load the unexpectedly large feature set, it triggered a system panic and crashed the affected edge nodes. Because the corrupted configuration file was propagated to edge nodes every five minutes, the number of crashing nodes kept growing. Troubleshooting was complicated by the seemingly random nature of the failures and by a simultaneous, unrelated issue that brought down Cloudflare's status page, leading engineers to initially suspect a coordinated botnet attack.
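
The failure mode itself can be reduced to a few lines. The sketch below is hypothetical Rust, not Cloudflare's implementation: a module that assumes the configuration can never exceed a fixed feature count and treats anything larger as an unrecoverable error, so an oversized file crashes the process rather than being rejected gracefully.

```rust
// Hypothetical sketch of a hard feature limit (not Cloudflare's code).
// The module assumes the configuration can never exceed MAX_FEATURES;
// when an oversized file arrives, the "impossible" branch panics and
// takes the worker process down with it.
const MAX_FEATURES: usize = 200;

fn load_features(lines: Vec<String>) -> Vec<String> {
    assert!(
        lines.len() <= MAX_FEATURES,
        "feature count {} exceeds limit {}",
        lines.len(),
        MAX_FEATURES
    );
    lines
}

fn main() {
    // Simulate a configuration file that doubled in size (e.g. duplicated rows).
    let oversized: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();
    let _features = load_features(oversized); // panics: count 400 exceeds limit 200
}
```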

It took the Cloudflare team 2.5 hours to pinpoint the incorrect configuration file as the source of the outage. Propagation of new files was then stopped, and a corrected configuration file was created and deployed 3.5 hours after the incident began. Cleanup took another 2.5 hours, and the outage was fully resolved roughly six hours after it started. Cloudflare's transparent communication and rapid postmortem stand in contrast to some other large tech companies, such as AWS, which took three days to release even a high-level overview of a recent outage, and whose postmortem lacked granular detail about the underlying cause.

Several key learnings emerged from the Cloudflare incident. First, the value of explicit error logging: had the code that generated the error also logged it, the root cause might have been identified much faster. Second, global database changes carry inherent risk, since even seemingly minor adjustments can have unintended consequences across complex systems. Finally, the incident showed how multiple simultaneous issues can significantly complicate troubleshooting.
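
On the first point, the contrast is easy to show in code. The snippet below reworks the hypothetical limit check from earlier so that the violation is logged with enough context to trace it and the module degrades gracefully instead of panicking; it illustrates the principle rather than Cloudflare's actual fix.

```rust
// Hypothetical alternative to the hard panic: record what went wrong,
// then fall back to a safe behavior (here, keeping only the first
// MAX_FEATURES entries) so the process stays up.
const MAX_FEATURES: usize = 200;

fn load_features_logged(mut lines: Vec<String>) -> Vec<String> {
    if lines.len() > MAX_FEATURES {
        eprintln!(
            "ERROR: feature file has {} entries (limit {}); truncating and continuing",
            lines.len(),
            MAX_FEATURES
        );
        lines.truncate(MAX_FEATURES);
    }
    lines
}

fn main() {
    let oversized: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();
    let features = load_features_logged(oversized);
    println!("loaded {} features", features.len()); // loaded 200 features
}
```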

The incident also raises broader questions about the internet's reliance on CDNs and the potential for single points of failure. CDNs offer major benefits in performance, scalability, and security, but they also introduce a dependency whose failure can have serious consequences. Companies that rely on them should consider strategies for limiting the impact of an outage, such as using multiple CDNs or being able to quickly redirect traffic to their origin servers, despite the added cost and complexity involved.
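
One common shape for such a mitigation is health-check-based failover: probe the primary CDN endpoint and route traffic to a secondary CDN, or directly to the origin, when it stops responding. The sketch below uses placeholder hostnames and a bare TCP connectivity check purely for illustration; a production setup would typically implement this at the DNS or load-balancer layer.

```rust
// Hypothetical failover sketch: try each endpoint in priority order and
// route traffic to the first one that accepts a TCP connection within a
// short timeout. Hostnames are placeholders, not real infrastructure.
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

fn first_healthy(endpoints: &[&str]) -> Option<String> {
    for endpoint in endpoints {
        if let Ok(mut addrs) = endpoint.to_socket_addrs() {
            if let Some(addr) = addrs.next() {
                if TcpStream::connect_timeout(&addr, Duration::from_secs(2)).is_ok() {
                    return Some(endpoint.to_string());
                }
            }
        }
    }
    None
}

fn main() {
    // Primary CDN, secondary CDN, then the origin as a last resort.
    let endpoints = [
        "cdn-primary.example.com:443",
        "cdn-secondary.example.com:443",
        "origin.example.com:443",
    ];
    match first_healthy(&endpoints) {
        Some(e) => println!("routing traffic to {e}"),
        None => eprintln!("no healthy endpoint found"),
    }
}
```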

Downdetector, a service that tracks website and service outages, also went down during the Cloudflare outage, highlighting the interconnectedness of internet infrastructure. The Cloudflare incident serves as a valuable case study for engineers and technology leaders, underscoring the importance of robust error handling, careful change management, and a proactive approach to incident response. Cloudflare's openness in sharing the details of its outage provides a valuable learning opportunity for the entire industry.

Alex Chen

Senior Tech Editor

Covering the latest in consumer electronics and software updates. Obsessed with clean code and cleaner desks.


Read Also

X and Other Online Services Briefly Disrupted Monday Morning, Cause Remains Unclear
Technology
NYT Tech

Users across the globe experienced brief disruptions accessing X and other online services early Monday morning. While the issues appear to have been resolved within a couple of hours, the root cause of the outage remains unknown. Cloudflare, a major internet infrastructure provider, initially reported a minor issue but clarified that it wasn't related to a wider outage affecting its customers.

#X #Twitter
Base44 Launches AI-Powered Backend Platform to Streamline Development
Startups
Product Hunt

Base44, a new backend platform, aims to simplify development in the age of AI by providing tools and infrastructure optimized for artificial intelligence applications. The platform promises to reduce the complexities of backend development, allowing developers to focus on building and deploying AI-driven solutions more efficiently. This could significantly accelerate the development lifecycle for AI products across various industries.

#AI #Infrastructure
Cloudflare Hit by Second Major Outage in Weeks: Global Configuration Changes Blamed
Technology
Pragmatic Engineer

Cloudflare, a leading content delivery network (CDN), experienced its second significant outage in under a month, impacting nearly a third of its HTTP traffic. The root cause, similar to the previous incident, stems from a flawed global configuration change, raising concerns about reliability and the need for backup CDN strategies. This latest incident underscores the inherent risks in rapidly deploying changes across vast, distributed networks.

#Outage #Cloudflare
When the Uptime Monitor Goes Down: How Downdetector's Cloudflare Dependency Highlights a Costly Trade-off
Technology
Pragmatic Engineer

The irony wasn't lost on anyone: during a significant Cloudflare outage in November 2025, Downdetector, the very service meant to track such disruptions, also went dark. This incident exposed Downdetector's reliance on Cloudflare for critical services and sparked a conversation about the practical realities of managing upstream dependencies, especially when balancing cost and performance.

#Outage #DownDetector