Cloudflare's outage, June 21 - A Human Tragedy

Edge is terribly trendy. Move cloudy workloads as close to the user as possible, the thinking goes, and latency goes down, as do core network and data center pressures. It's true – until the routing sleight-of-hand that diverts user requests from the site they think they're getting to the copies on the edge servers breaks.
If that happens, everything goes dark – as it did last week at Cloudflare, edge lords of large chunks of web content. It deployed a Border Gateway Protocol policy update, which promptly took against a new fancy-pants matrix routing system designed to improve reliability. Yeah. They know.
It took some time to fix, too, because in the words of those in the know, engineers "walked over each other's changes" as fresh frantic patches overwrote slightly staler frantic patches, taking out the good they'd done. You'd have thought Cloudflare of all people would be able to handle concepts of dirty data and cache consistency, but hey. They know that too.
What's the lesson? It's not news that people make mistakes, and the more baroque things become, the harder they are to guard against. It's just that what gets advertised over BGP isn't only routes but things crapping out, and when you're Cloudflare that's what the C in CDN becomes. It's not the first time it's happened, nor will it be the last, and one trusts the company will hire a choreographer to prevent further op-on-op stompfests.
Yet if it happens, and keeps happening, why aren't systems more resilient to this sort of problem? You can argue that highly dynamic and structurally fluid routing mechanisms can't be algorithmically or procedurally safeguarded, and that we're always going to live in the zone where the benefits of pushing just a bit too hard for performance are worth the occasional chaotic hour. That's defeatist talk, soldier.
There's another way to protect against the unexpected misfire, other than predicting or excluding. You'll be using it already in different guises, some of which have been around since the dawn of computer time: state snapshotting. No matter what a computing device is doing, it's going from one state to another, and those states absolutely define its functioning. Take a snapshot and you're freezing that state in time; return to that state and it's as if everything that happened after that point never happened. The bad future has been erased. God-like power.
It seems so mundane. The concept is behind ctrl-Z, backups, journaling file systems, auto-save, speculative execution, and much more besides. We almost never think of all these as the same idea, because they're all individual bodges we keep reinventing. Can the concept be generalized, so we can engineer it into our systems as a fundamental property? If we did, what would that look like?
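Strip away the individual bodges and the core is the same two operations everywhere: copy the state out, copy the state back in. A minimal sketch, in Python, of what a generalized version might look like; the names (TimeMachine, get_state, set_state) are illustrative inventions, not any real API:

import copy, time

class TimeMachine:
    # Wraps any component that can export and re-import its own state
    # via get_state()/set_state() (an assumed interface for this sketch).
    def __init__(self, component):
        self.component = component
        self.timeline = []                 # (timestamp, frozen copy of state)

    def snapshot(self):
        state = copy.deepcopy(self.component.get_state())
        self.timeline.append((time.time(), state))

    def rewind(self, index=-1):
        # Return to an earlier snapshot; everything after it never happened.
        _, frozen = self.timeline[index]
        self.component.set_state(copy.deepcopy(frozen))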
When Apple named its backup system Time Machine, it came the closest of any high-profile tech company to making the concept explicit. It's immediately obvious what the state is that's being snapshotted – files, static collections of bits, are nothing but state. Make a copy of this, store it with a label and a timestamp, and you're done.
Journaling file systems are smarter: they understand that a file can change rapidly when it's in use and a cloddish copy every time a byte changes is impractical. So they keep a journaling data structure of changes made between a file opening and closing. If things go wrong, the last copy can have the journal applied and 'Bang, we're back.' Would those Cloudflare engineers have appreciated a "Bang! You're back!" button? Would we? Form a queue.
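In miniature, and assuming a toy journal of (offset, new_bytes) records rather than any real file system's format, recovery is just a replay:

def apply_change(data, change):
    # Overwrite bytes at the given offset; a stand-in for a real journal record.
    offset, new_bytes = change
    return data[:offset] + new_bytes + data[offset + len(new_bytes):]

def recover(last_good_copy, journal):
    data = last_good_copy
    for change in journal:                 # replay every logged change, in order
        data = apply_change(data, change)
    return data                            # bang, we're back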
The same is true of undo mechanisms in editors, where the journaling data is kept locally, which makes stepping back and forward much easier: a journaling file system works across all applications, but doesn't give that level of control. You can pick apart all sorts of state snapshotting and find different mechanisms, depending on what works best for each. In fact, there may even be extra benefits.
Is it actually universally true that state can be retrieved? The universe thinks so, with quantum information theory saying just that – although the cost is another matter. More prosaically, can state in a distributed asynchronous system – looking at you, Cloudflare, looking at you, BGP – be saved?
Turns out it can. The best-known example is the Chandy–Lamport algorithm. It boils down to messages, timing and leaving it to each entity to decide how to save and restore its state, but if you get that right, it's generally applicable.
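The per-process rules fit on a page. Here is a minimal Python sketch of them, assuming FIFO channels and leaving actual message delivery to whatever transport surrounds it; an illustration of the idea, not production code:

MARKER = object()   # sentinel standing in for the snapshot marker message

class SnapshotProcess:
    def __init__(self, incoming, outgoing, send):
        self.incoming = incoming           # channels this process receives on
        self.outgoing = outgoing           # channels this process sends on
        self.send = send                   # callback: send(channel_id, message)
        self.state = {}                    # whatever local state we carry
        self.recorded_state = None         # local state frozen at snapshot time
        self.recording = {}                # channels still being recorded
        self.channel_states = {}           # channel_id -> messages caught in flight

    def start_snapshot(self):
        # Any process may initiate: freeze local state and flood markers.
        self._record_local_state()

    def on_message(self, channel_id, message):
        if message is MARKER:
            if self.recorded_state is None:
                self._record_local_state()     # first marker we have seen
            # This channel's state is whatever arrived between our own snapshot
            # and the marker (empty if the marker itself triggered the snapshot).
            self.channel_states[channel_id] = self.recording.pop(channel_id, [])
        else:
            self.apply(message)                # normal processing continues
            if channel_id in self.recording:   # snapshot taken, marker not yet here
                self.recording[channel_id].append(message)

    def _record_local_state(self):
        self.recorded_state = dict(self.state)
        self.recording = {c: [] for c in self.incoming}
        for c in self.outgoing:
            self.send(c, MARKER)

    def apply(self, message):
        # Placeholder for real message handling; here we just count what we see.
        self.state["seen"] = self.state.get("seen", 0) + 1

    def snapshot_complete(self):
        return (self.recorded_state is not None
                and len(self.channel_states) == len(self.incoming))

Once every process reports snapshot_complete, the recorded local states plus the recorded channel contents together form a consistent global state to roll back to.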
That "getting it right" for things like time and messaging in a distributed system is non-trivial shouldn't put us off. This will never be a snippet of code you can include to add time travel to anything, but it does point the way to the sort of standardized, API-able option that very different but co-operative components could use to provide common functionality. There'll be complexity and resource costs, but why not push the envelope on robustness for a change?
As for unexpected benefits, that's up to us. Imagine a browser with a time dial, where you can skip backwards and forwards along the timeline of your tabbed URLs. We've all hunted uselessly for that long-closed tab – history and bookmarks? Pah – even if we have a good idea of when we last used it. But a browser could just regularly dump state to a time-series database for itself or other tools to use. It'd be fun to see that as a 3D visualization, and besides, it would just be us doing for ourselves what Google does to us anyway.
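A sketch of the plumbing, purely hypothetical since no browser exposes anything like this today: dump the open tabs with a timestamp on a schedule, then look them up by moment.

import time

history = []                               # (timestamp, list of open tab URLs)

def dump_tabs(open_urls):
    history.append((time.time(), list(open_urls)))

def tabs_around(when):
    # Latest snapshot taken at or before the given moment.
    best = []
    for ts, urls in history:
        if ts <= when:
            best = urls
        else:
            break
    return best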
That we don't think this way is a hangover from the days when storage and CPU were too expensive to waste on just-in-case insurance. Those days are gone; we spend both on trivia with the abandon of a Victorian libertine. Yet we actually have the makings of a Doctor Who-style Tardis for our universe of information. Let's start building it. They certainly need one at Cloudflare.

Introduction

Today, June 21, 2022, Cloudflare suffered an outage that affected traffic in 19 of our data centers. Unfortunately, these 19 locations handle a significant proportion of our global traffic. This outage was caused by a change that was part of a long-running project to increase resilience in our busiest locations. A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data center was brought back online and by 07:42 UTC all data centers were online and working correctly.
Depending on your location in the world you may have been unable to access websites and services that rely on Cloudflare. In other locations, Cloudflare continued to operate normally.
We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.

Background

Over the last 18 months, Cloudflare has been working to convert all of our busiest locations to a more flexible and resilient architecture. In this time, we've converted 19 of our data centers to this architecture, internally called Multi-Colo PoP (MCP): Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, Tokyo.
A critical part of this new architecture, which is designed as a Clos network, is an added layer of routing that creates a mesh of connections. This mesh allows us to easily disable and enable parts of the internal network in a data center for maintenance or to deal with a problem. This layer is represented by the spines in the following diagram.
[Diagram: MCP data center designed as a Clos network, with the added routing layer shown as spines]
This new architecture has provided us with significant reliability improvements, as well as allowing us to run maintenance in these locations without disrupting customer traffic. As these locations also carry a significant proportion of the Cloudflare traffic, any problem here can have a very wide impact, and unfortunately, that's what happened today.

Incident timeline and impact

In order to be reachable on the Internet, networks like Cloudflare make use of a protocol called BGP. As part of this protocol, operators define policies which decide which prefixes (a collection of adjacent IP addresses) are advertised to peers (the other networks they connect to), or accepted from peers.
These policies have individual components, which are evaluated sequentially. The end result is that any given prefix will either be advertised or not advertised. A change in policy can mean a previously advertised prefix is no longer advertised, known as being "withdrawn", and those IP addresses will no longer be reachable on the Internet.
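As a toy model (plain Python rather than the syntax used on our routers, and greatly simplified), sequential evaluation means the first matching term with a terminating action decides a prefix's fate:

def evaluate(policy, prefix):
    # policy: an ordered list of terms; each has a match predicate and,
    # optionally, a terminating action ("accept" advertises, "reject" withdraws).
    for term in policy:
        if term["match"](prefix):
            if term.get("action") == "accept":
                return True                    # prefix is advertised
            if term.get("action") == "reject":
                return False                   # prefix is not advertised
            # no terminating action: fall through to the next term
    return False                               # nothing matched: not advertised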
While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes.
Due to this withdrawal, Cloudflare engineers experienced added difficulty in reaching the affected locations to revert the problematic change. We have backup procedures for handling such an event and used them to take control of the affected locations.
03:56 UTC: We deploy the change to our first location. None of our locations are impacted by the change, as these are using our older architecture.
06:17: The change is deployed to our busiest locations, but not the locations with the MCP architecture.
06:27: The rollout reached the MCP-enabled locations, and the change is deployed to our spines. This is when the incident started, as this swiftly took these 19 locations offline.
06:32: Internal Cloudflare incident declared.
06:51: First change made on a router to verify the root cause.
06:58: Root cause found and understood. Work begins to revert the problematic change.
07:42: The last of the reverts has been completed. This was delayed as network engineers walked over each other's changes, reverting the previous reverts, causing the problem to re-appear sporadically.
08:00: Incident closed.
The criticality of these data centers can clearly be seen in the volume of successful HTTP requests we handled globally:
[Chart: volume of successful HTTP requests handled globally during the incident]
Even though these locations are only 4% of our total network, the outage impacted 50% of total requests. The same can be seen in our egress bandwidth:
[Chart: global egress bandwidth during the incident]

Technical description of the error and how it happened

As part of our continued effort to standardize our infrastructure configuration, we were rolling out a change to standardize the BGP communities we attach to a subset of the prefixes we advertise. Specifically, we were adding informational communities to our site-local prefixes.
These prefixes allow our metals to communicate with each other, as well as connect to customer origins. As part of the change procedure at Cloudflare, a Change Request ticket was created, which includes a dry-run of the change, as well as a stepped rollout procedure. Before it was allowed to go out, it was also peer reviewed by multiple engineers. Unfortunately, in this case, the steps weren't small enough to catch the error before it hit all of our spines.
The change looked like this on one of the routers:
[edit policy-options policy-statement 4-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;

[edit policy-options policy-statement 4-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;

[edit policy-options policy-statement 6-COGENT-TRANSIT-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;

[edit policy-options policy-statement 6-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL then]
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add TLL01;
+      community add EUROPE;
This was harmless, and just added some additional information to these prefix advertisements. The change on the spines was the following:
[edit policy-options policy-statement AGGREGATES-OUT]
term 6-DISABLED_PREFIXES { ... }
!    term 6-ADV-TRAFFIC-PREDICTOR { ... }
!    term 4-ADV-TRAFFIC-PREDICTOR { ... }
!    term ADV-FREE { ... }
!    term ADV-PRO { ... }
!    term ADV-BIZ { ... }
!    term ADV-ENT { ... }
!    term ADV-DNS { ... }
!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }

[edit policy-options policy-statement AGGREGATES-OUT term 4-ADV-SITE-LOCALS then]
       community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;

[edit policy-options policy-statement AGGREGATES-OUT term 6-ADV-SITE-LOCALS then]
       community delete NO-EXPORT { ... }
+      community add STATIC-ROUTE;
+      community add SITE-LOCAL-ROUTE;
+      community add AMS07;
+      community add EUROPE;
An initial glance at this diff might give the impression that this change is identical, but unfortunately, that's not the case. If we focus on one part of the diff, it might become clear why:
!    term REJECT-THE-REST { ... }
!    term 4-ADV-SITE-LOCALS { ... }
!    term 6-ADV-SITE-LOCALS { ... }
In this diff format, the exclamation marks in front of the terms indicate a re-ordering of the terms. In this case, multiple terms moved up, and two terms were added to the bottom. Specifically, the 4-ADV-SITE-LOCALS and 6-ADV-SITE-LOCALS terms moved from the top to the bottom. These terms were now behind the REJECT-THE-REST term, and as might be clear from the name, this term is an explicit reject:
term REJECT-THE-REST {
    then reject;
}
As this term is now before the site-local terms, we immediately stopped advertising our site-local prefixes, removing our direct access to all the impacted locations, as well as removing the ability of our servers to reach origin servers.
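Plugging the reordering into the toy model above makes the failure mode plain: with a catch-all reject evaluated ahead of them, the site-local terms are never reached. The prefix match below is hypothetical, purely for illustration.

site_local = lambda p: p.startswith("10.")            # hypothetical site-local match
reordered_policy = [
    {"match": lambda p: True, "action": "reject"},     # REJECT-THE-REST, now evaluated first
    {"match": site_local, "action": "accept"},         # ADV-SITE-LOCALS terms, never reached
]
print(evaluate(reordered_policy, "10.1.2.3"))          # False: the prefix is withdrawn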
On top of the inability to contact origins, the removal of these site-local prefixes also caused our internal load balancing system Multimog (a variation of our Unimog load-balancer) to stop working, as it could no longer forward requests between the servers in our MCPs. This meant that our smaller compute clusters in an MCP received the same amount of traffic as our largest clusters, causing the smaller ones to overload.

Remediation and follow-up steps

This incident had widespread impact, and we take availability very seriously. We have identified several areas of improvement and will continue to work on uncovering any other gaps that could cause a recurrence.
Here is what we are working on immediately:
Process: While the MCP program was designed to improve availability, a procedural gap in how we updated these data centers ultimately caused a broader impact in MCP locations specifically. While we did use a stagger procedure for this change, the stagger policy did not include an MCP data center until the final step. Change procedures and automation need to include MCP-specific test and deploy procedures to ensure there are no unintended consequences.
Architecture: The incorrect router configuration prevented the proper routes from being announced, preventing traffic from flowing properly to our infrastructure. Ultimately the policy statement that caused the incorrect routing advertisement will be redesigned to prevent an unintentional incorrect ordering.
Automation: There are several opportunities in our automation suite that would mitigate some or all of the impact seen from this event. Primarily, we will be concentrating on automation improvements that enforce an improved stagger policy for rollouts of network configuration and provide an automated "commit-confirm" rollback. The former enhancement would have significantly lessened the overall impact, and the latter would have greatly reduced the Time-to-Resolve during the incident.
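In sketch form (illustrative Python, not our actual automation), the two improvements combine into a staggered, self-reverting rollout: deploy site by site, require a health confirmation within a window, and revert and halt if it never arrives.

import time

def staggered_rollout(sites, deploy, revert, healthy, confirm_window=600):
    # sites should be ordered so that a canary, including an MCP location,
    # comes early; deploy/revert/healthy are callbacks supplied by the operator.
    for site in sites:
        deploy(site)
        deadline = time.time() + confirm_window
        while time.time() < deadline:
            if healthy(site):
                break                      # commit confirmed: keep the change, move on
            time.sleep(5)
        else:
            revert(site)                   # window expired: roll back automatically
            raise RuntimeError(f"change reverted at {site}; rollout halted")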

Conclusion

Although Cloudflare has invested significantly in our MCP design to improve service availability, we clearly fell short of our customer expectations with this very painful incident. We are deeply sorry for the disruption to our customers and to all the users who were unable to access Internet properties during the outage. We have already started working on the changes outlined above and will continue our diligence to ensure this cannot happen again.