Cloudflare Outage (21/06/22)
Breakdown

Curious about outages, distributed systems, DevOps, backend and SRE-related stuff.
Another day, another outage, and another chance to learn something as an engineer. In terms of scale, the Cloudflare outage was bigger than the Atlassian outage; this one almost took half of the internet down with it. Discord, Zerodha, Shopify, Amazon Web Services, Twitter, Canva, LeetCode, Omegle, DoorDash, Feedly, Upstox, and the list of impacted companies just goes on. But we are not interested in that; we are only interested in what happened behind the scenes. This outage is a networking nightmare, but I will still try to keep it short and simple. So let's delve into it:
What actually happened
19 of Cloudflare's data centers went down worldwide, and as a cascading effect almost 35 data centers went down or were re-routed. If you want to know more about cascading failures, you can check out my previous article - Google Maps Outage or Complete dissection by Arpit Bhayani. The root cause of the outage is connected to an upgrade project that was meant to increase the resilience of their systems. It was apparently not a security issue; it was an engineering mistake that kept many popular services around the world on hold for more than an hour.
Over the past 1.5 years Cloudflare had been working to make its systems more resilient and flexible, and these 19 data centers were the ones already converted to the new architecture. Earlier, every data center behaved as an individual PoP (point of presence). In the new design, called Multi-Colo PoP (MCP), the parts of a location are connected so that they behave as one; the idea is similar to the Color Pop board game.
In simple words, the new design adds a mesh of connections that acts as another layer of routing; the overall structure is called a Clos network. This mesh lets them disable and enable parts of the internal network for maintenance or when something goes wrong, which is how it was supposed to improve the resilience of the systems. Below is a diagram given by the Cloudflare team to understand the structure. Here all the spines represent this mesh layer.
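To get a feel for why a mesh layer helps, here is a minimal sketch of a two-tier spine-and-leaf fabric, which is one common way to build a Clos-style network. The node names and sizes are made up for illustration and are not Cloudflare's actual topology. Every leaf connects to every spine, so taking one spine out of service should not cut any leaf off.

```python
from collections import deque

def build_fabric(spines, leaves):
    """Connect every leaf to every spine (a two-tier Clos-style mesh)."""
    links = {node: set() for node in spines + leaves}
    for leaf in leaves:
        for spine in spines:
            links[leaf].add(spine)
            links[spine].add(leaf)
    return links

def reachable(links, start, disabled=frozenset()):
    """Breadth-first search from start that never enters a disabled node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in links[node]:
            if neighbour not in seen and neighbour not in disabled:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Hypothetical fabric: three spines, four leaves.
spines = ["spine-1", "spine-2", "spine-3"]
leaves = ["leaf-a", "leaf-b", "leaf-c", "leaf-d"]
fabric = build_fabric(spines, leaves)

# Take one spine out for maintenance and check the leaves can still reach each other.
after_maintenance = reachable(fabric, "leaf-a", disabled={"spine-2"})
all_leaves_connected = set(leaves) <= after_maintenance
print(all_leaves_connected)  # True
```

The flip side is that this extra layer is itself driven by routing configuration, so a bad change there can affect everything that sits behind it.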
They use BGP (Border Gateway Protocol), which is how networks tell each other which IP prefixes they can reach, so that a request from anywhere in the world can find a route to the servers quickly. BGP involves complex policies that decide which IP prefixes get advertised to the rest of the network and which do not. Any change in these policies can upset that decision, so there is a chance that prefixes which used to be advertised are no longer advertised at all. That is what happened here: when they modified the policies, an error caused the prefixes that used to be advertised from those 19 data centers to be withdrawn, so traffic could no longer find them.
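A rough way to picture such a policy is as an ordered list of terms where the first matching term decides the outcome. The sketch below assumes exactly that model; the term names, the `evaluate` and `advertised` helpers, and the prefixes (taken from the documentation ranges) are all hypothetical and are not Cloudflare's real configuration. It shows how swapping the order of two otherwise unchanged terms can stop prefixes from being advertised.

```python
import ipaddress

def evaluate(policy, prefix):
    """Return (action, term_name) from the first term whose network covers the prefix."""
    net = ipaddress.ip_network(prefix)
    for name, match, action in policy:
        if net.subnet_of(ipaddress.ip_network(match)):
            return action, name
    return "reject", "default"

def advertised(policy, prefixes):
    """The subset of prefixes that the policy accepts."""
    return {p for p in prefixes if evaluate(policy, p)[0] == "accept"}

# Made-up prefixes from the documentation ranges.
prefixes = ["192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24"]

# Hypothetical policy: (term name, network to match, action).
original = [
    ("allow-local-prefixes", "192.0.2.0/24", "accept"),
    ("reject-the-rest", "0.0.0.0/0", "reject"),
]
# The same two terms with their order swapped.
reordered = [original[1], original[0]]

print(sorted(advertised(original, prefixes)))   # ['192.0.2.0/25', '192.0.2.128/25']
print(sorted(advertised(reordered, prefixes)))  # []
```

Nothing inside the terms changed between the two policies; only the order did, and that alone is enough to withdraw everything.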
Because the rules for IP prefixes were changed incorrectly in the BGP policies, the site-local prefixes were withdrawn. In simple words, they mistakenly modified the policy in such a way that those 19 data centers could no longer communicate directly with the origin servers. This also led to an overload on their internal load balancer, which meant their smaller computing clusters were receiving almost the same amount of traffic as the larger ones.
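To see why that last part hurts, here is a small back-of-the-envelope sketch comparing a capacity-aware traffic split with an even split. The cluster names, capacities and traffic figure are invented for illustration, and this is only a toy model, not how Cloudflare's load balancer actually works.

```python
def split_weighted(total_rps, capacities):
    """Share traffic in proportion to each cluster's capacity."""
    total_capacity = sum(capacities.values())
    return {name: total_rps * cap / total_capacity for name, cap in capacities.items()}

def split_equal(total_rps, capacities):
    """Share traffic evenly, ignoring capacity."""
    share = total_rps / len(capacities)
    return {name: share for name in capacities}

def utilisation(load, capacities):
    """Load divided by capacity for each cluster (above 1.0 means overloaded)."""
    return {name: load[name] / capacities[name] for name in capacities}

# Made-up capacities in requests per second.
capacities = {"large-1": 4000, "large-2": 4000, "small-1": 1000, "small-2": 1000}
total_rps = 6000

weighted = utilisation(split_weighted(total_rps, capacities), capacities)
equal = utilisation(split_equal(total_rps, capacities), capacities)

overloaded = [name for name, u in equal.items() if u > 1.0]
print(weighted)    # every cluster at roughly 0.6
print(overloaded)  # ['small-1', 'small-2']
```

With the same total traffic, the capacity-aware split keeps every cluster at about 60% while the even split pushes the small clusters to 150%, even though the large ones are mostly idle.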
What they did
The main rule of incident handling is a "no-blame culture": find the root cause and the solution, document it properly, and take precautions so that it does not happen again. Even if it does happen again, structured frameworks should be in place to help the on-call engineers. So let's see what they did:
They caught the error within a short period of time and later re-routed almost all the impacted data centers (around 35 when I checked at 3:30 pm IST).
They had a rollback plan for this kind of incident, so they followed the existing backup mechanism.
What they could have done
They tried this kind of change straight on prod; they should have an environment where they can try it first. For this they could have adopted Moldable Development tools too.
A proper, detailed, real-time status should be visible to customers. I have seen a few companies complaining about the outage on Twitter, whereas it would have been better if they could connect with the Cloudflare team directly and discuss it in detail. Wrote something about it here.
Engineers at a few companies understood that the error came from Cloudflare, so they started re-routing or did a temporary migration. But what about those who took time to understand that the error was in a cloud service they depend on? This small delay impacted their business. To avoid this, Cloudflare should also provide an overall health metric of its systems. Wrote something about it here.
The change in policy, advertisement, and router configuration that led to this was a human error. From next time onwards they can make this an automated process to avoid human error.
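As an example of what such automation could look like, here is a minimal sketch of a pre-deployment guard. It assumes you can get the set of advertised prefixes before and after a proposed change (say, from a dry run in a staging environment) and that you keep a list of prefixes that must never be withdrawn. The function names and the prefixes are hypothetical; this is not a description of Cloudflare's tooling.

```python
def check_policy_change(before, after, critical):
    """Return the critical prefixes that the change would withdraw.

    before / after: sets of prefixes advertised under the old and the new policy.
    critical: prefixes that must never be withdrawn.
    """
    withdrawn = before - after
    return sorted(withdrawn & critical)

def safe_to_deploy(before, after, critical):
    """True only if no critical prefix disappears."""
    return not check_policy_change(before, after, critical)

# Made-up prefixes from the documentation ranges.
before = {"192.0.2.0/25", "192.0.2.128/25", "198.51.100.0/24"}
after = {"198.51.100.0/24"}
critical = {"192.0.2.0/25", "192.0.2.128/25"}

print(check_policy_change(before, after, critical))  # ['192.0.2.0/25', '192.0.2.128/25']
print(safe_to_deploy(before, after, critical))       # False
```

A check like this could run in the deployment pipeline and block the rollout whenever the list it returns is not empty, so a risky change gets a second look before it reaches a real router.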
That's it from my side.
Thanks for reading up to here😊
Source - Cloudflare incident report

