They say that hindsight is 20/20, and the calendar year 2020 is certainly best observed from the rearview mirror. In the midst of a global pandemic, widespread shelter-in-place orders, and economic shutdowns, the Internet has become our most critical lifeline for staying connected and getting business done. Yet the Internet is fraught with performance glitches and hidden dependencies that can take down critical services and applications at any time. While Internet outages can (and do) happen to even the most sophisticated enterprise, there is always a silver lining: a lesson to be learned.
In this blog, I’m going to cover the most disruptive outages of 2020 (dealer’s choice) and discuss what enterprise IT teams can learn from them as we enter the new year.
Most Disruptive Internet Outages of 2020 (Chronological Order)
The year got off to a relatively quiet start, with very few large-scale Internet outages to speak of, despite worries that the Internet might buckle under the unprecedented strain resulting from the pandemic. However, all good things must come to an end, and the Internet’s “good streak” came to an abrupt halt on April 1, when the Russian telecommunications behemoth Rostelecom perpetrated a large-scale BGP hijack involving more than 8,000 prefixes, including ones belonging to Google, Facebook, Akamai, Cloudflare, and Amazon.
While the hijack did not appear to be malicious (the illegitimate announcements were quickly revoked, and their nature suggests a BGP optimizer gone rogue), it still caused widespread impact. Interestingly, the providers that managed to skirt the hijack were RPKI-supporting ISPs such as Telia and NTT. Throughout the incident, they preserved healthy routing through their networks, while providers such as Level 3, Hurricane Electric, and GTT did not.
What can we learn from this? BGP hijacks can happen, intentionally or not, to even the most sophisticated companies, FAANG included. Deploying RPKI as a best practice helps to strengthen BGP security, and as RPKI adoption increases, the impact of BGP route security incidents (whether accidental or malicious) will likely diminish. In the meantime, having visibility into the integrity of routing to one’s services is critical to ensuring both internal and customer security.
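The RPKI mechanism that protected Telia and NTT boils down to route origin validation: a router checks each BGP announcement against signed Route Origin Authorizations (ROAs) and rejects announcements whose origin AS isn’t authorized for the prefix. A minimal sketch of that check follows; the ROA entries, prefixes, and AS numbers are purely illustrative, not real registry data.

```python
import ipaddress

# Hypothetical ROA table: (authorized prefix, maxLength, authorized origin ASN).
# These entries are illustrative examples, not real ROA data.
ROAS = [
    (ipaddress.ip_network("203.0.113.0/24"), 24, 64500),
    (ipaddress.ip_network("198.51.100.0/22"), 24, 64501),
]

def validate_origin(prefix: str, origin_asn: int) -> str:
    """RFC 6811-style route origin validation against a ROA table."""
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_len, asn in ROAS:
        # A ROA "covers" the route if the announced prefix falls within it.
        if announced.subnet_of(roa_prefix):
            covered = True
            # Valid only if the origin matches and the prefix is no more
            # specific than the ROA's maxLength allows.
            if asn == origin_asn and announced.prefixlen <= max_len:
                return "valid"
    return "invalid" if covered else "not-found"

# A hijacker (AS 64666) announcing someone else's covered prefix is rejected:
print(validate_origin("203.0.113.0/24", 64500))  # valid
print(validate_origin("203.0.113.0/24", 64666))  # invalid
print(validate_origin("192.0.2.0/24", 64666))    # not-found
```

A validating ISP drops (or deprioritizes) “invalid” routes, which is why the rogue Rostelecom announcements never propagated through RPKI-enforcing networks.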
IBM Cloud’s outage on June 9 disrupted the reachability of services hosted within the cloud provider’s network. Almost simultaneously, ThousandEyes vantage points around the globe lost the ability to interact with IBM Cloud-hosted services, and the high levels of packet loss recorded indicated that the disruption was network-related.
Interestingly, traffic destined for IBM Cloud was not getting dropped; rather, traffic from the destination back to user locations was either completely or partially dropped within IBM Cloud’s network. IBM Cloud’s public statements pointed to a BGP route leak from one of its peers and, potentially, issues with a third-party networking partner. From our perspective, three telltale factors pointed to a control plane issue: 1) the global impact, 2) the high but not total packet loss, indicating constrained network capacity rather than a hard failure, and 3) the fact that traffic egressing, rather than ingressing, the cloud provider was getting dropped.
What enterprise IT teams should learn from this incident is that even with redundancy measures in place with a cloud provider, the compromise of critical dependencies, such as BGP, can have a catastrophic impact on users. Such is the delicate nature of delivering services over the Internet. Hope for the best, but plan for the worst-case scenario.
Imagine it’s 1995 and your water pipe bursts. You forklift your phonebook out of a closet and find the number of a nearby plumber who can swing by at a moment’s notice to rescue your basement from a watery demise. Well, DNS (the Domain Name System) is the Internet’s phonebook, resolving human-readable domain names to machine-readable IP addresses. And on July 17, Cloudflare, one of the largest DNS providers, experienced a brief outage that massively impacted Internet users attempting to reach not only domains managed by Cloudflare but any domain through its public resolver. (In water speak, the plumber’s not coming.)
While a quickly released post-mortem identified a router configuration error as the root cause of the issue, there are notable lessons we can learn from this incident. Chief among them: modern enterprises and users rely on a complex, often unseen, web of dependencies, any of which can impact the reachability of their services. Understanding these dependencies, and where additional resilience measures are needed, is critical to maintaining business continuity.
Enterprise IT teams require visibility into critical service components, be it DNS infrastructure or CDN providers, along with Internet routing visibility to truly understand architectural risks, as well as the impact of outages on service delivery. Armed with this information, you can not only understand the “why” behind the oft-repeated phrase the “Internet is down” but also put yourself in a position to expedite communication and remediation in the future.
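The resilience measure implied here is resolver redundancy: a client (or stub resolver) configured with multiple recursive resolvers can keep answering queries when one provider goes dark. Below is a deliberately simplified model of that fallback behavior; the resolver entries, zone data, and addresses are illustrative stand-ins, not real infrastructure, and real resolution would involve timeouts rather than a status flag.

```python
from typing import Optional

# Each "resolver" is modeled as a lookup table; an outage is modeled
# by marking the resolver down. All names and addresses are made up.
RESOLVERS = [
    {"name": "primary", "up": False,  # e.g. the provider having an outage
     "zone": {"example.com": "203.0.113.10"}},
    {"name": "secondary", "up": True,
     "zone": {"example.com": "203.0.113.10"}},
]

def resolve(hostname: str) -> Optional[str]:
    """Try each configured resolver in order, skipping ones that are down."""
    for resolver in RESOLVERS:
        if not resolver["up"]:
            continue  # in real life this would be a timeout, not a flag
        address = resolver["zone"].get(hostname)
        if address:
            return address
    return None  # every resolver failed: the "Internet is down" experience

print(resolve("example.com"))  # 203.0.113.10, served by the secondary
```

The design point is simply that a single hard-coded resolver makes a third party’s router misconfiguration your outage too.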
If you’re a business in or around London, you may have noticed the prolonged August 18 outage at Equinix’s LD8 facility in London. A power failure took down both of the facility’s “independent” power sources. The outage lasted the whole day and impacted Equinix’s customers as well as other providers, including exchange operators like LINX, that maintain infrastructure within the facility.
What made this outage especially disruptive was that the IXP interfaces with other large service providers. So we saw, simultaneously, a number of different providers being impacted (such as TeliaNet, Cogent, NTT, and Level 3)—particularly on their infrastructure within the UK. However, unlike other service provider outages that tended to be widespread or global in impact, this outage really was contained to the blast radius of this Equinix power outage.
This example reminds us of the importance of redundancy to your overall resilience profile. You have to think about it from multiple perspectives while architecting your application: not just at the app level, but also at the level of physical infrastructure and peering. While the IXP in this case had redundant power sources, both were compromised in this outage. Certain service providers may be more resilient in the face of infrastructure going down, which is always something to query and consider when you manage and evaluate your vendors.
Then, on August 30, the big one: CenturyLink/Level 3 suffered a control plane failure, lasting nearly 5 hours. As a major global ISP peering with many app providers and enterprises, including Google, Cloudflare and OpenTable, the blast radius of this outage was extremely wide, as CenturyLink/Level 3 effectively terminated a large portion of Internet traffic around the world.
At the peak of the outage, 522 interfaces were impacted, including Level 3’s own interfaces as well as those of other ISPs on their peering connections with Level 3. While the service impact of the outage was total packet loss across CenturyLink’s geographically distributed infrastructure, the cause was reportedly a crippled control plane brought on by the internal propagation of a faulty BGP announcement. In an expanded analysis provided to its customers, CenturyLink indicated that an improperly configured Flowspec rule was part of an unsuccessful effort to block unwanted traffic on behalf of a customer, a routine internal use case for Flowspec.
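To see why a single Flowspec rule can have such a wide blast radius, it helps to remember that a Flowspec rule is just a set of match criteria plus an action, distributed via BGP to every participating router. A rule that omits a criterion matches more traffic everywhere it propagates. The sketch below models that matching logic; the rule fields and sample traffic are illustrative, and real Flowspec (RFC 8955) carries many more match components.

```python
import ipaddress

def matches(rule, packet):
    """A packet matches only if every criterion present in the rule matches."""
    if "dst" in rule and ipaddress.ip_address(packet["dst"]) \
            not in ipaddress.ip_network(rule["dst"]):
        return False
    if "proto" in rule and packet["proto"] != rule["proto"]:
        return False
    if "dst_port" in rule and packet["dst_port"] != rule["dst_port"]:
        return False
    return True

# Intended rule: drop a specific flood aimed at one customer address.
narrow = {"dst": "203.0.113.7/32", "proto": "udp", "dst_port": 53,
          "action": "drop"}

# A mis-scoped rule with the destination omitted matches all UDP/53
# traffic on every router the announcement propagates to.
too_broad = {"proto": "udp", "dst_port": 53, "action": "drop"}

pkt = {"dst": "198.51.100.9", "proto": "udp", "dst_port": 53}
print(matches(narrow, pkt))     # False: different destination, untouched
print(matches(too_broad, pkt))  # True: collateral damage
```

Because the rule rides on BGP, the misconfiguration propagates network-wide in seconds, which is consistent with the sudden, global onset of the CenturyLink/Level 3 outage.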
This outage reminds us that every enterprise is a part of the greater Internet whole, and subject to its collective issues. So understanding your risk factors requires an understanding of who is in your wider circle of dependencies and how their performance and availability could impact your business if something were to go wrong. Enterprises must maintain visibility into the routing, availability, and performance of their critical providers, as external communication on status and root cause can vary widely by provider and is often slow to arrive. When it does, it may be past its usefulness in addressing an issue proactively.
While many of the prior outages were network-related, this next outage reminds us that the complexities of modern applications can sometimes have unintended consequences. On September 28, Microsoft experienced a global service incident that impacted the reachability of nearly all of its applications and services—as well as third-party apps and services that use Azure Active Directory (AAD).
Using the ThousandEyes platform, we observed the incident from vantage points around the globe, confirming not only that Microsoft’s frontend web servers were reachable and unimpeded by network-related issues, but also that status codes and error messages received from Microsoft’s servers indicated internal issues within its Azure AD service—a service that Microsoft later identified to be the source of the disruption.
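The triage reasoning in that observation can be written down explicitly: if a vantage point cannot even open a connection, the failure is network-layer; if it connects but receives server errors, the failure is in the application or its backend. The sketch below is a simplified classifier under those assumptions; the field names and categories are illustrative, not a ThousandEyes API.

```python
def classify(observation):
    """Name the likely failure layer from one vantage point's result."""
    if not observation["tcp_connected"]:
        return "network"      # couldn't even reach the front end
    status = observation["http_status"]
    if status is None:
        return "timeout"      # connected, but no response came back
    if 500 <= status <= 599:
        return "application"  # server reached, but the backend is erroring
    return "healthy"

# During the Azure AD incident, vantage points reached Microsoft's
# frontend servers but received error responses, an application issue:
print(classify({"tcp_connected": True, "http_status": 503}))    # application
print(classify({"tcp_connected": False, "http_status": None}))  # network
```

Running this distinction across many vantage points is what lets you say “the network is fine; the service itself is failing” with confidence, rather than guessing.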
Given the complexity of most enterprises’ heterogeneous service ecosystems, it’s to be expected that things will go wrong. The important thing is to be prepared and have visibility. Your internally developed applications likely have more dependencies than you realize, not only from a delivery standpoint (e.g., CDN and DNS providers) but also backend dependencies that enable aspects of your application (e.g., third-party API services). So understanding how the components of your business-critical applications work together (the critical elements, authentication paths, potential points of failure, and even the performance of each object and interaction) ensures that you can properly manage them: fixing and optimizing what you own, and staying informed and alerted on what you don’t.
Calendar year 2020 will go down in the record books for a number of reasons. While our dependence on the Internet has never been greater, we also witnessed some of the largest Internet outages ever recorded. I hope that this look back at the most disruptive Internet outages of 2020, and the lessons we can learn from them, gives enterprises context when it comes to Internet resilience.
It’s definitely no surprise that the Internet has become the new enterprise network, a shift only accelerated by the pandemic. Yet the Internet is a massive blind spot for most IT organizations. It’s time to change that. Schedule a demo today to see how ThousandEyes gives you the visibility you need to manage the Internet like it’s your own network.