Even the Internet Enjoys a Long Weekend; Plus, Digging Into a Recent CDN Outage


Hosted by Angelique Medina and Archana Kesavan


Watch on YouTube – The Internet Report – Ep. 22: Aug 31 – Sep 6, 2020

This is The Internet Report, where we uncover what’s working and what’s breaking on the Internet, and why. The Internet held up reasonably well over the past week, all things considered. There were no major outages to report, which is a welcome reprieve for those impacted by the major outages the week prior. While it didn’t occur this past week, we did want to spend some time covering the Verizon Edgecast outage that occurred on August 21st. Watch this episode as we dive into this application-level outage to understand exactly what happened and who might have been impacted.

Finally, don’t forget to leave a comment here or on Twitter, tagging @ThousandEyes and using the hashtag #TheInternetReport.

Catch up on past episodes of The Internet Report here.

Listen on Transistor – The Internet Report – Ep. 22: Aug 31 – Sep 6, 2020

ThousandEyes T-shirt Offer

Follow Along with the Transcript

Archana Kesavan:

Welcome to The Internet Report, where we uncover what’s breaking and what’s working on the Internet. My name is Archana Kesavan, and I’m joined by my cohost Angelique Medina.

Angelique Medina:

Hi, everyone.

Archana Kesavan:

In this week’s highlights, it turns out it’s a pretty quiet week. We came out of that CenturyLink outage last Sunday. And as you can see here, it seems like the Internet was doing just okay.

Angelique Medina:

Yeah. I mean, it may have been that in the aftermath of that outage there were a lot of network engineers who probably figured they wanted to have a little bit of downtime and not touch their network too much. So it seems like things were pretty stable. And also bear in mind that we had the holiday weekend in the US, which may have meant that there were fewer changes being made to networks by enterprises and service providers. And so that probably is one reason why we saw a really nice sleepy week in many ways. It was quiet, which is good.

Archana Kesavan:

Which is good. We’ll take it.

Angelique Medina:

No news is good news.

Archana Kesavan:

Yes. And in case you missed the outage analysis that we walked through last week, you can go back and listen to it on our podcast, or read the blog as well. The summary of that is essentially BGP gone bad: an offending BGP Flowspec announcement kind of broke down CenturyLink’s network and thereby the Internet.

Angelique Medina:

Yeah. And there are a lot of really interesting takeaways from that. In spite of the fact that Level 3 was not accepting route changes from its peers, there were things that could have been done by their customers in order to mitigate some of the damage from this. Some of those things are listed in that blog, so definitely check it out, and while you’re reading it or after you read it, click “subscribe” and “follow”.

Archana Kesavan:

Yep. Definitely. Last week, actually, we wanted to cover another outage, which had happened on August 21st. It was a Verizon Edgecast outage. The CenturyLink outage kind of overshadowed it, and we thought it was time well spent talking about CenturyLink, so we’re going to go back in time and do an “under-the-hood” of the Verizon Edgecast outage that happened on August 21st.

Angelique Medina:

Yeah. And this one was interesting because it’s kind of a nice contrast to what we covered last week, which was really specifically a network outage. This is an application-level outage having to do with some issue within Edgecast’s CDN services, so it wasn’t network related. We could see that network paths to the Edgecast edge were just fine. There wasn’t any issue like packet loss or significant latency. And then the flip side of this was that it was so massive in scale, in terms of the number of customers impacted and the services that they announced were impacted, that it clearly had nothing to do with their customers’ origin servers either. So we’ll talk a little bit about that.

Archana Kesavan:

Yeah. We’ll walk through a couple of services that did also show the impact of the outage. And this is something I think we covered in the CenturyLink outage as well: how you were impacted depended on many factors. And this one, as Angelique was pointing out, is a nice contrast to what we saw last week. Now, we don’t have a complete RCA from Verizon as to what exactly transpired, but they did acknowledge the outage. And, as you can see here, it was one that was completely widespread in terms of the locations that were impacted, and different types of CDN services were impacted too: their application, their storage, and edge and all of that.

Angelique Medina:

Yeah. And what’s interesting about that is, they didn’t issue an RCA, but what we saw in monitoring some of their customers’ services was that this was specifically impacting content that either was not cached and had to be fetched from the origin server, or was not locally cached and had to be fetched from some upstream server that might’ve been within their caching network. That is interesting because Verizon principally is a media streaming provider, and so they serve a lot of cached content. So it may have been that during the outage, a lot of the services that use them may not have seen an immediate or really significant effect. But anybody who was using their services for web delivery, which is highly dynamic and oftentimes requires a lot of objects to be fetched from either the origin or other services, probably would have seen that much more immediately. We saw that across all of our tests as a spike at that time.

Archana Kesavan:

Right. It’s interesting, you know, because of the response codes that we were seeing. As you can see here, there was a spike in the 500 response codes, but we did see a lot of 504s, which specifically mean that the upstream from the edge, whether that be an origin or another caching server, was not able to basically serve the request.

Angelique Medina:

Right. That’s right. So the 504 is a gateway timeout error, which, as you indicated, means that the server that you’re requesting the content from is timing out trying to reach some other server to fulfill your request.
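As an editorial aside to the transcript, the difference between these 5xx codes can be sketched in Python with the standard library's http.HTTPStatus. The function name classify_status and the category labels are hypothetical, chosen only for this illustration; they are not something the hosts or ThousandEyes describe.

```python
from http import HTTPStatus


def classify_status(code: int) -> str:
    """Map an HTTP status code to a rough failure category.

    The category labels are assumptions made for this sketch.
    """
    if code == HTTPStatus.GATEWAY_TIMEOUT:  # 504: no timely answer from upstream
        return "upstream timeout"
    if code == HTTPStatus.BAD_GATEWAY:  # 502: upstream answered, but invalidly
        return "invalid upstream response"
    if 500 <= code <= 599:
        return "other server error"
    if 400 <= code <= 499:
        return "client error"
    return "not an error"


# Print the standard reason phrase next to the sketch's category.
for code in (200, 502, 504):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

The point of separating 504 from the rest of the 5xx range is that it implicates something behind the server you contacted, which matches the edge-to-upstream behavior discussed here.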

Archana Kesavan:

Right. Right. And this outage lasted almost two to three hours. That’s the spike that you see here, very specifically in 500 response codes, and only for content served by Edgecast at this point in time.

Angelique Medina:

So 1700 UTC is like around 1:00 PM Eastern time.

Archana Kesavan:

This was on a Friday; the 21st is a Friday. So, again, the impact on Edgecast customers varied as well, depending on cached content vs non-cached content. This is an example where we are actually testing to a DigiCert URL, which is served by Edgecast. And as you can see here, in the time frame specified, this is around 19:30 UTC, which was still well within the outage time frame. You start seeing the outage show up around two hours before that. What you don’t see here, but we did validate, is that the network paths to the Edgecast edge servers were intact. So there were no packet loss or routing issues.

Angelique Medina:

Right. Typically, if there was a network issue, you would see maybe a connect error in some of these errors here and in the status. Instead we’re seeing an HTTP error. So this is an application-level error.
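As an editorial aside, the connect-error versus HTTP-error distinction can be sketched with Python's standard urllib. This is a minimal illustration under stated assumptions, not how ThousandEyes tests work: the names classify_error and probe and the labels "application" and "network" are hypothetical. Note that urllib.error.HTTPError is a subclass of URLError (itself an OSError), so it has to be checked first.

```python
import urllib.error
import urllib.request


def classify_error(err: Exception) -> str:
    """Label a failed fetch as application-level or network-level (hypothetical labels)."""
    # HTTPError means a server answered with an HTTP status such as 504,
    # so the network path to that server worked.
    if isinstance(err, urllib.error.HTTPError):
        return "application"
    # URLError, timeouts, and refused connections are all OSError subclasses:
    # no usable HTTP response was received at all.
    if isinstance(err, OSError):
        return "network"
    return "unknown"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL once and summarize the outcome as a short string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"ok: HTTP {resp.status}"
    except Exception as err:  # broad on purpose; classify_error sorts it out
        detail = getattr(err, "code", err)
        return f"{classify_error(err)} error: {detail}"
```

Calling probe with a URL of your choice returns a string starting with "ok", "application error", or "network error". A timeout while reading the body is lumped in with network errors here, which is a simplification.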

Archana Kesavan:

And very clearly the 504 shows up here. Again, that essentially means there was something disturbing the connectivity to the origin or another server. And the reason we know this wasn’t an issue specific to an origin is the number of services relying on Edgecast that were impacted. It would have been a terrible coincidence if multiple services’ origin servers went down at once. So that was not the case.

Angelique Medina:

Now, in contrast to that, we see some of their customers were unaffected. In this case, this is some element of a service, or a service component, for Twitter. And you can see that, in fact, during the outage, we were getting content served from Edgecast, and we see the HTTP response header showing that the content is available and being served from the edge. So there are no errors being returned as a result of this.

Angelique Medina:

Now what’s interesting too, just in terms of popularity, is that Twitter is probably much more heavily trafficked than the DigiCert domain. So the likelihood of its content being available at more edge servers, including the local server that you’re requesting from, is much higher too. So there could be configuration factors at play, but there could also just be factors in terms of popularity and how densely placed the cached content is, because CDN providers will often purge content from their cache if they’re not getting a lot of requests for a particular domain or particular content, because they need the space for the other content that they’re serving. So that could also be a factor here as well.
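As an editorial aside, the idea that rarely requested content gets pushed out of an edge cache can be illustrated with a least-recently-used (LRU) cache. This is only a sketch: the episode does not say which eviction policy Edgecast or any other CDN uses, and the class name LRUCache and the keys below are hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """A tiny least-recently-used cache; an assumed stand-in for an edge cache."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None  # cache miss: a real edge would go upstream here
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("popular-object", b"a")
cache.put("rare-object", b"b")
cache.get("popular-object")      # frequent requests keep this entry fresh
cache.put("another-object", b"c")  # forces an eviction of the stalest entry
```

In this sketch the rarely requested entry is the one evicted, so the next request for it would need an upstream fetch, which is the kind of request that failed during the outage.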

Archana Kesavan:

Right. Right. I know you’ve been specifically focused on doing some research on CDNs as well, in terms of comparisons and stuff. So, if we have to have a takeaway on whether to cache or not cache, what would your input be there?

Angelique Medina:

Yeah. I mean, there are just a lot of trade-offs, right? For example, let’s just take the root object or the index file. Typically, for most web delivery and most websites, the customer of the CDN provider oftentimes will either not cache the index file, or they’ll put a revalidation setting on it where you have to basically go back to the origin every single time. Now, it looked like some of the services that weren’t impacted were not doing that, so they were being served from Edgecast. But the problem with caching is that you don’t have a lot of control if you want to make changes. If you want to be much more dynamic in terms of the makeup of the page, and you want to change different settings or images or any of that, you have to wait until the cache gets purged, and you might still be serving stale content from the edge. So going back to the origin gives you a lot of control to make changes on the fly. But then the flip side of that is that if something happens with the origin, or if something happens on the network trying to reach the origin, then you could potentially create an issue where you’re not able to serve the page or the majority of the page.
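As an editorial aside, the cache-or-revalidate trade-off can be sketched with standard HTTP Cache-Control directives. This is a deliberately simplified model, not a description of Edgecast's behavior: the function names are hypothetical, only no-store, no-cache, max-age and s-maxage are considered, and a response without an explicit lifetime is assumed to need the origin (real caches may apply heuristics).

```python
from typing import Dict, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header into lowercase directives and optional values."""
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip()] = value.strip().strip('"') or None
    return directives


def edge_needs_origin(header: str, age_seconds: int) -> bool:
    """Return True if a shared cache must contact the origin before answering."""
    d = parse_cache_control(header)
    # no-store: never kept; no-cache: kept, but revalidated on every request.
    if "no-store" in d or "no-cache" in d:
        return True
    # For shared caches, s-maxage takes precedence over max-age.
    lifetime = d.get("s-maxage") or d.get("max-age")
    if lifetime is None or not lifetime.isdigit():
        return True  # assumption in this sketch: no explicit lifetime, ask the origin
    return age_seconds >= int(lifetime)


print(edge_needs_origin("no-cache", 0))              # index file revalidated each time
print(edge_needs_origin("public, max-age=300", 60))  # still fresh at the edge
```

An index file marked no-cache depends on the origin for every request, so an upstream failure shows up immediately, while an object with a long lifetime can keep being served from the edge until it goes stale.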

Archana Kesavan:

You’re basically increasing the number of dependencies because you’re trying to be dynamic. And I think it’s kind of the same thing with moving to the cloud: it has its benefits, but then you are losing control and, to some extent, visibility into what’s happening there, so…

Angelique Medina:

Exactly, exactly. And at the end of the day, it’s really hard to reduce a lot of dependencies because websites today are just so dynamic and there are often components that are being served. You almost can’t avoid, in a lot of instances, having to fetch from the origin or even having to fetch from just a huge number of other services. So there’s already a lot of risk, and a lot of this is really app specific, so it’s kind of important to understand what your trade-offs will be from a performance standpoint and from a resiliency standpoint. So think about the different scenarios: if the origin isn’t available, or if the CDN has some issue, or if some critical services that you rely on for your site or your application are not available, what’s going to happen in that situation?

Archana Kesavan:

So kind of a domino effect and you bring up a great point about resiliency because that is something we saw as well. I believe this was the Twitter example where it was not only front-ended by Edgecast, but by Fastly as well.

Angelique Medina:

Yeah. Well, this is something we also saw with the Level 3 outage too, right? Because there were some companies who had just one provider or some had multiple, and in the case of Twitter, they’re not just relying on one CDN provider; they have at least two.

Archana Kesavan:

Right. It means for that particular widget that we were looking at, they definitely had two CDN providers. And it’s interesting you bring up the CenturyLink outage, because I think we did see a service that was completely reliant on Level 3 as an upstream. We’ve seen that even with DNS, right? Providers basically relying on just one DNS service, when DNS is kind of the start and the end of your service, and it’s surprising that it happens. So, yeah. So this again becomes really critical in terms of factoring in how you build your app and how you handle the many other dependencies that you have today.

Angelique Medina:

Yeah, no, that’s a good point. Just in terms of that, like in the case of the Level 3 issue, when you brought up DNS, one of the things that we were seeing for example, was that even if you had maybe two or more service providers you were using, if your DNS provider was only using or your DNS service was only connected to Level 3, for example, then it didn’t matter that you had multiple service providers for your data center, your customer still wouldn’t have been able to reach you. So you have to think about not only your dependencies, but your dependencies’ dependencies and start kind of going a few levels deep to understand what is the cascading impact of any one of these things failing?
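As an editorial aside, the "dependencies' dependencies" point can be sketched as a small dependency graph. Everything here is hypothetical: the service names, the DEPENDS table, and the helper functions are made up for illustration, and the graph is assumed to have no cycles. Each service lists groups of alternatives, and it stays up only if at least one member of every group is up.

```python
from typing import Dict, List, Set

# Hypothetical example: the site has two ISPs (redundant) but a single DNS
# provider, and that DNS provider is reachable only through isp_a.
DEPENDS: Dict[str, List[Set[str]]] = {
    "site": [{"isp_a", "isp_b"}, {"dns_provider"}],
    "dns_provider": [{"isp_a"}],
    "isp_a": [],
    "isp_b": [],
}


def is_up(service: str, failed: Set[str], deps: Dict[str, List[Set[str]]]) -> bool:
    """A service is up if it has not failed and every group has a working member."""
    if service in failed:
        return False
    return all(
        any(is_up(alt, failed, deps) for alt in group)
        for group in deps.get(service, [])
    )


def single_points_of_failure(root: str, deps: Dict[str, List[Set[str]]]) -> List[str]:
    """List services whose failure alone takes the root service down."""
    return sorted(s for s in deps if s != root and not is_up(root, {s}, deps))


print(single_points_of_failure("site", DEPENDS))
```

In this made-up graph, isp_a turns out to be a single point of failure even though the site itself has a second ISP, because the only DNS provider sits behind it. That mirrors the scenario described above.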

Archana Kesavan:

Right, right. No, that’s very true. Whatever is apparent and visible, obviously keep track of that. But otherwise, in some cases you might have to actually spend some time detecting what these dependencies are, because unfortunately what we’ve seen is that when some of these outages happen, that’s when your dependencies come to light, and that’s not a good place to be. All right. Yeah.

Archana Kesavan:

So, that’s all we have this week. Again, if you are interested in learning more about what’s happening on the Internet, and in some deep dives on outages that we’ve seen, definitely feel free to subscribe to our blog and our podcast as well. And if you want a T-shirt that says “Working Safely From Home,” feel free to email InternetReport@thousandeyes.com with your address and size, and we will get that to you. We will see you next week.

Angelique Medina:

See you next week.
