April 2013 Archives
In reviewing past media and industry blog coverage of Akamai's State of the Internet Report, we find that there has been some confusion about, or misinterpretation of, terms and/or data within the report. In advance of the upcoming publication of the 4th Quarter, 2012 State of the Internet Report, we thought that this would be a good opportunity to provide some additional clarification.
Distribution: What good is it?
You'll recall from Part I that I described a scenario where the London node of a network has become overloaded in a DDoS attack. Now you may wonder: if all the users going to London in that scenario are having problems, what good is distribution? Some of you may have already noticed the benefit I glossed over earlier. In our example with a congested London node, the users in San Jose, Tokyo, Sydney, etc. are all unaffected.
This is great news. Not only does distribution make it less likely that any node will be overwhelmed, but if one is, there should be many others that are not. This limits the damage, because not all users will suffer during a failure.
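To put a rough number on that blast radius, here is a minimal Python sketch. The node names come from the example above, but the user counts are hypothetical, and the sketch assumes users stay pinned to their node (no remapping away from the congested one).

```python
# Hypothetical user counts per node; illustrative only.
users_per_node = {"London": 400, "San Jose": 300, "Tokyo": 200, "Sydney": 100}


def affected_fraction(users: dict[str, int], overwhelmed: set[str]) -> float:
    """Share of all users whose node is overwhelmed, assuming each user
    keeps going to the same node."""
    total = sum(users.values())
    hit = sum(count for node, count in users.items() if node in overwhelmed)
    return hit / total


# One congested node out of four hurts only the users mapped to it.
print(affected_fraction(users_per_node, {"London"}))  # 0.4
# With a single, centralized site, the same failure hits everyone.
print(affected_fraction({"Only site": 1000}, {"Only site"}))  # 1.0
```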
At this point you should start feeling sorry for the poor sods going to the London node. Let's see if we can do anything about that.
Overwhelmed nodes: Any way to avoid user pain?
I have tried to convince you that some node or link will get overwhelmed no matter what. But even if you do not believe my logic, the empirical evidence is clear: this has happened to every CDN. Yes, even Akamai. < I can hear the audible gasps from here. >
If you combine the fact that every CDN has had nodes overwhelmed, and what I said above about users going to the overwhelmed node suffering, the logic seems to say that attacks can and do harm users. Luckily, there is a way to escape this seemingly inevitable fate: Do not send users to overwhelmed nodes.
If only that were as simple as it sounds.
Anycast: How granular is BGP?
Most CDNs use anycast to direct users, either by anycasting their name servers or by anycasting their web servers. Anycast relies on BGP to steer traffic, and BGP is a crude tool, lacking granularity and precise control.
Going back to our London example, if a CDN wants to move traffic off the London node, it has to change something in BGP. If the CDN is anycasting its name servers, chances are all it can do is direct traffic away for entire networks. You cannot tell a network "send your east London users to this node's name server and your west London users to Frankfurt's name server" with BGP. Moreover, unless the CDN has multiple prefixes with different name servers in each, it cannot say "send traffic for Customer A to Frankfurt and traffic for Customer B to Amsterdam."
If the CDN is anycasting its web servers, there might be slightly more flexibility. It is possible to send users to London for some web server addresses and Paris for other web server addresses. However, you can still only direct users by network, not sub-group.
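The granularity problem can be sketched as a small Python model. Everything in it is hypothetical: the network names, the documentation-range prefixes, and the simplification that the CDN picks the fallback node (in reality the receiving network chooses its own next-best route). The point is the shape of the lookup key, (network, prefix), which has no way to name "east London users" inside a network.

```python
# Hypothetical model of anycast steering. The CDN controls which prefix users
# are pointed at and where that prefix is announced; each receiving network
# then picks ONE node per prefix for all of its users.
routes = {
    # (receiving network, anycast prefix) -> node the network routes to
    ("ISP-A", "192.0.2.0/24"): "London",
    ("ISP-A", "198.51.100.0/24"): "London",
    ("ISP-B", "192.0.2.0/24"): "London",
}


def node_for(network: str, prefix: str) -> str:
    """Every user in `network` reaching `prefix` lands on the same node."""
    return routes[(network, prefix)]


def withdraw(node: str, prefix: str, next_best: str) -> None:
    """Stop announcing `prefix` at `node`. Each affected network moves as a
    whole; `next_best` stands in for whatever route that network picks next."""
    for key, current in list(routes.items()):
        if key[1] == prefix and current == node:
            routes[key] = next_best


withdraw("London", "192.0.2.0/24", "Frankfurt")
print(node_for("ISP-A", "192.0.2.0/24"))     # Frankfurt
print(node_for("ISP-B", "192.0.2.0/24"))     # Frankfurt
print(node_for("ISP-A", "198.51.100.0/24"))  # London
```

Withdrawing London for one prefix moves both networks for that prefix, while ISP-A's traffic for the other prefix stays in London. No key in the table singles out a subgroup of one network's users.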
Furthermore, many networks require peering partners to "announce consistently." This forces a CDN to announce exactly the same thing in BGP to a network at every point where the two peer. Without the ability to modify BGP announcements per node, a CDN cannot influence where that network's traffic flows.
Finally, some things in BGP are not black and white. A CDN can remove reachability information, e.g. "you cannot get to web server XYZ in London." But anything short of that, such as "please use Madrid first, and come to London if Madrid is down," is purely a suggestion. The network receiving the BGP announcement is allowed to listen to or ignore any hints provided by the CDN. This means you can say "please use Madrid first, then London" and the peer network might say "no, I'm going to London first." There is nothing the CDN can do other than remove London as a choice completely.
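Why a preference is only a hint can be sketched with a simplified slice of the BGP best-path decision process. In standard BGP, LOCAL_PREF is set by the receiving network and is compared before AS_PATH length and MED, the attributes a CDN can influence (for example, by prepending its own AS number to make the London path look longer). The Python below keeps only those three steps and ignores the rest of the decision process, including the rule that MED is normally compared only between routes from the same neighboring AS. Node names and values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Route:
    node: str          # CDN node this route leads to
    local_pref: int    # set by the RECEIVING network; the CDN cannot control it
    as_path_len: int   # the CDN can lengthen this by prepending (a hint)
    med: int           # the CDN can set this (also only a hint)


def best_path(routes: list[Route]) -> Route:
    """Simplified decision: highest LOCAL_PREF, then shortest AS_PATH,
    then lowest MED."""
    return min(routes, key=lambda r: (-r.local_pref, r.as_path_len, r.med))


# The CDN prepends on London to say "please use Madrid first".
madrid = Route("Madrid", local_pref=100, as_path_len=1, med=0)
london = Route("London", local_pref=100, as_path_len=3, med=0)
print(best_path([madrid, london]).node)  # Madrid: the hint is honored

# The peer raises LOCAL_PREF on its London session; the hint is overridden.
london.local_pref = 200
print(best_path([madrid, london]).node)  # London
```

The only lever that works regardless of the peer's policy is the one described above: withdraw the London route entirely, so it never enters the list passed to `best_path`.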
Now, imagine trying to mitigate a massive attack across multiple networks and multiple nodes when your only tools are hints that might be ignored and the ability to move traffic for whole networks or whole CDN nodes at once, plus the million other details I did not cover.
Yeah, I don't want to think about it either.
Akamai Mapping: Does it use BGP?
Fortunately for Akamai, we do not use BGP to map users to web servers. Akamai's Mapping System can and does notice overwhelmed nodes in seconds and directs users, regardless of their ISP or internal BGP preferences, to other nodes.
Akamai has many ways of finding problem nodes and fixing them. We send probes out from each node, as well as probes into the nodes. And if that were not enough, we track TCP statistics on each node, which gives us telemetry on production traffic to real users. When a node gets overwhelmed, traffic is automatically moved seconds later. Human involvement is neither required nor preferred; people cannot move as fast as computers.
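The detect-and-move step can be sketched in Python. This is not Akamai's Mapping System: the three signals mirror the ones listed above (probes into the node, probes out of it, and TCP statistics from production traffic), but the field names, thresholds, and sample values are hypothetical, and a real system would rerun a check like this every few seconds and weigh its signals far more carefully.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    probe_loss_in: float     # fraction of probes INTO the node that failed
    probe_loss_out: float    # fraction of probes FROM the node that failed
    tcp_retrans_rate: float  # retransmission rate on production TCP sessions


# Hypothetical thresholds, chosen only for illustration.
MAX_PROBE_LOSS = 0.05
MAX_RETRANS_RATE = 0.10


def is_overwhelmed(h: NodeHealth) -> bool:
    """Flag a node if any one signal crosses its threshold."""
    return (
        h.probe_loss_in > MAX_PROBE_LOSS
        or h.probe_loss_out > MAX_PROBE_LOSS
        or h.tcp_retrans_rate > MAX_RETRANS_RATE
    )


def usable_nodes(health: dict[str, NodeHealth]) -> set[str]:
    """Nodes the mapping layer may hand out in its next round of answers."""
    return {node for node, h in health.items() if not is_overwhelmed(h)}


health = {
    "London": NodeHealth(0.30, 0.02, 0.25),  # under attack
    "Frankfurt": NodeHealth(0.00, 0.01, 0.01),
    "Madrid": NodeHealth(0.01, 0.00, 0.02),
}
print(sorted(usable_nodes(health)))  # ['Frankfurt', 'Madrid']
```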
Moreover, Akamai's system is based on DNS, not BGP. We can, and frequently do, direct "east London users to Node A and west London users to Node B" from the same network. Or even "east London users to Node A for Customer Z and west London users to Node B for Customer Y" from the same or multiple networks.
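Because each DNS answer is computed per query, the lookup key can be much finer than a BGP route. The Python sketch below is a hypothetical illustration, not Akamai's implementation: the client groups, hostnames, node names, and preference lists are invented, the client group is assumed to be derived from the address of the resolver sending the query, and the set of overwhelmed nodes is assumed to come from a separate health check.

```python
# Hypothetical mapping table: (client group, customer hostname) -> nodes in
# order of preference.
PREFERENCES = {
    ("east-london", "customer-z.example"): ["Node A", "Node C"],
    ("west-london", "customer-y.example"): ["Node B", "Node C"],
    ("west-london", "customer-z.example"): ["Node B", "Node A"],
}


def answer(client_group: str, hostname: str, overwhelmed: set[str]) -> str:
    """Return the first preferred node that is not overwhelmed."""
    for node in PREFERENCES[(client_group, hostname)]:
        if node not in overwhelmed:
            return node
    raise LookupError(f"no healthy node for {client_group}/{hostname}")


print(answer("east-london", "customer-z.example", set()))       # Node A
print(answer("east-london", "customer-z.example", {"Node A"}))  # Node C
print(answer("west-london", "customer-y.example", {"Node A"}))  # Node B
```

Unlike a BGP route, which applies to a whole network at once, this key separates client groups and customers, so moving one of them off a node does not move the others.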
This means even if an attack takes out one of our nodes, the collateral damage is minimal and very short.
On top of that, Akamai has the most traffic of any CDN. By some estimates, we have nearly as much as all other CDNs combined. Having double-digit terabits per second of outbound traffic means we have to have a lot more than a few hundred Gbps of inbound capacity.
All these things together make not just serving at the edge, but serving at the edge Akamai style, a great way to fight DDoS.
Patrick Gilmore is a Chief Network Architect at Akamai
Akamai's Chief Security Officer Andy Ellis recently commented on large DDoS attacks and how "size" can be misleading. In that post, Andy notes that if you have more than 300 Gbps of ingress capacity, a 300 Gbps attack is not going to hurt you too much.
He's right of course. However, total ingress capacity is only part of the equation. Also important are the type of attacks you're facing and your "ingress" configuration. I'd like to dig a little deeper into these two topics, and explain how a widely distributed infrastructure is useful for both improving performance and mitigating attacks.
Not surprisingly, I've used a generic CDN example to set the stage, but most of the concepts here apply to any large backbone network with many peering and transit links. Because we are talking about CDNs, we should first ask why CDNs push traffic to the "edge", and even before that, what is the "edge"?
Why serve at the edge?
On the Internet, the "edge" usually refers to where end users ("eyeballs") are located. It is not a place separate from the rest of the Internet. In fact, frequently the edge and what most people consider the core are physically a few feet from each other. Topology matters here, not geography.
The reason CDNs want to serve at the edge is that it improves performance. Much has been written about this, so I shan't bore you here. The important thing to realize is that all CDNs distribute traffic. However, when CDNs were invented, the distribution was not designed to mitigate attacks; it just worked out that way.
And it worked out well. Let's see why.
Ingress capacity: How much is enough?
To set the stage further, we are going to discuss a "typical" DDoS (as if there were such a thing!) and possible mitigation strategies, not a specific attack.
The first and most obvious mitigation strategy is what Andy mentioned in his post: have enough ingress capacity to accept the traffic, then drop or filter the attack packets. This raises the question of what "ingress capacity" means. If you have a single pipe to the Internet, making that pipe bigger is the only answer. While possible, that would be very difficult, and very, very expensive, given the size of attacks seen on the 'Net today.
Now, suppose you have many ingress points, such as a CDN with multiple links and nodes. Do you need to ensure each and every point is large enough for the worst-case DDoS scenario? Doing so would be insanely expensive and, frankly, nearly impossible. Not every point on the Internet can support connections into the 100s of Gbps.
Fortunately, the first 'D' in DDoS is "Distributed", meaning the source of a DDoS is not a single location or network, but spread widely. And because of the way the Internet works, different attack sources will typically hit different CDN nodes and/or links.
The chance of a distributed attack all hitting the same node is very, very small. Of course, the opposite holds as well: the chance of an attack spreading perfectly evenly over all points is essentially nil. As such, you cannot just divide the attack by the number of nodes and assume that amount is the maximum bandwidth each node needs to survive an attack. How much more capacity each node needs depends on the exact situation, and it cannot be predicted in advance. This leads us to our next topic.
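A quick simulation shows why dividing the attack by the number of nodes underestimates the load on the busiest one. This Python sketch is illustrative only: the node weights stand in for Internet topology and are invented, and it assumes each attack source sends the same amount of traffic to exactly one node.

```python
import random
from collections import Counter


def per_node_load(total_gbps: float, node_weights: dict[str, float],
                  n_sources: int, seed: int = 0) -> dict[str, float]:
    """Split an attack of `total_gbps` across nodes. Each source lands on one
    node, chosen with probability proportional to that node's weight."""
    rng = random.Random(seed)
    nodes = list(node_weights)
    picks = rng.choices(nodes, weights=[node_weights[n] for n in nodes], k=n_sources)
    per_source = total_gbps / n_sources
    counts = Counter(picks)
    return {n: counts[n] * per_source for n in nodes}


# Hypothetical weights: more attack sources happen to sit "near" London.
weights = {"London": 5, "Frankfurt": 3, "Paris": 2, "Tokyo": 1, "Sydney": 1}
load = per_node_load(300, weights, n_sources=10_000)

even_split = 300 / len(weights)
busiest = max(load, key=load.get)
print(f"even split: {even_split:.0f} Gbps per node")
print(f"busiest: {busiest} at {load[busiest]:.0f} Gbps")
```

With these made-up weights London carries roughly 125 Gbps, about twice the 60 Gbps even split. Changing the weights or the seed moves the numbers, which is the point: the real split is not known in advance.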
Node capacity: How much is enough?
Trying to size each node properly for an attack is an art, not a science. A black art. A black art that is guaranteed to fail at some point.
Of course, everyone still tries. They attempt to estimate what attack capacity is needed based on things like where a node is, what customers are on the system, how much connectivity costs, and several other factors. For instance, a node in Düsseldorf serving a small DSL network exclusively probably does not need as much attack capacity as a large node in Silicon Valley serving several Tier-1 providers combined. Engineers pray to the network gods, sacrifice a router and maybe a switch or two in their names, make a plan, implement it, and... pray some more.
But pray as they might, the plan will fail. Sooner or later, some attack will not fit the estimates, and a node will be overwhelmed with attack traffic. If you are making your own attack plan, don't let this get you down; remember that the same is true for everyone. Not only is it impossible to predict how large an attack is going to be, but, as mentioned above, it is also impossible to predict what percentage of the attack will hit each node. Worse, since unused attack capacity is wasted money, a lot of wasted money, CFOs tend to push for less rather than more, making the plan that much more likely to fail.
The problem with an overloaded node is that, no matter how many nodes you have, any traffic going to the overloaded one will be affected. If your link to London is overwhelmed with attack traffic, it doesn't matter how many nodes you have in Tokyo, Sydney, San Jose, etc.; your users going to the London node are suffering.
As a result, while CFOs push for less, engineers push for more.
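The headroom-versus-cost tension can be explored with a small Monte Carlo sketch in Python. The distributions are pure assumptions chosen for illustration, not measurements: attack sizes follow a heavy-tailed Pareto distribution with a 20 Gbps minimum, and the share of the attack landing on the node is uniform between 0 and 50%. Reusing the same seed means every capacity is tested against the same simulated attacks.

```python
import random


def overload_probability(node_capacity_gbps: float, trials: int = 100_000,
                         seed: int = 1) -> float:
    """Estimate how often one node's attack traffic exceeds its capacity.
    Assumed (illustrative) model: attack size = 20 Gbps * Pareto(alpha=1.2),
    and the fraction of the attack hitting this node ~ Uniform(0, 0.5)."""
    rng = random.Random(seed)
    overloads = 0
    for _ in range(trials):
        attack_gbps = 20 * rng.paretovariate(1.2)  # heavy-tailed attack size
        share = rng.uniform(0.0, 0.5)              # unpredictable per-node share
        if attack_gbps * share > node_capacity_gbps:
            overloads += 1
    return overloads / trials


for capacity in (10, 40, 160):
    print(f"{capacity:>4} Gbps of headroom -> overloaded in "
          f"{overload_probability(capacity):.1%} of simulated attacks")
```

In this toy model the overload rate falls from roughly 45% at 10 Gbps to under 2% at 160 Gbps, but it never reaches zero, and each quadrupling of capacity buys a smaller absolute reduction in risk.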
In Part II, I'll cover in greater detail why a distributed infrastructure, such as the Akamai Intelligent Platform, is ideal for mitigating even the largest of DDoS attacks.
Patrick Gilmore is a Chief Network Architect at Akamai
This partnership brings many benefits to both companies. First, it provides KT with greater network efficiency for all the Akamai global content that gets served within South Korea. Korea has always been a leading-edge broadband market with some of the fastest speeds in the world, and this relationship provides an even faster Internet experience, which certainly benefits our customers as well. Second, KT will be able to have their own CDN based on market-leading technology from Akamai, and sell these services to their regional customers, bringing benefits to content owners, web properties, and enterprises alike. And finally, Akamai receives the opportunity to leverage KT's experienced sales organization to increase the overall market opportunity for our services in the region. It is further evidence of our commitment to involve network service providers in the CDN value chain in a way that best leverages their assets and ours.
Here is a quote from Heekyung Song, SVP of KT's Enterprise IT BU: "The era has come where culture and digital content are leading the market. Our main priority is to ensure access to content from any device, at anytime, anywhere. Through this partnership with Akamai, KT will provide a CDN platform specializing in media delivery, web performance and security so companies can focus on developing quality content and web applications without concerns about delivery."
Here is a great picture capturing the enthusiasm of the joint team responsible for the partnership.