
Serving at the edge: Good for performance, good for mitigating DDoS - Part II

Distribution: What good is it?

You'll recall from Part I that I described a scenario in which the London node of a network has become overloaded during a DDoS attack. Now you may wonder: if all the users being sent to London in that scenario are having problems, what good is distribution? Some of you may have already noticed the benefit I glossed over earlier. In our example with a congested London node, the users in San Jose, Tokyo, Sydney, etc. are all unaffected.

This is great news. Not only does distribution make it more likely that no single node will be overwhelmed, but if one is, there should be plenty of others that are not. This limits the damage, since not all users will suffer during a failure.

At this point you should start feeling sorry for the poor sods going to the London node. Let's see if we can do anything about that.

Overwhelmed nodes: Any way to avoid user pain?

I have tried to convince you that some node or link will get overwhelmed no matter what. But even if you do not believe my logic, the empirical evidence is clear: this has happened to every CDN. Yes, even Akamai. <I can hear the audible gasps from here.>

If you combine the fact that every CDN has had nodes overwhelmed with what I said above about users going to an overwhelmed node suffering, the logic says attacks can and do harm users. Luckily, there is a way to escape this seemingly inevitable fate: do not send users to overwhelmed nodes.

If only that were as simple as it sounds.

Anycast: How granular is BGP?

Most CDNs use anycast to direct users, either by anycasting their name servers or by anycasting their web servers. Anycast leaves the choice of node to BGP routing, and BGP is a crude tool, lacking granularity and precise control.

Going back to our London example, if a CDN wants to move traffic off the London node, it has to change something in BGP. If the CDN is anycasting its name servers, chances are all it can do is redirect traffic for entire networks at a time. You cannot tell a network "send your east London users to this node's name server and your west London users to Frankfurt's name server" with BGP. Moreover, unless the CDN has multiple prefixes with different name servers in each, it cannot say "send traffic for Customer A to Frankfurt and traffic for Customer B to Amsterdam."
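
To make the limitation concrete, here is a toy model in Python. It is nothing like a real BGP implementation, and every prefix and node name below is invented, but it captures the constraint: the peer network keeps exactly one best path per prefix, so every user inside that network ends up at the same node.

```python
# Toy model of anycast steering granularity. Not a real BGP stack; all names invented.
# The CDN announces one name-server prefix from several nodes (anycast).
anycast_announcements = {
    "192.0.2.0/24": ["london", "frankfurt", "amsterdam"],  # one prefix, many origins
}

def best_path(prefix: str, peer_preference: dict) -> str:
    """The peer network picks a single best path per prefix for ALL of its users."""
    candidates = anycast_announcements[prefix]
    # The peer's own policy decides which origin wins; the CDN only sees the result.
    return min(candidates, key=lambda node: peer_preference.get(node, 100))

# Every user in this ISP, east London or west London, Customer A or Customer B,
# is forwarded to the same node, because the only lookup key is the prefix.
isp_policy = {"london": 10, "frankfurt": 20, "amsterdam": 30}
for user in ["east-london-user", "west-london-user", "customer-A", "customer-B"]:
    print(user, "->", best_path("192.0.2.0/24", isp_policy))   # always "london"
```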

If the CDN is anycasting its web servers, there might be slightly more flexibility. It is possible to send users to London for some web server addresses and to Paris for others. However, you can still only direct users by network, not by sub-groups within a network.

Furthermore, many networks require peering partners to follow what is called consistent announcement. This forces a CDN to announce exactly the same thing in BGP to that network at every point where they peer. Without the ability to vary its BGP announcements per node, a CDN cannot influence where traffic flows.

Finally, some things in BGP are not black and white. A CDN can remove reachability information, e.g. "you cannot get to web server XYZ in London." But anything short of that, such as "please use Madrid first, and come to London if Madrid is down," is purely a suggestion. The network receiving the BGP announcement is allowed to listen to or ignore any hints provided by the CDN. This means you can say "please use Madrid first, then London" and the peer network might say "no, I'm going to London first." There is nothing the CDN can do other than remove London as a choice completely.
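
Here is an equally simplified sketch of why those hints are only hints. Real BGP best-path selection has many more steps, but the relevant part survives the simplification: the peer's own local preference is consulted before the CDN's AS-path prepending, so the only lever the CDN fully controls is withdrawing the route. The route attributes and node names below are illustrative.

```python
# Toy best-path selection showing why BGP "hints" are only suggestions.
# The CDN prepends its AS path from London hoping the peer prefers Madrid,
# but the peer's local preference is evaluated first and overrides the hint.
routes = [
    {"node": "madrid", "as_path_len": 1, "local_pref": 100},  # the path the CDN hopes wins
    {"node": "london", "as_path_len": 4, "local_pref": 200},  # prepended, yet preferred by the peer
]

def peer_best_path(available):
    # Simplified decision order: highest local_pref wins before shortest AS path.
    return max(available, key=lambda r: (r["local_pref"], -r["as_path_len"]))

print(peer_best_path(routes)["node"])      # -> "london": the prepending hint was ignored

# The only thing the CDN fully controls is withdrawing the announcement entirely.
withdrawn = [r for r in routes if r["node"] != "london"]
print(peer_best_path(withdrawn)["node"])   # -> "madrid"
```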

Now imagine trying to mitigate a massive attack across multiple networks and multiple nodes when the only tools you have are hints that might be ignored and the ability to move traffic away from whole networks or whole CDN nodes at once, plus the million other details I did not cover.

Yeah, I don't want to think about it either.

Akamai Mapping: Does it use BGP?

Fortunately for Akamai, we do not use BGP to map users to web servers. Akamai's Mapping System can and does notice overwhelmed nodes within seconds and direct users to other nodes, regardless of their ISP or that ISP's internal BGP preferences.

Akamai has many ways of finding problem nodes and fixing them. We send probes out from each node, as well as probes into each node. And if that were not enough, we track TCP statistics on the node, which give us telemetry on production traffic to real users. When a node gets overwhelmed, traffic is moved seconds later, automatically. Human involvement is neither required nor preferred; people cannot move as fast as computers.
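
The actual Mapping System is far more sophisticated than anything I can show here, but the general idea can be sketched in a few lines. Every signal name and threshold below is purely illustrative: fold probe results and TCP telemetry into a per-node health check, and drop unhealthy nodes from the candidate set with no human in the loop.

```python
# Illustrative-only sketch: combine probe results and TCP telemetry into a health check
# and remove unhealthy nodes from the candidate set automatically.
from dataclasses import dataclass

@dataclass
class NodeHealth:
    name: str
    probe_loss: float            # fraction of external probes lost
    tcp_retransmit_rate: float   # retransmit rate seen on real user traffic
    utilization: float           # fraction of capacity in use

def healthy(n: NodeHealth) -> bool:
    # Any one signal crossing its (made-up) threshold pulls the node out of rotation.
    return n.probe_loss < 0.05 and n.tcp_retransmit_rate < 0.02 and n.utilization < 0.9

nodes = [
    NodeHealth("london",    probe_loss=0.40, tcp_retransmit_rate=0.15, utilization=0.99),
    NodeHealth("frankfurt", probe_loss=0.01, tcp_retransmit_rate=0.01, utilization=0.55),
    NodeHealth("amsterdam", probe_loss=0.00, tcp_retransmit_rate=0.01, utilization=0.40),
]

# London is overwhelmed; it simply stops being a candidate on the next cycle.
candidates = [n.name for n in nodes if healthy(n)]
print(candidates)   # -> ['frankfurt', 'amsterdam']
```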

Moreover, Akamai's system is based on DNS, not BGP. We can, and frequently do, direct "east London users to Node A and west London users to Node B" from the same network. Or even "east London users to Node A for Customer Z and west London users to Node B for Customer Y" from the same or multiple networks.
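
To see why DNS gives that kind of granularity, consider a toy mapping function. The answer to a DNS query can depend on both who is asking and which customer hostname is being resolved, and it can change from one query to the next. The hostnames, regions, and node names below are invented for illustration and are not how our real maps are keyed.

```python
# Toy mapping function: the DNS answer can key on the requester's location AND the
# customer hostname, something BGP cannot express. All names below are invented.
healthy_candidates = ["node-a", "node-b", "amsterdam", "frankfurt"]  # nodes in rotation

def map_query(customer_hostname: str, client_region: str) -> str:
    """Return the node a given user should be sent to for a given customer."""
    table = {
        ("customer-z.example.net", "east-london"): "node-a",
        ("customer-z.example.net", "west-london"): "node-b",
        ("customer-y.example.net", "east-london"): "amsterdam",
    }
    # Fall back to any healthy node if there is no specific entry.
    return table.get((customer_hostname, client_region), healthy_candidates[0])

print(map_query("customer-z.example.net", "east-london"))   # -> node-a
print(map_query("customer-z.example.net", "west-london"))   # -> node-b
print(map_query("customer-y.example.net", "east-london"))   # -> amsterdam
```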

This means that even if an attack takes out one of our nodes, the collateral damage is minimal and short-lived.

On top of that, Akamai has the most traffic of any CDN. By some estimates, we have nearly as much as all other CDNs combined. Having double-digit terabits of outbound traffic means we have to have a lot more than a few hundred Gbps of inbound capacity.

Summary

All these things together make not just serving at the edge, but serving at the edge Akamai style, a great way to fight DDoS.


Patrick Gilmore is a Chief Network Architect at Akamai

2 Comments

Hi Rob, this is an interesting insight into the architecture that allows you to offer threat-resistant services.

However, it does seem to raise difficulties for customers of your customers - of which we are one.
If we're making use of one of your customers' secure websites, the initial interactions are fine because they go to named/known servers. But if later content requests may be handed off to some "random" Akamai server for the response, that creates a security problem for us: we can't lock down our firewall to allow the secure interaction with only the "known" servers/IP addresses.
If we lock down, we take a performance hit, with waiting transactions spinning until a known server responds; if we don't lock down, we run the risk of accepting content from servers of unknown (from our point of view) security disposition.
Does your architecture provide a solution for this situation?

Thanks

Paul

Paul,

Thanks for reading the blog and for the comment. I need to check on an answer for you. Is there a good way to get in touch with you?
