
An Introduction to Internet Hygiene

Despite the common misperception, there are not a lot of rules for ISPs. There are a lot of things people think are rules or perhaps even are called rules, but in reality, they are merely suggestions.

You may think to yourself, "How can that be?", especially when documents such as "Requests for Comments" (RFCs), "Best Common Operational Practices" (BCOPs), and "Internet Official Protocol Standards" (STDs) all spell out the rules for protocols, servers, networks, and even higher-level activities. These documents use words like SHOULD and MUST (yes, in all caps), with rigid definitions. However, when an ISP does not follow the rules, there is no fine, no penalty, no Internet police to take it off to Internet jail. These "rules" end up feeling more like suggestions or recommendations.

But they're called rules for a reason: they exist to make the Internet a safer, more reliable place. We also know that ignoring the rules can lead to problems, so most ISPs follow most of the rules. In some situations, though, disobeying a rule does not cause an immediate or obvious effect. An ISP may not even realize that something is wrong, even if the impact elsewhere is large.

And therein lies the problem. The Internet is the largest shared medium in the history of humanity. When users of a shared medium do not act with its shared fate in mind, all users are harmed. There are plenty of examples where a single ISP, or a small number of ISPs, playing fast and loose with the rules caused major problems for the entire Internet.

Following the rules - keeping your network clean - is considered good Internet hygiene. Complying with all the standards might not be sexy, but just like brushing your teeth, it is vitally important to maintaining good health. Besides, who wants an ISP with rotten teeth and bad breath? Yuck!

Unfortunately, there are literally thousands of RFCs, STDs, BCPs, etc. It can be difficult to figure out which ones apply to each individual situation.

Over the next several weeks, I am going to do a series of posts highlighting the most common things ISPs miss when configuring their networks. Each of these actions is relatively low-cost or even no-cost, and will help not only the ISP that implements them but the Internet as a whole.

My initial focus will be on the Internet hygiene issues that can help stop Distributed Denial of Service (DDoS) attacks. DDoS is a scourge on the Internet, almost always harming the intended victim and frequently enlisting the help of unwitting ISPs, which harms those ISPs as well. Worse, an attack can harm networks in between the attacker and the victim. There simply is no such thing as a good DDoS attack. My hope is that these posts will spur into action some ISPs that did not realize that, by following the rules, they can protect both themselves and the whole Internet.

I welcome your comments. The more people who get involved the better, and all ideas are welcome.

 

Patrick Gilmore is a Chief Network Architect at Akamai

Distribution: What good is it?

You'll recall from Part I that I described a scenario where the London node of a network has become overloaded during a DDoS attack. You may wonder: if all the users going to London in that scenario are having problems, what good is distribution? Some of you may have already noticed the benefit I glossed over earlier: in our example with a congested London node, the users in San Jose, Tokyo, Sydney, etc. are all unaffected.

This is great news. Not only does distribution make it more likely that no node will be overwhelmed, but if one is, there should be plenty of others that are not. This minimizes the damage, as not all users will suffer during a failure.

At this point you should start feeling sorry for the poor sods going to the London node. Let's see if we can do anything about that.

Overwhelmed nodes: Any way to avoid user pain?

I have tried to convince you that some node or link will get overwhelmed no matter what. But even if you do not believe my logic, the empirical evidence is clear: this has happened to every CDN. Yes, even including Akamai. (I can hear the audible gasps from here.)

If you combine the fact that every CDN has had nodes overwhelmed with what I said above about users going to an overwhelmed node suffering, the logic says that attacks can and do harm users. Luckily, there is a way to escape this seemingly inevitable fate: do not send users to overwhelmed nodes.

If only that were as simple as it sounds.

Anycast: How granular is BGP?

Most CDNs use anycast to direct users, either through anycasting their name servers or anycasting their web servers. BGP is a crude tool, lacking granularity and precise control.

Going back to our London example: if a CDN wants to move traffic off the London node, it has to change something in BGP. If the CDN is anycasting its name servers, chances are all it can do is shift traffic away at the granularity of entire networks. You cannot tell a network "send your east London users to this node's name server, your west London users to Frankfurt's name server" with BGP. Moreover, unless the CDN has multiple prefixes with different name servers in each, it cannot say "send traffic for Customer A to Frankfurt and traffic for Customer B to Amsterdam."

If the CDN is anycasting its web servers, there might be slightly more flexibility. It is possible to send users to London for some web server addresses and Paris for other web server addresses. However, you can still only direct users by network, not by sub-groups within a network.

Furthermore, many networks require peering partners to announce consistently, which forces a CDN to announce exactly the same thing in BGP to that network at every point where the two peer. Without the ability to vary its BGP announcements per node, a CDN cannot affect where traffic flows.

Finally, some things in BGP are not black and white. A CDN can remove reachability information, e.g. "you cannot get to web server XYZ in London." But anything short of that, such as "please use Madrid first, and come to London if Madrid is down," is purely a suggestion. The network receiving the BGP announcement is allowed to listen to or ignore any hints provided by the CDN. This means you can say "please use Madrid first, then London" and the peer network might say "no, I'm going to London first." There is nothing the CDN can do other than remove London as a choice completely.
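To make the "hints can be ignored" point concrete, here is a toy best-path selection in Python. This is not a real BGP implementation; the node names, local-preference values, and AS-path lengths are all invented. The detail to notice is that the receiving network's local preference is evaluated before AS-path length, so a CDN's prepending (the usual way of saying "please prefer the other node") never even gets considered.

```python
# Toy illustration of BGP best-path selection (not a real BGP implementation).
# The CDN has prepended its AS path on the London announcement to hint
# "please prefer Madrid", but the receiving network's own local-preference
# policy is evaluated first and wins.

routes = [
    # node, local_pref set by the receiving network, AS-path length seen
    {"node": "Madrid", "local_pref": 100, "as_path_len": 1},  # CDN's preferred exit
    {"node": "London", "local_pref": 200, "as_path_len": 4},  # prepended three extra times
]

def best_path(candidates):
    # Higher local preference wins; shorter AS path is only a tie-breaker.
    return max(candidates, key=lambda r: (r["local_pref"], -r["as_path_len"]))

print(best_path(routes)["node"])  # -> "London": the peer ignored the CDN's hint
```

The only announcement guaranteed to have the intended effect is withdrawing London entirely, which is exactly the blunt instrument described above.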

Now, imagine trying to mitigate a massive attack across multiple networks and multiple nodes when your only tools are hints that might be ignored and the ability to move traffic only for whole networks or whole nodes at a time, plus the million other details I did not cover.

Yeah, I don't want to think about it either.

Akamai Mapping: Does it use BGP?

Fortunately for Akamai, we do not use BGP to map users to web servers. Akamai's Mapping System can and does notice overwhelmed nodes within seconds, and it directs users to other nodes regardless of their ISP or that ISP's internal BGP preferences.

Akamai has many ways of finding problem nodes and fixing them. We send probes out from each node, as well as probes into the nodes. And if that were not enough, we track TCP statistics on each node, which gives us telemetry on production traffic to real users. If a node gets overwhelmed, traffic is moved away seconds later, automatically. Human involvement is neither required nor preferred - people cannot move as fast as computers.
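The sketch below is not Akamai's mapping code; it is just a minimal illustration, with made-up numbers, of the kind of automation described: probe results feed a simple threshold check, and any node that looks unhealthy is drained without waiting for a human.

```python
# Minimal sketch of automated node draining (illustrative only).
# Assume probe_results maps node name -> recent probe loss rate (0.0 to 1.0).

LOSS_THRESHOLD = 0.05  # drain a node if more than 5% of its probes fail

def nodes_to_drain(probe_results: dict[str, float]) -> set[str]:
    """Return the nodes whose probe loss exceeds the threshold."""
    return {node for node, loss in probe_results.items() if loss > LOSS_THRESHOLD}

probe_results = {"london": 0.42, "frankfurt": 0.01, "amsterdam": 0.0}
print(nodes_to_drain(probe_results))  # -> {'london'}
```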

Moreover, Akamai's system is based on DNS, not BGP. We can, and frequently do, direct "east London users to Node A and west London users to Node B" from the same network. Or even "east London users to Node A for Customer Z and west London users to Node B for Customer Y" from the same or multiple networks.
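As a rough illustration of why DNS allows this kind of control and BGP does not, here is a toy mapping function in Python. The prefixes, hostnames, and node names are all invented; the point is only that a DNS-based system can key its answer on who is asking (the client or resolver address) and which customer's hostname was asked for, neither of which BGP can see.

```python
import ipaddress

# Toy mapping table: (client prefix, customer hostname) -> node to answer with.
# All prefixes and names are invented stand-ins for "east London", "west London", etc.
MAP = [
    (ipaddress.ip_network("192.0.2.0/25"),   "customer-z.example", "node-a.london"),
    (ipaddress.ip_network("192.0.2.128/25"), "customer-z.example", "node-b.london"),
    (ipaddress.ip_network("192.0.2.0/24"),   "customer-y.example", "node-c.frankfurt"),
]

def map_client(client_ip: str, hostname: str) -> str:
    """Pick a node based on who is asking and which customer's name they asked for."""
    addr = ipaddress.ip_address(client_ip)
    for prefix, name, node in MAP:
        if addr in prefix and name == hostname:
            return node
    return "node-default"

print(map_client("192.0.2.10", "customer-z.example"))   # "east London" -> node-a.london
print(map_client("192.0.2.200", "customer-z.example"))  # "west London" -> node-b.london
```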

This granularity means that even if an attack takes out one of our nodes, the collateral damage is minimal and very short-lived.

On top of that, Akamai has the most traffic of any CDN. By some estimates, we have nearly as much as all other CDNs combined. Having double-digit terabits of outbound traffic means we have to have a lot more than a few hundred Gbps of inbound capacity.

Summary

All these things together make not just serving at the edge, but serving at the edge Akamai-style, a great way to fight DDoS.


Patrick Gilmore is a Chief Network Architect at Akamai

Akamai's Chief Security Officer Andy Ellis recently commented on large DDoS attacks and how "size" can be misleading. In that post, Andy notes that if you have more than 300 Gbps of ingress capacity, a 300 Gbps attack is not going to hurt you too much.

He's right, of course. However, total ingress capacity is only part of the equation. Also important are the types of attacks you're facing and your "ingress" configuration. I'd like to dig a little deeper into these two topics and explain how a widely distributed infrastructure is useful both for improving performance and for mitigating attacks.

Not surprisingly, I've used a generic CDN example to set the stage, but most of the concepts here apply to any large backbone network with many peering and transit links. Because we are talking about CDNs, we should first ask why CDNs push traffic to the "edge", and even before that, what is the "edge"?

Why serve at the edge?

On the Internet, the "edge" usually refers to where end users ("eyeballs") are located. It is not a place separate from the rest of the Internet. In fact, the edge and what most people consider the core are frequently only a few feet from each other physically. Topology matters here, not geography.

The reason CDNs want to serve at the edge is that it improves performance. Much has been written about this, so I shan't bore you here. The important thing to realize is that all CDNs distribute traffic. However, when CDNs were invented, the distribution was not designed to mitigate attacks; it just worked out that way.

And it worked out well. Let's see why.


Ingress capacity: How much is enough?

To set the stage further, we are going to discuss a "typical" DDoS (as if there were such a thing!) and possible mitigation strategies, not a specific attack.

The first and most obvious mitigation strategy is what Andy mentioned in his post: have enough ingress capacity to accept the traffic, then drop or filter the attack packets. This raises the question of what "ingress capacity" means. If you have a single pipe to the Internet, making that pipe bigger is the only answer. While plausible, that would be very difficult, and very, very expensive, given the size of attacks seen on the 'Net today.

Now, suppose you have many ingress points, such as a CDN with multiple links and nodes. Do you need to ensure each and every point is large enough for the worst-case DDoS scenario? Doing so would be insanely expensive and, frankly, nearly impossible. Not every point on the Internet can support connections in the hundreds of Gbps.

Fortunately, the first 'D' in DDoS is "Distributed", meaning the sources of a DDoS are not in a single location or network but are spread widely. And because of the way the Internet works, different attack sources will typically hit different CDN nodes and/or links.

The chance of a distributed attack all hitting the same node is very, very small. Of course, the opposite holds as well: the chance of an attack spreading perfectly evenly over all points is essentially nil. As such, you cannot just divide the attack by the number of nodes and assume that amount is the maximum bandwidth each node needs to survive an attack. How much more capacity is needed per node depends on the exact situation, and it cannot be predicted in advance. This leads us to our next topic.

Node capacity: How much is enough?

Trying to size each node properly for an attack is an art, not a science. A black art. A black art that is guaranteed to fail at some point.

Of course, everyone still tries. They attempt to estimate what attack capacity is needed based on things like where a node is, what customers are on the system, how much connectivity costs, and several other factors. For instance, a node in Düsseldorf exclusively serving a small DSL network probably does not need as much attack capacity as a large node in Silicon Valley serving several Tier-1 providers. Engineers pray to the network gods, sacrifice a router and maybe a switch or two in their names, make a plan, implement it, and... pray some more.

But pray as they might, the plan will fail. Sooner or later, some attack will not fit the estimates, and a node will be overwhelmed with attack traffic. Don't let this get you down if you are making your own plan; remember, the same is true for everyone. Not only is it impossible to predict how large an attack is going to be, but as mentioned above, it is also impossible to predict what percentage of the attack will hit each node. Worse, since unused attack capacity is wasted money - a lot of wasted money - CFOs tend to push for less rather than more, making the plan that much more likely to fail.
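To see how slippery those per-node estimates are, here is a throwaway simulation in Python. The attack size, node count, and skew are all invented; the only point is that with any uneven spread, the busiest node sees far more than the arithmetic average, so provisioning each node for attack-divided-by-node-count is never enough.

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

ATTACK_GBPS = 300   # total attack size (made-up number)
NUM_NODES = 30      # number of CDN nodes (made-up number)

# Give each node an uneven share of the attack rather than a perfect 1/30 split.
weights = [random.random() ** 3 for _ in range(NUM_NODES)]
shares = [ATTACK_GBPS * w / sum(weights) for w in weights]

print(f"'fair share' per node: {ATTACK_GBPS / NUM_NODES:.0f} Gbps")
print(f"busiest node actually sees: {max(shares):.0f} Gbps")
# The busiest node typically sees several times its 'fair share'.
```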

The problem with an overloaded node is that it doesn't matter how many nodes you have: if one is overloaded, any traffic going to that node will be affected. This means that if your link to London is overwhelmed with attack traffic, it doesn't matter how many nodes you have in Tokyo, Sydney, San Jose, etc.; your users going to the London node are suffering.

As a result, while CFOs push for less, engineers push for more.

In Part II, I'll cover in greater detail why a distributed infrastructure, such as the Akamai Intelligent Platform, is ideal for mitigating even the largest of DDoS attacks.

 

Patrick Gilmore is a Chief Network Architect at Akamai

What can be done about spoofing and DNS amplification?

There's been a lot of talk about the recent very large DDoS attack against Spamhaus. Although I was quoted in some articles about it, I want to be clear that the attacks did not affect or involve Akamai or our customers. However, we have been the object of similar attacks in the past, and Akamai has a vested interest in making the Internet better - safer, more reliable and higher performing.

Unfortunately, there are some common... let's call them "misconfigurations" on the Internet which make these types of attacks both easier and much more destructive than they should be.

The one most talked about recently is DNS amplification. This has been discussed many times in many places, so I won't go into a great deal of detail. Very generally, a miscreant can send a query (a very small amount of data) to a DNS server, and the server will send 100 or more times as much data in response - to the victim. It is the equivalent of a stranger sending a postcard to a store and you getting a gigantic catalog in return.

Now imagine getting thousands or even millions of gigantic catalogs delivered to your house - per second. You couldn't even leave your house. What's more, if enough are sent, the catalogs could jam your whole street or even neighborhood, causing a significant amount of collateral damage.
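Stepping out of the catalog analogy for a moment, the sketch below shows one rough way to see the amplification factor in DNS terms: send a single large (ANY) query and compare the size of what you sent with the size of what came back. It assumes the dnspython library, the resolver address and query name are placeholders, and you should only measure servers you are responsible for.

```python
import dns.message
import dns.query
import dns.rdatatype

RESOLVER = "192.0.2.53"   # placeholder: a resolver you operate yourself
NAME = "example.com"      # placeholder query name

query = dns.message.make_query(NAME, dns.rdatatype.ANY)
query.use_edns(0, payload=4096)  # advertise room for a large UDP response
response = dns.query.udp(query, RESOLVER, timeout=2)

sent = len(query.to_wire())
received = len(response.to_wire())
# Note: many servers truncate or refuse ANY over UDP, so this is only a rough probe.
print(f"sent {sent} bytes, got {received} bytes back: ~{received / sent:.0f}x amplification")
```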

ISPs can stop this by ensuring their DNS servers only answer queries from users inside their own networks. For example, if I run Patrick's Network, I would ignore any queries from users in Mike's Network. Answering those would be a waste of my resources, but more importantly, it could be abused to attack Mike's network. Even though locking down DNS servers is a good idea, many, many ISPs do not do this. There are currently upwards of 20 million DNS servers configured to answer queries from anyone on the Internet.
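A simple way to see whether a resolver is open is to send it an ordinary recursive query from an address outside its network and check whether it answers. The sketch below again assumes dnspython; the resolver address is a placeholder, and the result only means something when the test is run from a vantage point the resolver should not be serving.

```python
import dns.exception
import dns.message
import dns.query
import dns.rdatatype

RESOLVER = "192.0.2.53"  # placeholder: test from OUTSIDE this resolver's network

query = dns.message.make_query("example.com", dns.rdatatype.A)  # RD flag set by default
try:
    response = dns.query.udp(query, RESOLVER, timeout=2)
    if response.answer:
        print("open resolver: it answered a recursive query from a stranger")
    else:
        print("no answer section: it appears to refuse recursion for outside clients")
except dns.exception.Timeout:
    print("no response: queries from outside appear to be dropped")
```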

While this situation isn't ideal and ISPs should lock down their DNS servers, it is not the root cause of the problem. The real problem is source address spoofing. The reason the store sends you a catalog is that the return address on the postcard is yours, not the miscreant's. The store doesn't know you did not ask for the catalog - just the opposite: because yours is the return address, the store - or in the case of the Internet, the server - thinks it is doing you a favor by delivering the catalog.

Combining these two situations can create a couple of serious problems for the Internet. First, it allows a miscreant to get a massive "bang for their buck". A little attack traffic can be amplified 100X. Second, and in some ways more importantly, it hides the true source of the attack. When the victim analyzes the attack traffic, they only see the misconfigured DNS servers' addresses; they do not know the miscreant's address. Either of these reasons is sufficient to require stopping this, but the two combined can be disastrous.

The rules on the Internet (as much as the Internet has rules) state that ISPs should check all the packets leaving their network to ensure the source addresses belong to their own network. Put another way, ISPs are supposed to stop any postcards which have a fake return address. This is codified in a Best Current Practice document, BCP-38.

This problem is not new, and all major router vendors built simple ways to implement BCP-38 into their equipment years ago. However, many ISPs never turn the feature on, allowing spoofed packets to leave their networks.
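Conceptually, the filter is nothing more than "does this packet's source address belong to one of my own prefixes?". The Python below is only a model of that check, with made-up prefixes; on real routers it is done with an access list or unicast reverse-path forwarding, not in software like this.

```python
import ipaddress

# Prefixes this ISP actually originates (made-up example addresses).
MY_PREFIXES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def permit_egress(src_ip: str) -> bool:
    """BCP-38 in one line: only let a packet leave if its source address is ours."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in prefix for prefix in MY_PREFIXES)

print(permit_egress("198.51.100.7"))  # True: legitimate customer traffic
print(permit_egress("192.0.2.99"))    # False: spoofed source, drop it at the edge
```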

There are several reasons an ISP may not configure its network to disallow spoofing. Let's discuss a few of the most common:

* Lack of Knowledge
A lot of ISPs simply do not know they are supposed to do this, or do not understand the consequences of not doing it. This is why I am beating the drum and working to get people to implement BCP-38. It's an important piece of Internet hygiene. To be clear, this is not a silver bullet: configuring BCP-38 will not stop all attacks on the Internet, but it will help.

* Time & Effort
It takes time and effort to implement BCP-38. ISPs are businesses, and they have a strong motive to be profitable. Implementing BCP-38 does not have clear profit or revenue behind it. A CFO may think the time an engineer spends implementing BCP-38 is time that could have been spent doing something to make money. I personally believe this is a false choice. Not configuring BCP-38 opens the ISP to abuse, and not just through DNS amplification. Just as important, we're dealing with shared fate: if no one does it, everyone is vulnerable, while if everyone does it, no one is vulnerable. A true accounting of the costs will likely show implementing BCP-38 is actually good for the bottom line.

* Risk to Revenue
Configuring filters means the risk of misconfiguring filters, which could in turn lead to filtering legitimate customer traffic. The larger a network, the more likely this scenario becomes. Businesses try hard not to upset their customers. Misconfiguration is a real threat, and ISPs need to be careful of it. But everything an ISP does carries risk, which is why ISPs have processes and procedures in place to help evaluate and mitigate these potential risks. In addition, it is possible to tie the outbound filters to inbound routing, as sketched below. In that case, the only way to break the outbound filter is to break the inbound routing, which means the customer would not be connected anyway.
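That "tie the outbound filters to inbound routing" idea is what router vendors call unicast reverse-path forwarding (uRPF): accept a packet only if the route back to its source points at the interface the packet arrived on. Here is a toy model of the strict-mode check; the routing table and interface names are invented, and a real router would use longest-prefix matching rather than a simple loop.

```python
import ipaddress

# Toy routing table: prefix -> interface the route points out of (all values invented).
ROUTES = {
    ipaddress.ip_network("198.51.100.0/24"): "cust-1",
    ipaddress.ip_network("203.0.113.0/24"):  "cust-2",
}

def urpf_strict(src_ip: str, arrival_interface: str) -> bool:
    """Strict uRPF: accept only if the route back to the source uses the arrival interface."""
    addr = ipaddress.ip_address(src_ip)
    for prefix, interface in ROUTES.items():
        if addr in prefix:
            return interface == arrival_interface
    return False  # no route back to the source at all, so drop

print(urpf_strict("198.51.100.7", "cust-1"))  # True: source is routed back via this customer
print(urpf_strict("192.0.2.99", "cust-1"))    # False: spoofed source, dropped
```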

In summary, while it may seem like there are good reasons not to implement BCP-38, there really are not. Yet there are many important reasons to implement it. Akamai strongly urges all ISPs to implement BCP-38 as quickly as possible.

 

Patrick Gilmore is a Chief Network Architect at Akamai