SOTI Security Series: Exploratory Data Analysis of a DDoS Database

By Richard Willey, Senior Program Manager - Adversarial Resilience

Akamai maintains a database that records information about different attacks it has observed.  The ongoing analysis of that database is captured each quarter in Akamai's State of the Internet Security Report. (Download the Q1 2015 report here.) But even after a report is released, researchers continue to dig deeper into the data and provide updates.

To that end, this article describes an exploratory data analysis exercise of attacks captured by PLX Routed and Proxy DDoS solution scrubbing centers between Q1 2013 and Q1 2015.

Each row in the database represents a unique attack directed against Akamai customers.  The database contains approximately 5,500 unique entries.  

Columns describe information about the various attacks.  This analysis will focus on three of these variables:  Quarter, gbps, and pps.

  • Quarter records when an attack happened.  We will be looking at nine quarters' worth of data, stretching from the start of Q1 2013 through the end of Q1 2015.

  • gbps (gigabits per second) measures the bandwidth of an attack at its high watermark.

  • pps (packets per second) measures the packet rate of an attack at its high watermark.

To start with, I want to get a feel for the shape of the data.  I'll start by generating the simplest possible charts -- plots of gbps and pps over time -- and then proceed to more complex visualizations based on what the initial plots reveal.

[Figure: gbps plotted for each attack, in order of attack start time]

Note that the X axis is labeled as "Index".  R uses this convention to indicate that the data is displayed in the same order as the rows in the dataset.  (I ordered the rows in the database by the start times of the attacks.)

Upon inspection, a couple of clear trends emerge:

  1. The overwhelming majority of the attacks are relatively small.  That black smear at the base of the chart is an enormous number of small attacks plotting over one another.

  2. There is a spike in the attack size towards the middle of the chart.  We see a number of 200 gbps attacks that don't show up later in the data series.

The graph for pps looks much the same, although there were some large pps attacks towards the start of the chart that don't show up on the gigabits per second chart.

[Figure: pps plotted for each attack, in order of attack start time]

Next, I'll generate a simple scatter plot of gbps versus pps and look for obvious correlations between the two data sets.

[Figure: scatter plot of gbps versus pps, one point per attack]

Recall that each of these data points represents a separate attack that was fielded by one of our scrubbing centers.  If we extend a ray from the origin to any one of these points, the slope of the line segment shows the ratio of gigabits to packets (in effect, the average packet size), which provides information about the mix of traffic types.
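
To make the slope concrete, here is a minimal Python sketch that converts a (gbps, pps) pair into the average packet size it implies. It assumes that pps is stored as raw packets per second and that a gigabit is 10^9 bits; the function name and the sample values are hypothetical and are not taken from the database.

```python
def avg_packet_bytes(gbps, pps):
    """Average packet size in bytes implied by a bandwidth and a packet rate."""
    if pps <= 0:
        raise ValueError("pps must be positive")
    bits_per_second = gbps * 1e9      # assumption: 1 gigabit = 10^9 bits
    return bits_per_second / 8 / pps  # 8 bits per byte


# Hypothetical (gbps, pps) pairs, made up for illustration only.
attacks = [(1.0, 250_000), (4.0, 1_000_000), (0.5, 1_000_000)]
sizes = [avg_packet_bytes(gbps, pps) for gbps, pps in attacks]

for (gbps, pps), size in zip(attacks, sizes):
    print(f"{gbps} gbps at {pps} pps -> {size:.1f} bytes per packet")
```

The first two sample attacks differ fourfold in size but imply the same packet size, so they would sit on the same ray through the origin.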

Even a cursory look at this data shows some intriguing patterns:

  1. Once again, it's clear that most of the attacks are clustered close to the origin; however, some attacks are significantly larger.

  2. The overwhelming majority of the attacks fall within a fairly well-defined triangle whose vertex is the origin.  The slopes of the rays that define this triangle correspond to the minimum and maximum packet sizes used in attacks.

However, the main thing that I am concluding is that this plot is way too busy.  I'm going to start structuring the data and seeing whether I can get a better idea of what's going on.  It seems plausible that the nature of DDoS attacks might be changing over time.  (Perhaps attack sizes are getting larger or the specific techniques being used are changing.)  I'm going to use the Quarter as a conditioning variable and generate nine separate scatter plots, each of which shows a different quarter.

[Figure: scatter plots of gbps versus pps, one panel per quarter, Q1 2013 through Q1 2015]

There are three things about this chart that immediately grab my attention.

  1. Attack size, especially measured in gigabits per second, seems to be growing over time.  In particular, Q3 2014 has a lot of large attacks.

  2. The bottom-left corner of each chart is still an awful mess.  There are still too many attacks plotting over one another.  Using a log scale might help, but I prefer not to use log scales on multiple axes if I can help it.  (I'm probably going to start segmenting the data.)

  3. When I look at the center chart (Q1 2014), I can see a bunch of data points running in what looks to be a straight line with (roughly) a 45 degree slope.  This suggests that there are a bunch of attacks of very different sizes, all of which exhibit the same ratio of gbps:pps.

Let's start by looking at the last bullet point.  It appears as if a significant number of the attacks that we observed in this quarter have a near-constant ratio of gbps/pps.  I created the next chart by calculating the average packet size for each attack and sorting the attacks from smallest to largest.  As you can see, there is a clear plateau somewhere around 485 bytes.
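
A plateau in a sorted packet-size chart can also be located programmatically. The Python sketch below is one possible approach, not the method used to produce the chart: it looks for the longest run of sorted values that stay within a tolerance of one another. The function name, the 10-byte tolerance, and the sample sizes are all assumptions made for illustration.

```python
import statistics


def longest_plateau(sorted_sizes, tolerance=10.0):
    """Return (start, end) indices of the longest run of sorted values
    whose largest and smallest members differ by at most `tolerance`."""
    if not sorted_sizes:
        raise ValueError("need at least one value")
    best = (0, 0)
    start = 0
    for end in range(len(sorted_sizes)):
        # Shrink the window from the left until it fits the tolerance.
        while sorted_sizes[end] - sorted_sizes[start] > tolerance:
            start += 1
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


# Hypothetical average packet sizes in bytes, one per attack (made up).
sizes = sorted([1400, 64, 483, 60, 486, 300, 480, 70, 488, 485])
start, end = longest_plateau(sizes)
plateau = sizes[start:end + 1]
plateau_center = statistics.median(plateau)
print(f"plateau of {len(plateau)} attacks centered near {plateau_center} bytes")
```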

[Figure: average packet size per attack for Q1 2014, sorted from smallest to largest]

When I went back and examined the pcap files for these attacks, it was immediately obvious that these attacks were NTP reflection attacks consisting of an enormous volume of 482-byte frames.  (There is some statistical noise because the "good" traffic is included in the trace files and is impacting the mean.)  Personally, I find this result really exciting.  Any time I can create a chart where I can eyeball a large number of attacks and make a useful inference about the traffic type, it's a good day.

It's probably time to bite the bullet and try to understand the central tendencies of the data, by which I mean "Let's create some charts that describe what 'typical' attacks look like".  Many people would start by looking at the mean (average) attack size by quarter; however, there are a few reasons why I prefer not to.

  1. We know that the dataset contains some significant outliers and the mean is relatively sensitive to outliers.  

  2. We have reason to believe that the data might be autocorrelated.  

  3. The data doesn't exhibit constant variance (the data set exhibits heteroskedasticity)

All of this suggests using more robust measures of central tendency and spread, such as the median and the interquartile range.
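
The effect of a single outlier on the mean, compared with the median and interquartile range, can be illustrated with a short Python sketch. The attack sizes below are invented for the example, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the quartile convention used elsewhere in this analysis.

```python
import statistics


def robust_summary(values):
    """Mean, median, quartiles, and IQR for a list of attack sizes."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


# Hypothetical attack sizes in gbps: mostly small, plus one very large attack.
sizes = [1, 2, 2, 3, 3, 4, 5, 200]

with_outlier = robust_summary(sizes)
without_outlier = robust_summary(sizes[:-1])  # drop the 200 gbps attack

print("with outlier:   ", with_outlier)
print("without outlier:", without_outlier)
```

In this made-up sample the single large attack pulls the mean far above the upper quartile, while the median is the same with or without it.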

I'm going to jump right to what I hope is the best plot to visualize the data.

  1. I'm going to generate a series of box and whisker plots to show the interquartile range and the median of the data.  The data will be conditioned by quarter so we can see how these values change over time.

  2. I am going to overlay the various attacks that took place each quarter so we can better see how the individual events correspond to the measure of central tendency.

  3. The data will be displayed using a log scale so we can see both the "small" attacks and the large attacks on the same graph.

  4. The boxes are notched so we can see whether there are statistically significant changes in the median.

(Please note, I trimmed this dataset to exclude a small number of unusually small attacks whose size was less than 1/1000th of a gigabit per second)
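
For readers unfamiliar with notched box plots: R's box plot statistics place the notch at the median plus or minus 1.58 × IQR / sqrt(n), and two boxes whose notches do not overlap are usually read as having different medians. The Python sketch below applies that rule. It assumes the chart follows this convention, it ignores the log scale, it uses Python's default quartile method (which differs slightly from R's hinges), and it runs on made-up data; the function names are hypothetical.

```python
import math
import statistics


def median_notch(values):
    """Return the (lower, upper) notch interval around the median."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width


def notches_overlap(a, b):
    """True when two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical samples standing in for three quarters of attack sizes.
quarter_a = list(range(1, 101))            # values 1..100
quarter_b = [v + 5 for v in quarter_a]     # small shift in the median
quarter_c = [v + 50 for v in quarter_a]    # large shift in the median

notch_a = median_notch(quarter_a)
notch_b = median_notch(quarter_b)
notch_c = median_notch(quarter_c)

print("a vs b overlap:", notches_overlap(notch_a, notch_b))
print("a vs c overlap:", notches_overlap(notch_a, notch_c))
```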

[Figure: notched box-and-whisker plots of attack size in gbps by quarter, log scale, with individual attacks overlaid]

Here's what I see when I look at this data:

  1. The first four quarters (Q1 2013 → Q4 2013) look pretty similar to one another.  The upper boundary of the IQR is roughly the same.  Three of the four medians are statistically similar to one another.

  2. Somewhere between Q4 2013 and Q1 2014, things change.  The upper bound of the IQR has increased significantly (recall, this is a log scale), as has the median.

  3. In Q4 2014, things change once again.  Here we see a decrease in the upper bound of the IQR without an accompanying decrease in the median attack size.

In addition to this, if you look at the number of attacks by quarter, you'll see that it is increasing over time.  There are a lot more attacks in Q1 2015 than in Q1 2013.  I am bringing this point up because I want to caution people about misinterpreting the data.  In Q1 2013, Prolexic was an independent company.  By Q1 2015, Prolexic was part of Akamai, with a much larger sales channel, so some of the increase likely reflects a larger customer base rather than a rise in attacks overall.

At this point, I am going to go back and create a couple more charts.  To start with, I am going to once again chart gbps versus pps conditioned on Quarter; however, this time I am going to filter the data set to remove all of the large attacks (with "large" being attacks greater than 20 gbps in size).  I am hoping that excluding the large attacks will allow me to better visualize the bulk of the data.

[Figure: scatter plots of gbps versus pps by quarter, attacks of 20 gbps or less, with one ray circled in red and another circled in blue]

This chart shows a lot of interesting patterns.  I am particularly excited by the fact that so many of the attacks fall onto a couple of well-defined rays.  Recall that all of the attacks that fall along a ray have the same slope and exhibit the same ratio of gbps:pps.  One of the rays (the one circled in a red oval) is prominently featured in all quarters from Q3 2013 on and is, arguably, visible in the earlier quarters as well.  The red oval corresponds to attacks with an average packet size hovering around 300+ bytes; these attacks are labeled as Simple Service Discovery Protocol (SSDP) floods.  It's also possible to see a new ray, identified with a blue oval, emerging in the data.  This ray corresponds to SYN flood attacks.  SYN floods have (obviously) been around for a long time; however, this visual suggests that the size of the SYN floods has been increasing significantly over the last three quarters.

The last chart is going to be an alternative representation of the same type of information that can be found in the sorted packet size chart.  This time around, I am going to run a kernel smoother over the data and plot the resulting chart.  (For those of you who haven't run across kernel smoothers before, think of this chart as a histogram with an enormous number of very narrow bins. The area under the curve integrates to 1 and the height of the curve gives information about the relative frequency of the different packet sizes. I personally find this a more useful way to look at the data.)
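
As a rough illustration of what a kernel smoother computes, the Python sketch below builds a kernel density estimate by placing a small bell curve on each observation and averaging them. The Gaussian kernel, the 15-byte bandwidth, and the sample packet sizes are all assumptions chosen for the example; the article does not state which kernel or bandwidth was used.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function giving the estimated density at any point x."""
    if not samples or bandwidth <= 0:
        raise ValueError("need samples and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centered on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical average packet sizes in bytes, forming two clusters (made up).
packet_sizes = [60, 62, 64, 66, 300, 310, 320, 330]
density = gaussian_kde(packet_sizes, bandwidth=15.0)

# Approximate the area under the curve with a 1-byte-wide Riemann sum.
grid = range(-200, 601)
area = sum(density(x) for x in grid)

print(f"area under curve: {area:.4f}")
print(f"density near 63 bytes:  {density(63):.5f}")
print(f"density near 180 bytes: {density(180):.5f}")
print(f"density near 315 bytes: {density(315):.5f}")
```

With two well-separated clusters, the estimate has a hump over each cluster and falls close to zero between them, which is the kind of shape described as bimodal.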

[Figure: kernel density estimate of average packet size]

The data set clearly shows a bimodal distribution.  Here, once again, we can easily spot both the SYN floods (the left hump) and the SSDP floods (the right hump).

I hope that folks have enjoyed this little exercise. Please email us at stateoftheinternet@akamai.com and let us know if there are additional visualizations you might find of interest, and keep an eye on stateoftheinternet.com for periodic updates of data visualizations from the SOTI security team.
