So far, our exploration of methods for normalizing the number of web application attacks sourced by each country has considered only contextual variables from external sources, which characterize each country in a context independent of Akamai. This final leg of the journey will situate the attack counts within a context that is specific to Akamai and to the characteristics of the attacks themselves.
Waste No Data
Previously, we've made use of only a single data point: the total number of attacks sourced by each country. Thankfully, that's not the only data at our disposal. The attack activity observed from each country can be further described by considering the number of unique targets and their respective industries, the number of unique vectors used and their frequencies, as well as the number of unique autonomous system numbers (ASNs) observed as attack sources within each country. With this additional data, we can gain a more in-depth perspective on the characteristics of the attack activity. These variables also yield sensible units, e.g. average number of attacks per target, which makes them appropriate for calculating an attack rate and thus for a univariate choropleth map.
Let's take a look at how these variables relate to attacks sourced using scatter plots and Spearman's correlation.
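For readers who want to reproduce this kind of comparison, here is a minimal sketch of Spearman's rank correlation using only the Python standard library. The function names and the sample numbers are hypothetical and are not taken from the Akamai data set. The sketch ranks both variables (tied values share the mean of their ranks) and then computes the Pearson correlation of the ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-country values, for illustration only.
attacks = [120, 4_500, 98_000, 2_300_000, 15_000]
unique_targets = [3, 40, 310, 900, 60]
rho = spearman(attacks, unique_targets)
```

Because Spearman's correlation depends only on rank order, it is unaffected by the fact that attack counts span several orders of magnitude, which is one plausible reason to prefer it over Pearson's correlation for data like this.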
Each of these variables is highly correlated with the attack counts. It makes sense that as the number of attacks increases, the targets and means of the attacks become more diverse. For the purpose of identifying anomalous activity, there needs to be some evidence of a relationship or trend between the two variables. We can't say what is abnormal until we define normal. Some of the correlations we see here are so high that they may indicate a lack of any significant deviations from the overall trend.
Due to the nature of the data, the variables for unique industries targeted and unique vectors have relatively few possible values -- twenty-four distinct industries and eight distinct vectors. From the scatter plots, we can see that this coarsely groups the observations, with the largest group at the maximum value. Calculating an attack rate with respect to these variables is simply taking large groups of observations and dividing them by a constant. Consequently, normalizing the attack counts with either of these variables will likely produce a map that is indistinguishable from a map of the raw counts, which tells us nothing new.
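The point about dividing by a constant can be illustrated with a short sketch. The country labels and counts below are hypothetical; the only detail taken from the text is the maximum of eight distinct vectors.

```python
# Hypothetical attack counts for three countries that all used every vector.
attacks = {"A": 5_000_000, "B": 750_000, "C": 120_000}
unique_vectors = {"A": 8, "B": 8, "C": 8}  # all at the maximum of eight

# Attack rate: attacks per unique vector observed.
rates = {country: attacks[country] / unique_vectors[country] for country in attacks}

# Order countries from largest to smallest under each measure.
order_by_count = sorted(attacks, key=attacks.get, reverse=True)
order_by_rate = sorted(rates, key=rates.get, reverse=True)

# The ordering and the relative sizes are unchanged by the normalization.
same_order = order_by_count == order_by_rate
ratio_counts = attacks["A"] / attacks["B"]
ratio_rates = rates["A"] / rates["B"]
```

When most large sources share the same divisor, both the ordering and the relative magnitudes survive the normalization, so a choropleth of the rates would be shaded essentially the same way as one of the raw counts.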
In the unique targets plot, over half of the observations, those with fewer than 250 targets, are densely grouped. The number of attacks initially grows at an accelerated rate and then steadies once the number of unique targets surpasses 250. This tells us that the attack counts associated with countries with a relatively small number of targets have a much wider range of magnitudes (note that attack counts are plotted on a logarithmic scale) than those associated with countries with a large number of targets.
The three maps above are seemingly identical, so those two variables are not great choices for normalization. Let's move on to the other two contextual variables, number of unique targets and unique source ASNs.
The map for the variable with the highest correlation, unique targets, is also indistinguishable from the map of raw counts. The map describing the rate of attacks per observed ASN shows some significant changes from the raw counts, the most extreme being in northern Africa, specifically Tunisia, as well as in Liberia -- the biggest outlier we observed in the previous post.
Normalizing by unique targets, unique target industries, and unique vectors used did not reveal any anomalies, but a lack of anomalous behavior is still useful intelligence about the nature of these attacks. Now we know that none of the larger sources are exclusively targeting a single industry or exclusively utilizing a single vector. However, the lack of exclusivity doesn't rule out that a source prefers a specific type of target or vector. To find out which countries have this preferential behavior, we need a metric to describe the manner in which the attacks are spread across targets and vectors.
Measuring Diversity Among Targets and Methods
Another important aspect to consider, beyond the number of unique targets or vectors observed, is how the attacks were distributed across those designations. It would be useful to know if the attacks sourced from a country tend to favor a particular target, industry, or means of execution. One way to measure this dispersion across a discrete variable is the Shannon entropy. If the data includes only a single unique value, the entropy is zero, its minimum value. In other words, if one randomly selected a single observation from that data, there would be zero uncertainty as to its value. The entropy, or uncertainty, increases as the number of unique values increases and as the proportions of those unique values become more equal. Data containing n unique values where each value occurs an equal number of times will have the maximum entropy of ln(n). Thus, as the number of possible unique values increases, the possible maximum of the entropy increases as well. So a low entropy with respect to targeted industries indicates that the attacks sourced from a country tended to favor only a few specific ones.
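The properties described above can be checked with a minimal sketch of Shannon entropy using the natural logarithm. The function name and the sample industry labels are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("entropy is undefined for an empty sequence")
    # H = sum over categories of p * ln(1 / p), where p is the category's share.
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Every attack hits the same industry: no uncertainty, entropy is zero.
single = shannon_entropy(["retail"] * 10)

# Attacks spread evenly over four industries: entropy reaches ln(4).
uniform = shannon_entropy(["retail", "media", "finance", "gaming"] * 5)

# Attacks heavily favoring one of four industries: entropy falls in between.
skewed = shannon_entropy(["retail"] * 17 + ["media", "finance", "gaming"])
```

In this sketch the skewed sample touches the same four industries as the uniform one, yet its entropy is lower. A count of unique industries alone cannot capture that difference.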
Again, we'll start by looking at the Spearman correlation and scatter plots.
Both entropies across targets and industries are notably correlated with the number of attacks, and also show a few interesting outliers in the upper left portion of the plot. The entropies across vectors and ASN sources have a significantly lower correlation, but this doesn't render them useless for our visualizations.
From the map of entropy across targets, we can see that it is largely in agreement with the number of attacks sourced, with the notable exceptions of Brazil and Japan. Both have relatively low entropy values despite their relatively high values for total attacks and number of unique targets. This indicates that a significant portion of attacks from these countries are directed toward a small proportion of targets. Similarly, the map of entropy across industries in the upper right shows a general agreement with the number of attacks, with the same exceptions of Brazil and Japan.
Based on the map of entropy across vectors in the lower left, we see that Lithuania, Bulgaria, Ukraine, Moldova, the United Arab Emirates, and Thailand all feature a low level of diversity of attack types and could possibly favor just one or two vectors.
In the map of entropy across source ASNs, Belarus, Kuwait, Algeria, and Liberia are highlighted as countries where a specific AS was used much more often and/or the number of unique ASNs associated with the attacks sourced is very low.
Comparing Volume of Attacks to All Traffic Through Akamai
If our goal is to identify countries where the number of attacks sourced is unusually high or low within an Akamai-specific context, our best bet is to normalize the attack counts with some metric that describes the total volume of traffic seen on Akamai's platform from each country. Let's see how the raw attack counts from November 2017 to April 2018 relate to the total number of hits (roughly HTTP requests) to Akamai's edge servers from each country during that same time frame.
The high correlation and plot show a strong relationship between the volume of attacks and total traffic, but there are a few significant outliers that fall above the overall trend where the proportion of attacks seems abnormally high. Let's investigate further using both a univariate and bivariate point of view.
Both maps reinforce the notion that the two variables are generally in agreement. Similar to the findings in the previous post, Liberia stands out as a country where the number of attacks sourced is much higher than one would expect. The Netherlands and Moldova also have abnormally high rates, but nowhere near as extreme as Liberia's. Just to emphasize the magnitude of this anomaly, the U.S. was the largest source of attacks with over 237 million attacks total and a rate of less than 2 attacks per million edge hits. Liberia produced just under one million attacks, but at a rate of ~151 attacks per million edge hits. The two tables below further illustrate the discrepancy.
Considering just the raw counts, the U.S. sourced more than double the number of attacks sourced by the Netherlands. After normalizing the attack counts by total hits to Akamai's edge servers, the Netherlands is sourcing attacks at a rate approximately twelve times greater than that of the United States. One possible reason for this inflated attack rate is that the Netherlands has a well-developed and heavily used internet infrastructure that puts a high value on privacy. Consequently, it is known to offer connections that are fast, cheap, and anonymous, making it a popular conduit for attack traffic originating all over the globe. Similarly, Moldova and Ukraine offer some of the cheapest rates for broadband connections worldwide, which may make them frequently used intermediaries for attack traffic as well.
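The attacks-per-million-edge-hits rate used in this section can be sketched as follows. The function name and the input figures are hypothetical; they only mimic the pattern described above, in which a large source has a low rate and a small source has a high one, and are not the actual Akamai counts.

```python
def attacks_per_million_hits(attacks, edge_hits):
    """Attack rate: number of attacks per one million edge-server hits."""
    if edge_hits <= 0:
        raise ValueError("edge_hits must be positive")
    return attacks * 1_000_000 / edge_hits


# Hypothetical (attacks, total edge hits) pairs, for illustration only.
observations = {
    "large_source": (2_000_000, 1_000_000_000_000),
    "small_source": (150_000, 1_000_000_000),
}

rates = {
    name: attacks_per_million_hits(attacks, hits)
    for name, (attacks, hits) in observations.items()
}

# Ranking by raw count and ranking by rate can disagree.
top_by_count = max(observations, key=lambda name: observations[name][0])
top_by_rate = max(rates, key=rates.get)
```

In this made-up example the source with far fewer attacks ends up with the higher rate, which is the same kind of reversal the text reports for Liberia relative to the U.S.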
Based on the two tables above, the attack activity coming out of Liberia, Moldova, the Netherlands, and Ukraine is worthy of further investigation due to the high attack rates. This is especially true of the Netherlands: it is the second largest source of attacks, and a better understanding of its attack traffic could lead to better detection and mitigation of these attacks.
The Fruits of Our Labor
Our quest to provide relevant context to the web attack counts per country via normalization has provided valuable insights into the attack data. We consistently observed that the relatively small source of Liberia actually acted as a much larger source than one would expect relative to the selected normalization variables. We discovered that while the U.S. is by far the largest source of attacks, the rate of attacks with respect to total traffic on Akamai's platform is low compared to other large sources like the Netherlands and Ukraine. We found evidence that attacks sourced from Brazil and Japan tend to favor specific targets and industries. All of these observations would have been impossible to draw from the raw attack counts alone.
The purpose of this endeavor was to not only gain intelligence, but to also emphasize that context is hugely important in understanding and visualizing data. By the necessity of simplicity for easy interpretation, a visualization can only provide a generalized view of a system at work. It is crucial that the added context is thoughtfully selected to ensure that the resulting perspective yields useful intelligence instead of specious judgements.