Akamai Security Intelligence
& Threat Research

Data Spaghetti: Throw it at the wall and see what sticks

In the last installment, we introduced the challenge of normalizing a geographic visualization showing the observed number of web application attacks sourced from each country. This time, we'll try to discern which potential normalizing variables could have a significant relationship with the attack counts through exploratory analysis and hopefully gain some new insights.

Inferring Influential Factors

Most of the top 10 source countries of web application attacks shown in the table below are economic and technological powerhouses. As such, these rankings should be no surprise. Waging an attack requires a certain level of knowledge, hardware, connectivity, and time. Thus, residents of countries with high levels of technological innovation, internet usage, and economic output have greater access to these crucial resources. With this in mind, variables that are indicative of a country's economic strength, level of technological innovation, and prevalence of internet access could be good candidates for normalization of attacks sourced per country.

However, it should be noted that with the advent of automated tools to search for and exploit vulnerabilities, those without a wealth of hacking know-how and free time can now execute attacks. Plus, we should always keep in mind that the data pertaining to source country is not 100% reliable as an adversary can either conceal their own IP address or make use of a compromised device.

Top 10 source countries of web application attacks (Nov 2017 - Apr 2018)

Before we jump into the data, let's go over some of the visualization methods that will help facilitate our exploration.

World Tile Grid Maps

Geographic maps are not perfect visualizations, especially when it's necessary to view and compare several of them. Besides eating up a lot of valuable space, they lead most viewers to incorrectly assign more weight to the data shown in larger regions. Instead, the data associated with each country should be considered as equally weighted observations, i.e. we shouldn't discount or ignore countries just because they're smaller, just as we shouldn't focus solely on others just because they're larger. For this specific purpose, surface area has little to no influence on the number of attacks sourced. We will make use of a world tile grid map, where all countries are represented as squares of equal area, while maintaining the general spatial orientation so it's still recognizable as a map. Since each country is now the same shape, the country codes are added for identification. As an example, here's a world tile version of a map shown in the previous installment -- attack rate per 1,000 people. It's easy to notice the Netherlands as the country with the highest rate of attacks per capita.

Reading a Bivariate Choropleth Map

Typically, normalizing a choropleth map means calculating a rate by dividing the featured variable by some contextual variable. The resulting units of that rate are not always easy to conceptualize and may confuse the viewer. For example, attacks per 1,000 people is an easy unit to make sense of in the real world, whereas attacks per (people per km²), i.e. attacks × surface area / population, might be more difficult to put into a real world context. That doesn't mean that the relationship between attacks sourced and population density shouldn't be explored. A bivariate choropleth avoids cumbersome units by illustrating the relationship without calculating a rate.
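To spell out where that awkward unit comes from, write A for attacks sourced, P for population, and S for surface area in km² (symbols introduced here only for illustration). Population density is P / S, so dividing attacks by density gives:

```latex
\[
\frac{A}{P / S} \;=\; \frac{A \cdot S}{P}
\qquad
\left[\ \frac{\text{attacks} \cdot \text{km}^2}{\text{person}}\ \right]
\]
```

The result is a count multiplied by an area and divided by a head count, which is why it is hard to tie to anything tangible.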

A bivariate choropleth is made possible through relabeling the data using quantile estimates. Quantiles are breakpoints that divide a set of data into sections of equal probability. We'll be using quartile estimates to divide the data into four subsets -- the lowest 25% of values, the second lowest 25% of values up to the median, the second highest 25% of values above the median, and the highest 25% of values. Each observation is labeled according to the quartile that contains its value. Using the quartile labels, each pair of featured and contextual values will be assigned a color from the 2-dimensional color scale shown below.
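The relabeling step can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the maps in this post: the function name `quartile_labels` is made up for the example, and the choice of the "inclusive" quantile method from the standard library is an assumption, since several quantile estimators exist and they can differ slightly at the breakpoints.

```python
from statistics import quantiles


def quartile_labels(values):
    """Label each value 1-4 according to the quartile that contains it.

    Assumption: breakpoints come from statistics.quantiles with the
    "inclusive" method; other estimators may place them slightly differently.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    labels = []
    for v in values:
        if v <= q1:
            labels.append(1)   # lowest 25% of values
        elif v <= median:
            labels.append(2)   # second lowest 25%, up to the median
        elif v <= q3:
            labels.append(3)   # second highest 25%, above the median
        else:
            labels.append(4)   # highest 25% of values
    return labels


# Made-up numbers, purely to show the labeling.
print(quartile_labels([80, 10, 50, 20, 70, 30, 60, 40]))
```

The same function would be applied once to the attack counts and once to the contextual variable, giving each country a pair of labels.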

The hues along the upward diagonal correspond to countries where the number of attacks sourced is in agreement with the contextual variable. The colors above the diagonal, toward the upper-left corner, correspond to countries where the number of attacks sourced is proportionally high, while the colors below the diagonal, toward the lower-right corner, correspond to countries where the number of attacks is proportionally low. If the two variables are significantly correlated, countries filled with yellow (upper-left corner) and purple (lower-right corner) are the extreme outliers.
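As a rough sketch of how a pair of quartile labels maps onto that legend, the Python below places each pair in a 4x4 grid and names the region it falls in. The function names are hypothetical, and the orientation (attacks on the vertical axis, contextual variable on the horizontal axis) is inferred from the description above rather than taken from the original plotting code. Actual colors are left out because the palette values aren't given here.

```python
def legend_cell(attack_q, context_q):
    """Return (row, col) in a 4x4 legend, with row 0 at the top.

    Assumed orientation: attack quartile increases upward,
    contextual quartile increases to the right.
    """
    return 4 - attack_q, context_q - 1


def describe(attack_q, context_q):
    """Say where a pair of quartile labels sits relative to the diagonal."""
    if attack_q == context_q:
        return "in agreement"          # on the upward diagonal
    if attack_q > context_q:
        return "proportionally high"   # above the diagonal, toward upper left
    return "proportionally low"        # below the diagonal, toward lower right


# Top quartile for attacks, bottom quartile for the contextual variable:
print(legend_cell(4, 1), describe(4, 1))
```

With this layout the upper-left cell is (0, 0) and the lower-right cell is (3, 3), the two extremes called out in the text.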

As an example, let's revisit population.


From this map, we can see that most countries in the top quartile for attacks sourced are also in the top quartile for population. However, many countries in Scandinavia and Eastern Europe fall in the top quartile for attacks, but below the median for population. Luxembourg, a country small enough that it would have been lost in a traditional map, stands out as the only country in the top quartile for attacks and the lowest quartile for population. Most of the countries where attacks are proportionally low are developing countries in Africa, with the exception of bright green Liberia.

In the map of attack rates shown previously, the Netherlands stood out as a large outlier, but this bivariate map reveals that the disparity between the two values isn't as great when one considers where the two values fall within their respective distributions. While this perspective does sacrifice precision, it provides an at-a-glance overview of how these variables are related and where any anomalous activity has occurred.

Determining the Appropriate Context

When using normalization for the purpose of identifying outliers, a significant relationship needs to exist between the featured and contextual variables. For example, population maps are often normalized by surface area because regions are rarely of equal size and larger regions naturally have more residents. As a result, normalization reveals regions with an abnormal population density. The first step in deciding whether a variable is appropriate for normalizing the counts of web attacks per country is to determine whether a significant relationship exists between the attack counts and the contextual variable. Previously, we looked at the rate of attacks per population. Let's see if that was a good choice.

Chosen Contextual Variables

The following variables have been selected for the initial exploration:

  • GDP and GDP per capita, current US$ (1)

  • High-technology exports, current US$ and % of manufactured exports (2)

  • Population density, people per square km (3)

  • Number of publications in scientific and technical journals (4)

  • Number of internet users, total and as percentage of population (5)

  • Number of fixed broadband subscriptions (includes any wired high-speed internet connection for residences and organizations, excludes mobile-cellular connections), total and per 100 people (6)

  • Number of secure internet servers (defined as the number of unique, valid, and publicly-trusted TLS/SSL certificates), total and per 1 million people (7)

  • Number of patent applications, from residents, nonresidents, and total (8)

  • Number of trademark applications (9)

Below is a set of scatter plots with total attacks sourced plotted against each variable, along with the Spearman correlation.
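For readers who want to reproduce this kind of check, Spearman's correlation is simply the Pearson correlation of the ranks of the two variables. The sketch below uses only the Python standard library. The function names are our own for this example, and tied values are given the average of their ranks, which is the usual convention but an assumption about how the figures in this post were computed.

```python
from statistics import mean


def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: any strictly increasing relationship has a correlation of 1,
# no matter how non-linear it is.
print(spearman([1, 2, 3, 4], [10, 100, 1000, 10000]))
```

Because only the ordering matters, Spearman's correlation is less sensitive than Pearson's to the handful of countries with extremely large counts.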

Based on the plots and correlations above, GDP (US$), high-technology exports (US$), scientific publications, internet users, fixed broadband subscriptions, secure internet servers, patent applications by residents, and trademark applications (the panels with shaded backgrounds) all have a significant positive association with the number of attacks sourced. Notably, the strength of those associations decreases dramatically when the variables are transformed to a rate or proportion. The transformation of total population to population density causes the correlation with attacks sourced to drop to almost zero.

These decreases in correlation aren't particularly surprising. For instance, the top 10 countries with regard to both the number of attacks sourced and the number of fixed broadband subscriptions include China and India, yet both have a low density of fixed broadband subscriptions -- most likely due to their large rural populations. However, the impact of a large number of broadband connections on the number of attacks sourced is in no way diminished just because a large portion of the population may not have access to them.

It should also be noted that the variables most correlated with the number of attacks are also highly correlated with each other, as observed in the plot matrix below. This suggests that all of these observed variables are heavily dependent on some latent variable.

For now, we'll focus on variables that have a strong association with the number of sourced attacks and would yield sensible units if used to calculate an attack rate: fixed broadband subscriptions, secure internet servers, and internet users.

Since all three variables have a strong association with the number of attacks, most of the countries are filled with the colors along the upward diagonal, denoting a general agreement between attacks sourced and each contextual variable. The most notable and consistent outlier is Liberia. It's the only country that falls in the top quartile for attacks and the bottom quartile for all three contextual variables. Compared to the top sources, Liberia's attack total of approximately 931,000 is relatively minuscule, but given some context it's a much larger source than would be expected. We can't discern the exact cause of this anomaly, but given its small population of less than 5 million, this unexpected surge is most likely due to a small number of effective adversaries with the means of executing these attacks. Liberia's internet infrastructure was targeted by the Mirai botnet in November of 2016, so it's also possible that some adversaries have targeted the country again and made use of compromised devices within its borders to execute some of these attacks.
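Once every country has quartile labels, spotting this kind of consistent outlier amounts to a simple filter. The sketch below is illustrative only: the function name, the dictionary layout, and the placeholder codes "AA", "BB", and "CC" are all made up and do not represent real countries or real data.

```python
def consistent_outliers(labels, featured="attacks"):
    """Return codes in the top quartile for the featured variable and in the
    bottom quartile for every contextual variable."""
    found = []
    for code, quartiles in labels.items():
        context = [q for name, q in quartiles.items() if name != featured]
        if quartiles[featured] == 4 and context and all(q == 1 for q in context):
            found.append(code)
    return sorted(found)


# Hypothetical quartile labels (1 = lowest quartile, 4 = highest).
labels = {
    "AA": {"attacks": 4, "broadband": 1, "servers": 1, "users": 1},
    "BB": {"attacks": 4, "broadband": 4, "servers": 3, "users": 4},
    "CC": {"attacks": 4, "broadband": 1, "servers": 2, "users": 1},
}
print(consistent_outliers(labels))  # → ['AA']
```

Requiring the bottom quartile on every contextual variable is deliberately strict. A looser rule, such as a gap of two or more quartiles, would surface more candidates for a closer look.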

We could not have gleaned this insight from attack counts alone. These bivariate maps shouldn't replace a map of raw counts, but should be used in conjunction with one. Knowing which countries are the largest sources is still vital information, even if the results are exactly as expected.

For the next and final installment in our survey of potential normalizing variables, we will shift our focus from external data to variables that provide context that is specific to Akamai and the attack data itself.


(1) The World Bank: World Development Indicators: World Bank national accounts data, and OECD National Accounts data files

(2) The World Bank: World Development Indicators: United Nations, Comtrade database through the WITS platform

(3) The World Bank: World Development Indicators: Food and Agriculture Organization and World Bank population estimates

(4) The World Bank: World Development Indicators: National Science Foundation, Science and Engineering Indicators

(5) The World Bank: World Development Indicators: International Telecommunication Union, World Telecommunication/ICT Development Report and database

(6) The World Bank: World Development Indicators: International Telecommunication Union, World Telecommunication/ICT Development Report and database

(7) The World Bank: World Development Indicators: Netcraft (netcraft.com)

(8) The World Bank: World Development Indicators: World Intellectual Property Organization (WIPO), WIPO Patent Report: Statistics on Worldwide Patent Activity

(9) The World Bank: World Development Indicators: World Intellectual Property Organization (WIPO), World Intellectual Property Indicators and www.wipo.int/econ_stat