Purpose of Normalization
Data without context is arguably useless. If some variable of interest has a strong and inherent relationship with another, little understanding of the system can be gained if that relationship isn't considered. This consideration is just as integral to data visualization as it is to analysis. The purpose of any good visualization is to create a useful and insightful perspective of the data that quickly gives the viewer a better understanding of both how the data varies and, more importantly, why it varies in that manner. By neglecting confounding factors, a univariate visualization can only show the how and not the why.
This is especially true for geographic visualizations like choropleth maps, which are similar to heat maps but strictly follow geopolitical boundaries such as national borders. A multitude of variables characterize a geographic region -- physical, demographic, economic, cultural -- and they are bound to have some effect on the variable of interest. A map is simply a coordinate system that gives the viewer only the orientation and area of each region, and even those are often skewed by the projection. A classic pitfall of choropleth maps is that they end up resembling a population map if the featured variable, or the method by which it was measured, is influenced by the number of people occupying each region. To remedy this, the featured variable can be normalized by calculating and plotting the rate per capita. Rescaling the variable into a rate yields a more accurate view of the variance between regions by providing a new metric that describes the density of observations instead of a simple count. Normalization can reveal the severity of a measurement by including some relevant context.
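As a minimal sketch of per capita normalization, the Python snippet below converts raw counts into a rate per 1,000 residents. The region names, counts, and populations are made up for illustration, and the function name rate_per_1000 is hypothetical.

```python
# Hypothetical figures for illustration only; these are not real measurements.
attacks = {"A": 60_000, "B": 50_000, "C": 1_200}
population = {"A": 100_000_000, "B": 2_000_000, "C": 40_000}


def rate_per_1000(count, residents):
    """Return the number of observations per 1,000 residents."""
    if residents <= 0:
        raise ValueError("residents must be positive")
    return count / residents * 1000


# Normalize every region's raw count by its population.
rates = {
    region: rate_per_1000(attacks[region], population[region])
    for region in attacks
}

# Rank the regions by raw count and by rate to see how the ordering changes.
by_count = sorted(attacks, key=attacks.get, reverse=True)
by_rate = sorted(rates, key=rates.get, reverse=True)

for region in by_rate:
    print(region, attacks[region], round(rates[region], 2))
```

In this made-up data, the region with the largest raw count has the smallest rate, which is exactly the kind of reversal a raw-count map hides.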
Cyber attacks originate worldwide, and maps are well suited for illustrating the overall distribution of their apparent sources (apparent because there are many methods for an adversary to conceal their true IP address and location). There is a large variance between countries, which makes sense. Someone waging an attack requires a certain level of motivation and access to resources. For example, it's not surprising that the U.S. is consistently the largest producer of web attacks and that North Korea is one of the smallest producers. The U.S. has hundreds of millions of residents, many with easy access to the required technology and knowledge as well as a socioeconomic status that allows them the time and effort to put toward an attack. Conversely, North Korea has fewer than 30 million residents, and virtually none of them have the same luxuries that would allow one to even attempt an attack.
The variance between the U.S. and North Korea is expected, but that expectation requires a lot of information about each country. More importantly, if North Korea were producing a significant number of attacks, it would only be considered anomalous when put into context. Similarly, it's difficult to discern whether the U.S. is over- or underproducing attacks given its means. Maybe the capacity for these attacks is much higher and yet to be realized. A raw count doesn't reveal the severity or density of observed attacks within the potential capacity. Akamai has previously used choropleth maps to report the raw counts of web attacks sourced from each country. By adding some context to them via normalization, the severity of attacks and abnormalities can be more easily identified.
Challenges & Caveats
The number of attacks sourced from each country is a very tricky variable to normalize. The ideal normalizing variables would be the number of active adversaries and the number of potential adversaries. The first is unknown and very hard to determine since adversaries work hard to remain anonymous. The second is pretty abstract. Of all the factors behind someone attempting and executing an attack, the most influential are personal beliefs and motivations rather than know-how and access to resources, and beliefs and motivations are incredibly difficult to measure accurately for an individual, let alone an entire populace. Some anthropologists and social psychologists have spent entire careers trying to quantify the cultural values of countries, and even the best models created so far (e.g. Geert Hofstede's cultural dimensions theory) are not without criticism. Plus, not all adversaries act under the same motivations. Some act in the pursuit of activism, others to gain financial or political power, and some are literally just kids looking for fun.
As briefly mentioned earlier, the data pertaining to the source country is not 100% reliable. There are multiple methods an attacker can use to conceal their true IP address. The granularity of the data is pretty coarse at the country level, which allows for some margin of error. Still, it's just as easy for an attacker to appear to be from another country as it is to appear to be from a neighboring city. Even so, this guaranteed inaccuracy may be useful in examining any anomalies that are found.
Benefits of a Bivariate Perspective
According to Edward Tufte, "graphical excellence is nearly always multivariate." To demonstrate this principle, consider the following univariate choropleth maps, one featuring total population and the other featuring the total number of attacks produced.
At first glance, the two measures seem to be in agreement, i.e. more attacks are seen from countries with a higher population. Nonetheless, determining which countries have a higher or lower rate of attacks per person is tedious and imprecise. It requires the viewer to constantly shift their attention between the maps and make comparisons using only perceived changes in color.
Now, let's combine both measures and plot the rate of attacks per 1,000 residents.
It's now clearer which countries have a disproportionately high or low rate of attacks. For example, the Netherlands stands out in bright yellow as a country with a very high rate of attacks per person. However, even though two variables were used to calculate the rate, this visualization is arguably univariate. The map describes the distribution of a single random variable, the rate of attacks per 1,000 people. A better visualization would describe the joint distribution of two random variables -- attacks produced and population. The pair of plots below illustrates how a multidimensional perspective is more revealing.
This density plot of observed attack rates can be thought of as a smooth histogram. It describes the relative frequency of observed attack rates. The x-axis represents the observed attack rate and the area under the curve between two x-axis values is the estimated probability of observing an attack rate within those values. The peak shows that most of the observed rates are centered around 10 attacks per 1,000 people. While this plot describes the distribution of the attack rates, it does little to illustrate the relationship between the number of attacks sourced and population.
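To make the "area under the curve" reading concrete, here is a rough sketch of a one-dimensional Gaussian kernel density estimate in plain Python. The sample rates, the bandwidth of 2.0, and the helper names gaussian_kde and probability_between are all assumptions chosen for illustration; this is not the data or the method behind the plot above.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x using Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def probability_between(density, lo, hi, steps=1000):
    """Approximate the area under the curve on [lo, hi] with the trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.5 * (density(lo) + density(hi))
    total += sum(density(lo + i * width) for i in range(1, steps))
    return total * width


# Hypothetical attack rates per 1,000 residents, for illustration only.
rates = [2.0, 6.5, 8.0, 9.5, 10.0, 10.5, 11.0, 12.5, 14.0, 30.0]
density = gaussian_kde(rates, bandwidth=2.0)

# Estimated probability of observing a rate between 8 and 12 per 1,000.
p_mid = probability_between(density, 8.0, 12.0)
# Over a range wide enough to cover every kernel, the area approaches 1.
p_all = probability_between(density, -20.0, 60.0)
print(round(p_mid, 3), round(p_all, 3))
```

The bandwidth controls how smooth the curve is: a smaller value follows individual observations more closely, while a larger one blurs them together.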
In this 2-dimensional density plot, the same peak from the 1-dimensional plot is evident but it further illustrates that the number of attacks sourced generally increases with population.
A bivariate choropleth map can help achieve this multidimensional view while maintaining the familiar format. It uses quantile estimates of two variables and a 2-dimensional color scale to show where the variables are in agreement, where they differ, and how they differ. As an added bonus, representing the data using quantiles or percentiles offers benefits similar to those of a logarithmic scale while being more accessible to a wide audience.
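The classification step behind such a map can be sketched as follows. This is one simple way to do it, assuming three rank-based classes (terciles) per variable. The country labels and figures are invented, and quantile_class is a hypothetical helper rather than a library function.

```python
def quantile_class(values, n_classes=3):
    """Assign each key a class from 0 to n_classes - 1 based on its rank.

    Ties are broken by sort order, which is good enough for a sketch.
    """
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    return {key: rank * n_classes // n for rank, key in enumerate(ordered)}


# Hypothetical figures for illustration only; these are not real measurements.
population = {"A": 1_000, "B": 5_000, "C": 20_000,
              "D": 80_000, "E": 300_000, "F": 1_000_000}
attacks = {"A": 40, "B": 9_000, "C": 700,
           "D": 2_500, "E": 100, "F": 60_000}

pop_class = quantile_class(population)
atk_class = quantile_class(attacks)

# Each country gets a (population class, attack class) pair. A 2-dimensional
# color scale would then map each of the nine possible pairs to one color.
bivariate = {key: (pop_class[key], atk_class[key]) for key in population}

labels = ["low", "mid", "high"]
for key, (p, a) in sorted(bivariate.items()):
    print(key, "population:", labels[p], "attacks:", labels[a])
```

Pairs on the diagonal, such as (0, 0) or (2, 2), mark countries where the two variables agree. Off-diagonal pairs such as (0, 2), a small population with many attacks, are the ones worth a closer look.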
In the coming weeks, several variables and how they relate to the number of web attacks sourced will be visually explored to better understand their effect on the number of attacks and help determine which are best suited for use in normalization.