Written by Avi Aminov and Or Katz
Imagine you are standing in the middle of a crowded train station and want to have a private conversation with an old friend. You've been waiting for the perfect time to contact him and get some advice on how to move forward with some important life choices.
But you couldn't wait any longer, and now you're on a train platform. There are many people around you. They're watching every move you make and listening to each word you say. You really, really need this conversation to be private!
If it sounds challenging, well, it is! This exact situation is what malware developers are facing when deploying their malicious programs. They need to make sure that the malware is able to contact the Command and Control (CnC) server to get instructions and commands, send regular keep-alives, and exfiltrate sensitive data. This communication must be reliable while preventing others from listening or hijacking their calls.
Yet, as challenging as it sounds, the bad guys found a solution. They created a communication mechanism that abuses the agile and flexible infrastructure of the Domain Name System (DNS). Pieces of the malware code are dedicated to producing a seemingly endless number of new domain names daily, and each of those randomly generated domains has the potential to become the address of the server that controls the entire malicious botnet - just waiting to be activated on doomsday.
This well-known technique empowers malware developers to build scalable malware botnets that live longer. In fact, it's being used by the majority of modern malware and is known as Domain Generation Algorithms (DGA).
In this article, we will tell the story of DGA. We'll explain when it was first introduced. We'll discuss how it's being used in the wild and what challenges defenders face, and finally we'll address how we can fight back using machine learning and behavioral algorithms.
Historical Overview: Cat and Mouse
DNS plays a significant role in how malware communicates with its CnC server. The main purpose of the DNS protocol is to act as a translation layer between server or application names and server addresses. It creates a many-to-many relationship between computing resources and application names. DNS was intentionally designed as a protocol that allows for easy and frequent reassignment of domain names to different computer services and devices, on different hosting platforms, running in different countries. Those same DNS capabilities, once exploited by the operators of malicious CnC servers, can help them remain evasive and transient while InfoSec defenders try to track them and take them down.
It was obvious that the next step in this game of cat and mouse would be taken by InfoSec defenders. By reverse engineering malware and extracting the domain names being used for communication with CnC servers, InfoSec defenders could take over CnC servers, intercept dedicated CnC traffic, and terminate connections.
It was just a matter of time until the bad guys returned to the game stronger than ever. And they did, with a DGA communication technique that created an overwhelming amount of work for InfoSec defenders.
The DGA communication technique we are talking about was first introduced to the world with the emergence of the notorious Conficker malware back in 2008. The first variant of Conficker (variant A) generated 250 different domain names each day, using the date as a seed so that every malware instance would generate the same random domain names on any given day. The FBI retaliated by using a reverse-engineered piece of Conficker to register all(!) the domains the malware was using before the malware operators had the chance to do so themselves. As the Conficker malware evolved, introducing four more variants (B to E), its randomization techniques evolved as well. Variant C generated over 50,000 domains per day, of which only 500 would be randomly chosen for attempts to communicate with the CnC server.
The ability to generate a seemingly endless number of random domain names daily makes it nearly impossible for InfoSec defenders to take over the malicious domains.
When the bad actors who developed the malware want to control it, they only need to register one of the domain names expected to be generated. At that point, botnet members will start communicating and authenticating with that newly activated CnC server.
Technical Overview: How are the Domains Generated?
It is time for us to briefly present how the domains are generated. We will describe the flow for Conficker variant C - E. Naturally, things have evolved since the introduction of these variants, but the essence of the method is roughly the same.
First, the current system date is read and is used as a random seed for a pseudo-random number generator (PRNG). This enables the malware to generate different domain names each day.
For each domain, a length of 5-8 characters is randomly selected. That many characters are then randomly drawn from a-z and assembled into a name. Finally, a domain extension (TLD) is randomly selected and appended.
Once the list of domains is available, the malware randomly selects a portion of them and tries to contact every one of them.
Figure 1 - Conficker (variant C - E) domain generation flow
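The flow above can be illustrated with a minimal Python sketch. It is not Conficker's actual code: the TLD list, the way the date is turned into a seed, and the function names are assumptions made purely for illustration. It only shows the general idea of a date-seeded generator that produces the same list on every instance.

```python
import random
import string
from datetime import date

# Hypothetical TLD list, for illustration only.
TLDS = ["com", "net", "org", "info", "biz"]


def generate_domains(day, count=10):
    """Return `count` pseudo-random domains; the same day always yields the same list."""
    rng = random.Random(day.toordinal())  # seed derived from the date (an assumption)
    domains = []
    for _ in range(count):
        length = rng.randint(5, 8)  # 5-8 characters, inclusive
        name = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        domains.append(name + "." + rng.choice(TLDS))
    return domains


def pick_subset(domains, how_many, rng=None):
    """Randomly choose which of the generated domains to try to contact."""
    rng = rng or random.SystemRandom()
    return rng.sample(domains, min(how_many, len(domains)))


if __name__ == "__main__":
    todays = generate_domains(date.today(), count=20)
    print(pick_subset(todays, 5))
```

Because the seed depends only on the date, anyone who knows the algorithm, whether operator or defender, can compute the same daily list in advance. The subset selection deliberately uses a separate, unseeded source of randomness so that each instance contacts a different portion of the list.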
Other DGA variants use various techniques to generate DGA domains such as:
- Fixing some of the characters - for example, fixing the second character to 'h'
- Drawing syllables instead of single characters - this generates readable, yet still nonsensical domain names
- Drawing words from a dictionary instead of characters and syllables
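As a rough sketch of these three variations, the snippet below builds one name of each kind. The syllable and word lists, the function names, and the default parameters are all hypothetical and chosen only to make the example self-contained; real malware families use their own tables and rules.

```python
import random
import string

# Hypothetical building blocks, for illustration only.
SYLLABLES = ["ba", "ko", "mi", "ter", "lo", "van", "su", "ne"]
WORDS = ["blue", "river", "stone", "market", "cloud", "garden"]


def fixed_char_name(rng, length=8, position=1, char="h"):
    """Random a-z name with one position pinned to a fixed character."""
    letters = [rng.choice(string.ascii_lowercase) for _ in range(length)]
    letters[position] = char  # position 1 is the second character
    return "".join(letters)


def syllable_name(rng, syllables=3):
    """Pronounceable but nonsensical name built from syllables."""
    return "".join(rng.choice(SYLLABLES) for _ in range(syllables))


def dictionary_name(rng, words=2):
    """Readable name built by concatenating dictionary words."""
    return "".join(rng.choice(WORDS) for _ in range(words))


if __name__ == "__main__":
    rng = random.Random(2024)  # arbitrary seed
    print(fixed_char_name(rng), syllable_name(rng), dictionary_name(rng))
```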
Deep Dive: Detecting DGA Activity
Research done by the Akamai Enterprise Security Research Team spotlights the challenges associated with the process of tracking and eliminating DGA activity.
The case study below started with the analysis of four different, seemingly unrelated small networks suspected to be part of a bigger botnet.
The first step of the analysis was to generate a network graph representing the connectivity between each network (green) and the domains it accessed (purple), in order to visualize the relationships between those networks.
The domains being accessed belonged to a variety of industries across the Internet: social media, news portals, advertisement, and entertainment. Even Akamai's edge servers were being accessed.
It is interesting to point out that security vendors' APIs were also accessed by these networks, meaning there were devices being protected by endpoint security products on at least one of the four analyzed networks.
Figure 2 - Entire network graph - green == users, purple == domains
The second step of the analysis was to run a set of algorithms that would help us identify DGA activity. Some examples of the algorithms that were used:
- Lexicographical order: a machine learning algorithm that predicts the probability of a domain or a set of domains being randomly generated by looking at the set of characters being used for the domain names.
- Domain similarities: a set of graph algorithms that measure similarity between domains based on their properties and connectivity features, and then cluster domains that share similar properties and features.
These behavioral algorithms gave us the ability to identify unfamiliar DGA activity and even catch new malware variants that weren't seen before.
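To give a feel for what "looking at the set of characters" can mean, here is a minimal sketch of lexical feature extraction. It is not the algorithm used in the research described here: the chosen features (length, Shannon entropy, vowel ratio, digit ratio) and the function names are assumptions, and in practice such features would feed a trained classifier rather than be judged on their own. The sketch also assumes the input is a registered domain of the form name.tld and only inspects the leftmost label.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy, in bits per character, of the characters in `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lexical_features(domain):
    """Simple character-level features of the leftmost label of `domain`."""
    label = domain.lower().split(".")[0]
    if not label:
        raise ValueError("domain has an empty leftmost label")
    vowels = sum(ch in "aeiou" for ch in label)
    digits = sum(ch.isdigit() for ch in label)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "vowel_ratio": vowels / len(label),
        "digit_ratio": digits / len(label),
    }


if __name__ == "__main__":
    for name in ["google.com", "xkqzvbtw.com"]:
        print(name, lexical_features(name))
```

Randomly drawn character strings tend to have higher entropy and fewer vowels than names chosen by people, which is why features like these are a common starting point. They are much less useful against the dictionary-based variants.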
Here is the same network activity, but this time all the domains that were identified as randomly generated are labeled as DGA domains.
Figure 3 - Entire network graph, DGA labeled - green == users, purple == benign domains, red == DGA domains
Validating the identified DGA domains across different data sources revealed that this was the activity of a well-known Windows malware named Bedep. This malware establishes a connection with its CnC server, opens a backdoor on the compromised computer, and downloads additional files from external resources.
When visualizing only the activity of the DGA-labeled domains, we can see access to more than 250 different domains. The image below shows the tightly clustered relationship between the infected resources and the DGA domains, and how the majority of domains were accessed by more than one infected user.
Figure 4 - DGA Network graph. All the domains here are DGA domains
Example of the DGA domains that were accessed:
In the case study presented above, the analyzed malware used a "simple" DGA technique: the number of generated domains is limited to a few hundred, and all the infected machines try to reach the same generated domains.
But the challenges in front of us are much harder: we need to fight DGA techniques that generate thousands of domains each day. Furthermore, the malware on each infected machine randomly chooses to access only a small fraction of the generated list, which limits the domain intersection between different infected machines.
Figure 5 - DGA Network graph with partial overlap
In the image above, we can see the yellow nodes that represent users and the blue nodes that represent DGA domains belonging to the same campaign. The partial overlap between domains across different users shows the challenge that defenders face in tagging the users as infected with the same single campaign.
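A short simulation shows how thin that overlap can be. Borrowing the Conficker variant C figures quoted earlier purely as an illustration (500 domains chosen out of 50,000), and assuming each machine picks uniformly and independently, two infected machines are expected to share only 500 x 500 / 50,000 = 5 domains. The function names and the use of Jaccard similarity as the overlap measure are choices made for this sketch.

```python
import random


def sample_contacted(pool_size, picks, rng):
    """Indices of the generated domains that one infected machine tries to contact."""
    return set(rng.sample(range(pool_size), picks))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expected_shared(pool_size, picks):
    """Expected number of domains shared by two independent uniform samples."""
    return picks * picks / pool_size


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so results vary between runs
    machine_a = sample_contacted(50_000, 500, rng)
    machine_b = sample_contacted(50_000, 500, rng)
    print("expected shared domains:", expected_shared(50_000, 500))
    print("observed shared domains:", len(machine_a & machine_b))
    print("Jaccard similarity:", jaccard(machine_a, machine_b))
```

With so few shared domains per pair, linking machines to the same campaign requires aggregating evidence across many machines rather than comparing any two directly.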
Moreover, there are other variants that use vocabularies and connect random words together to create domain names. These are harder to detect algorithmically because the domains look readable and sometimes even make sense to the point that a human could not tell the difference between benign and malevolent (random) domain names.
To fight these new challenges, we need to gain wider visibility of infected machines and networks, break down the behavior of networks to the smallest pieces, and build algorithms that will find relationships and similarities that humans can't detect on network graphs.
Another issue presented in this case study is related to the challenges that we face when fighting infected devices in our network. Endpoint security is a great solution for protecting against threats such as Bedep, but today's networks include a large variety of devices. Some just can't run endpoint security (IoT); others are visitors (BYOD).
Having visibility into all network activity empowers us to find those blind spots and add another layer of security.
It's Time to Fight Back
The overwhelming number of DGA domains being generated by an endless number of new malware variants and campaigns creates a challenge for all of us in the InfoSec defenders community.
It's time to fight back. Instead of being reactive to the threats that are in front of us, we need to become proactive. We need to design and build algorithms that can identify those randomly generated domains and detect abnormal DNS queries to CnC servers. Moreover, we need to leverage visibility into diverse and continuous DNS data, using crowdsourcing techniques to look for patterns of distributed malware across different regions, states, and countries.
It's time to outsmart threat actors that enhance DGA techniques and challenge our detection capabilities; it's time to get a new point of view on the problem we are facing. We need to find new paths for resolution, ones that will detect malicious communication and cut malware longevity short.
As a community, InfoSec defenders need to work together, share information, and ensure we are taking all the right actions that will allow us to win this fight, at least until the next round...
To learn more, reach out to your account team or visit https://www.akamai.com/dns.