Real-world data on how adding DNS data to a deep learning model increases its effectiveness
By Yael Daihes & Craig Sprosts
These days, big data and machine learning are topics of frequent discussion within the security community. While the idea that machine learning algorithms prosper with access to more data is hardly a revelation, we wanted to dig deeper and conduct an experiment using global DNS traffic. More specifically, how helpful is anonymized DNS data from ISPs when it comes to identifying threats in our enterprise customers' networks?
Akamai receives more than 4 Terabytes per day of anonymized DNS data from Internet service provider (ISP) networks in every region of the world. Research and data science teams at Akamai apply machine-learning techniques against this DNS data in order to identify new domains used for botnet command and control (C2) servers.
For example, Akamai's enterprise security research team implemented an algorithm based on a long short-term memory (LSTM) architecture that uses a neural network to discover C2 domains generated by Domain Generation Algorithms (DGAs). The neural network was trained to distinguish algorithmically generated domains from benign domain names. This approach lets us move past the previous common practice of manually extracting lexical features, while exceeding its performance.
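To make the setup concrete, here is a minimal sketch of the kind of preprocessing a character-level LSTM classifier typically requires: each domain name is mapped to a fixed-length sequence of character indices before it reaches the embedding and LSTM layers. The alphabet, padding scheme, and maximum length below are illustrative assumptions, not details of Akamai's actual model.

```python
# Illustrative preprocessing for a character-level DGA classifier.
# Alphabet and max length are assumptions for this sketch, not
# Akamai's production values.

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."
CHAR_TO_INDEX = {c: i + 1 for i, c in enumerate(ALPHABET)}  # 0 reserved for padding
MAX_LEN = 63  # a single DNS label is at most 63 characters (RFC 1035)

def encode_domain(domain: str, max_len: int = MAX_LEN) -> list[int]:
    """Map a domain name to a fixed-length list of character indices.

    Characters outside the alphabet map to 0 (the padding index);
    sequences are truncated or right-padded to max_len.
    """
    indices = [CHAR_TO_INDEX.get(c, 0) for c in domain.lower()[:max_len]]
    return indices + [0] * (max_len - len(indices))

# A DGA-style name and a benign one produce equal-length vectors,
# ready for an embedding layer followed by LSTM cells.
print(len(encode_domain("xjkq7w2rnb9v.example")))
print(encode_domain("akamai.com")[:10])
```

With inputs in this form, the network can learn character-sequence patterns directly, rather than relying on hand-crafted lexical features such as entropy or n-gram scores.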
As with any deep learning model, the neural network's effectiveness depends on access to data: the more real-world data the algorithm can use to identify DGA names, the better it is at discovering new malicious domains. We measured the impact that adding recursive DNS data had on the output of the model. The number of algorithmically generated domains identified by the model rose from an average of approximately 100,000 per week to more than 2 million per week. More importantly, we measure the impact of our output by the number of "events," where each event is a query sent by Akamai's Enterprise Threat Protector (ETP) customers to a domain identified as a DGA domain by the LSTM algorithm. Each event is an indicator of compromise (IOC) that effectively signals to an enterprise that a device is infected.
Prior to January 10, 2019, Akamai was using a sample of approximately 13.5 billion queries per day to predict malicious names from traffic. Starting on January 10, Akamai added an additional 33 billion DNS queries per day from ISPs in major regions around the world. Regional diversity of the data was critical, as the model began detecting new malware variants and DGA behaviors that were previously undetected when using largely North American data.
As shown in Figure 1, the additional DNS data had an immediate impact on the effectiveness of the model. The number of malicious queries detected by the algorithm went from around 11,000 per day to an average of over 400,000 per day for the following 2 weeks. After a period of elevated activity, the number of events then dropped back to an average of closer to 100,000 events per day, still almost an order of magnitude more than prior to January 10th.
Figure 1: Security events vs. DNS traffic data over time
Another way to understand the impact of adding DNS data is to chart the volume of queries vs. the number of LSTM events each day, as shown in Figure 2. When the number of DNS queries processed per day increases, you would expect to see the number of LSTM events also increase as the model becomes more effective. This plot shows the number of events is significantly higher at the higher query volumes (>40 billion queries per day) than the lower query volumes (<15 billion queries per day).
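The Figure 2 analysis amounts to correlating daily query volume with daily event counts. A minimal sketch of that calculation is below; the daily figures are synthetic placeholders chosen to resemble the ranges quoted in this article, not Akamai's actual telemetry.

```python
# Sketch of the Figure 2 analysis: correlate daily DNS query volume with
# daily LSTM event counts. All numbers are synthetic placeholders.

import statistics

queries_billions = [13, 14, 12, 45, 50, 53]     # daily queries (assumed)
events_thousands = [11, 12, 10, 380, 420, 400]  # daily events (assumed)

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(queries_billions, events_thousands)
print(round(r, 3))  # near 1.0 here: events track query volume
```

A coefficient near 1.0 on real data would support the claim that higher query volumes drive higher event counts, while day-to-day residuals point at the other factors discussed below.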
Figure 2: Security events vs. DNS traffic volume
                     Avg. queries/day (billions)   New names/week   Events/week
Before adding data   13.0                          ~100,000         78,498
After adding data    53.7                          >2,000,000       1,489,562
While the number of security events identified is clearly higher when more data is used to feed the model, it's also evident that the number of events per day varies substantially based on factors other than query volume. Is this variance the result of the emergence of new botnets, changes in behavior from existing botnets, or something else? Algorithms such as the LSTM detect phenomena based on signals in the data, and the increase in data generated not only more signals but also unique, previously unknown signals that surfaced because of the diversity of the new data. Akamai's research team analyzed these phenomena to find explanations, and hopefully new, unknown botnets.
In our judgment, the large increase in events is attributable to the detection of a high-volume botnet that was discovered solely because of the new data from major regions around the world. Akamai Security Research has also observed several other interesting changes during this time period, including increases in malware families such as Ramnit and Mylobot, and other emerging botnet activity.
For now, it's clear to us that the volume and variety of recursive DNS data matter. In a world where malicious domains and websites are used in attacks for hours or even minutes, using machine learning to identify and block threats as soon as they emerge is a necessity, and the success of these models depends on access to the right data.
This experiment demonstrates just how important diverse, large-scale data is for detecting unknown and impactful threats. This is just one of several areas where Akamai is combining its one-of-a-kind data from our Carrier, Enterprise, and Web security teams to better protect our customers. We look forward to sharing more about other efforts and findings in this area in the coming months.