Additional research and information provided by Asaf Nadler
Recent changes to the Pykspa v2 domain generation algorithm (DGA) have made it more selective. Akamai researchers have tracked these changes and believe that part of the reason for selective domain generation is to enable attackers to keep a smaller footprint online, and remain undetected for longer periods. However, it is still possible to brute-force the DGA and track the domains. In this post, we'll explain the research and how it can be replicated.
Pykspa is a worm that spreads via Skype by sending messages to other Skype users with download links. Once downloaded, Pykspa extracts personal information and communicates with its command and control servers (C2) using a domain generation algorithm (DGA).
By tracking DNS queries made by hosts on Akamai's carrier DNS traffic to Pykspa's algorithmically generated domains (AGDs), we know that there are at least 10,000 infected hosts worldwide that query Pykspa's AGDs on a daily basis, and potentially more, which makes it a widespread threat (Figure 1).
Figure 1: Akamai's carrier DNS traffic indicates that there's a baseline of at least 10k carrier machines that access Pykspa algorithmically generated domains on a daily basis
The Pykspa DGA has two publicly known versions, referred to as v1 and v2. The second and latest version of Pykspa generates domain names that are 6-12 characters long and end with the ".com", ".net", ".org" or ".info" top-level domains (TLDs). Akamai's enterprise security group recently identified domain names that are consistently queried in DNS traffic and use the same domain name lengths and TLDs as the Pykspa v2 DGA, but with a slight change: only a subset of the generated domains is ever queried in DNS traffic.
Identifying Pykspa v2 domain names in DNS traffic
The original Pykspa v2 DGA works iteratively. In every iteration, the DGA outputs a single AGD and the seed and domain length for the next iteration. Formally, Pykspa v2 DGA is a function that maps a seed and a domain length to a single domain.
The cardinality of the input space is the number of possible 32-bit seeds times the number of possible lengths (7), which is roughly 30 billion inputs. The cardinality of the output space is the number of possible strings of English letters at any of the available lengths, which is more than 3 million times larger than the input space. Because the output space is so much larger than the input seed space, only a negligible number of outputs are mapped to by more than a single input, and thus we treat the mapping of a seed to a domain name as an "almost" injective function. This observation allows us to generate every domain the DGA can produce by brute-forcing the input seed space.
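The counting argument above can be checked with a few lines of arithmetic. This is a minimal sketch; it assumes the alphabet is the 26 lowercase English letters and ignores the TLD when counting outputs.

```python
# Input space: every 32-bit seed paired with each of the 7 lengths (6..12).
SEED_BITS = 32
LENGTHS = range(6, 13)
input_space = (2 ** SEED_BITS) * len(LENGTHS)

# Output space: every string over the alphabet at each allowed length.
# Assumption: 26 lowercase English letters, TLD not counted.
ALPHABET_SIZE = 26
output_space = sum(ALPHABET_SIZE ** n for n in LENGTHS)

ratio = output_space / input_space

print(f"inputs:  {input_space:,}")
print(f"outputs: {output_space:,}")
print(f"outputs per input: {ratio:,.0f}")
```

The input space comes out at about 30 billion, and the ratio is a little over 3 million, in line with the figures quoted above.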
Using a technique similar to one we've previously discussed, we brute-forced the seed space to generate a set of potential Pykspa v2 domains. We then cross-referenced this set of possible domains with a set of tens of millions of unknown AGDs that were observed in our DNS traffic.
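The brute-force and cross-reference steps can be sketched as follows. The Pykspa v2 algorithm itself is not reproduced here: toy_step is a hypothetical placeholder that only has the same shape (seed and length in; domain, next seed and next length out), and the function names are illustrative assumptions rather than anything from the original tooling.

```python
from typing import Callable, Dict, Iterable, Iterator, Set, Tuple

# A DGA step maps (seed, length) -> (domain, next_seed, next_length).
Step = Callable[[int, int], Tuple[str, int, int]]

def toy_step(seed: int, length: int) -> Tuple[str, int, int]:
    """Placeholder, NOT the Pykspa v2 algorithm: a deterministic stand-in."""
    tlds = ("com", "net", "org", "info")
    state = seed
    letters = []
    for _ in range(length):
        state = (state * 1103515245 + 12345) & 0xFFFFFFFF
        letters.append(chr(ord("a") + (state >> 16) % 26))
    domain = "".join(letters) + "." + tlds[state % 4]
    next_seed = (seed + 1) & 0xFFFFFFFF
    next_length = 6 + (state >> 8) % 7
    return domain, next_seed, next_length

def candidate_domains(step: Step, seeds: Iterable[int],
                      lengths: Iterable[int] = range(6, 13)
                      ) -> Iterator[Tuple[str, int]]:
    """Yield (domain, seed) for every (seed, length) input pair."""
    lengths = list(lengths)
    for seed in seeds:
        for length in lengths:
            domain, _, _ = step(seed, length)
            yield domain, seed

def cross_reference(step: Step, seeds: Iterable[int],
                    observed: Set[str]) -> Dict[str, int]:
    """Map each observed domain the DGA can produce to a seed that yields it."""
    hits: Dict[str, int] = {}
    for domain, seed in candidate_domains(step, seeds):
        if domain in observed:
            hits[domain] = seed
    return hits
```

With the real step function substituted for toy_step, seeds would range over the full 32-bit space. Pure Python is far too slow for that many inputs, so a production run would need a compiled or parallel implementation; the structure stays the same.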
This results in several tens of thousands of domain names that may have been produced by Pykspa v2 and were observed in DNS traffic. We performed several additional steps to increase our confidence that these are indeed Pykspa v2 domains:
Domains must be queried alongside at least 50 other domains within the same hour and by the same user. Pykspa v2 generates at least 200 domain names on a daily basis, thus observing a user that makes at least 50 DNS requests to Pykspa v2 domains provides us with sufficient confidence.
The sum of differences between consecutive seeds of the domains queried by the same user within an hourly time frame must not exceed the square of the number of domains, i.e., seeds are required to be dense in the input space since we know they are used iteratively.
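The two criteria above can be sketched as a single check on the seeds recovered for one user in one hour. This is an illustration under stated assumptions: seeds are taken in query order, the difference is the absolute difference, and the threshold of 50 is applied to the total number of domains; the function name is hypothetical.

```python
from typing import Sequence

def seeds_are_dense(seeds: Sequence[int], min_domains: int = 50) -> bool:
    """Check one user's hourly window of recovered seeds.

    Passes only if there are enough domains and the summed gap between
    consecutive seeds does not exceed the square of the domain count.
    """
    n = len(seeds)
    if n < min_domains:
        return False
    # Assumption: absolute differences between seeds in query order.
    gap_sum = sum(abs(b - a) for a, b in zip(seeds, seeds[1:]))
    return gap_sum <= n * n

# Sixty consecutive seeds pass; sixty widely scattered seeds do not.
print(seeds_are_dense(list(range(1000, 1060))))
print(seeds_are_dense([i * 1_000_000 for i in range(60)]))
```

Seeds that merely collide with Pykspa v2 outputs by chance would be spread across the 32-bit space, so their gaps would be far larger than this bound allows.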
Moreover, we apply several additional verifications based on degenerate cases of the Pykspa v2 DGA that make it easily distinguishable from other DGAs. One such trademark of Pykspa v2 is that even-length domain names that end with the ".com" or ".org" TLDs consist only of even-indexed letters, counting from zero ("a", "c", "e", etc.). Such verifications further increase our confidence that we're observing Pykspa v2 AGDs.
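The even-indexed letter trademark lends itself to a simple filter. The sketch below assumes that "even length" refers to the label without the TLD and that letters are indexed from zero ("a" = 0); the function name is hypothetical.

```python
# Letters at even zero-based alphabet positions: a, c, e, ..., y.
EVEN_LETTERS = set("acegikmoqsuwy")

def violates_even_letter_rule(domain: str) -> bool:
    """Return True if the domain cannot be Pykspa v2 under this rule.

    The rule only applies to even-length labels under .com or .org;
    any other domain is never rejected by this check.
    """
    label, _, tld = domain.lower().rpartition(".")
    if tld not in ("com", "org") or len(label) % 2 != 0:
        return False
    return not set(label) <= EVEN_LETTERS

print(violates_even_letter_rule("acegik.com"))   # only even-indexed letters
print(violates_even_letter_rule("abcdef.com"))   # contains b, d, f
print(violates_even_letter_rule("abcdef.net"))   # rule does not apply
```

A candidate that violates the rule can be discarded as a chance collision, while one that satisfies it still needs the other checks.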
The missing top-level domain
The implementation of Pykspa v2 causes it to generate domain names with a distinctive distribution over the set of TLDs. More specifically, when computing the histograms of Pykspa domains that appear in Netlab 360 lists, the TLD distribution converges to 1/8 each for ".com" and ".org" and 3/8 each for ".net" and ".info".
Recently we detected a set of users querying Pykspa v2 domains that were identified using the above-mentioned method. However, the TLD distribution of their queried domains was different from that of Pykspa v2. More specifically, roughly 1/5 of the newly observed Pykspa domains end with ".com", 1/5 with ".org" and 3/5 with ".net", and none at all end with ".info" (see Figure 2).
Figure 2: The distribution of top-level domains (TLD) of the original Pykspa v2 vs. the newly observed version.
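A TLD histogram of this kind is straightforward to compute from a list of domains. The sketch below compares an observed distribution with the original Pykspa v2 shares quoted above; the function names and the sample data are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, Iterable, Mapping

# Original Pykspa v2 TLD shares, as quoted in the text.
PYKSPA_V2_EXPECTED = {
    "com": Fraction(1, 8),
    "org": Fraction(1, 8),
    "net": Fraction(3, 8),
    "info": Fraction(3, 8),
}

def tld_distribution(domains: Iterable[str]) -> Dict[str, float]:
    """Return the share of each TLD among the given domains."""
    counts = Counter(d.rsplit(".", 1)[-1].lower() for d in domains)
    total = sum(counts.values())
    return {tld: n / total for tld, n in counts.items()}

def max_deviation(observed: Mapping[str, float],
                  expected: Mapping[str, Fraction]) -> float:
    """Largest absolute gap between observed and expected TLD shares."""
    tlds = set(observed) | set(expected)
    return max(abs(observed.get(t, 0.0) - float(expected.get(t, 0)))
               for t in tlds)

# Made-up sample with the selective profile: no .info domains at all.
sample = ["a.com", "b.org", "c.net", "d.net", "e.net"]
dist = tld_distribution(sample)
print(dist)
print(max_deviation(dist, PYKSPA_V2_EXPECTED))
```

On real traffic, a large deviation driven by a missing ".info" bucket is the signal described above.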
A selective version of the Pykspa v2 DGA
After observing the distinctive TLD distribution, we traced back the set of seeds that generated the newly observed Pykspa v2 domain names and used them to generate entire sequences of Pykspa v2 AGDs. The results indicated that ".info" domains are indeed generated by the original seeds, but are later omitted alongside other generated domains.
We noticed several sequences of Pykspa v2 AGDs for which only a subset of the domains was ever queried in traffic (a clear example appears in Figure 3). More specifically, only 9/16 of the generated domains are ever queried in DNS traffic. The missing ".info" domains account for 3/8 = 6/16 of the generated domains, and the remaining 1/16 of filtered domains is omitted equally across the other TLDs.
Figure 3: The selective Pykspa v2 DGA selects a subset of the algorithmically generated domains to be queried by the malware.
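The fractions above are consistent with the observed TLD distribution, which can be verified with exact arithmetic. The sketch assumes the extra 1/16 is removed at the same rate from every remaining TLD, which is our reading of the observation rather than a confirmed property of the malware.

```python
from fractions import Fraction

# Original Pykspa v2 TLD shares, as quoted in the text.
original = {
    "com": Fraction(1, 8),
    "org": Fraction(1, 8),
    "net": Fraction(3, 8),
    "info": Fraction(3, 8),
}

# Drop .info entirely: 10/16 of the generated domains remain.
non_info = {t: s for t, s in original.items() if t != "info"}
non_info_total = sum(non_info.values())

# Only 9/16 are ever queried, so a further 1/16 is filtered out.
queried = Fraction(9, 16)
extra_dropped = non_info_total - queried
drop_rate = extra_dropped / non_info_total

# Assumption: the same drop rate applies to every remaining TLD,
# so the queried shares are the non-.info shares renormalized.
queried_shares = {t: s * (1 - drop_rate) / queried for t, s in non_info.items()}

print(extra_dropped, drop_rate)
print(queried_shares)
```

Under that assumption, one in ten non-".info" domains is skipped, and the shares among queried domains come out to 1/5 for ".com", 1/5 for ".org" and 3/5 for ".net", matching the distribution in Figure 2.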
The DNS traffic used for this study is worldwide DNS traffic that Akamai Enterprise Threat Protector (ETP) constantly uses to identify and block new DGA threats. The users that query the "selective" Pykspa domain names are concentrated in Indonesia (31.7%), Malaysia (22.5%) and Vietnam (17.7%).
We sampled 1,000 of the 17,029 generated domains that have exactly 12 characters and checked them against the security engines on VirusTotal. None of the sampled domain names had been previously detected. A link to a list of all the domains sampled can be found below. We also looked for these domains on Netlab 360, and the results were similar.
Our assumption is that the reason for the omission is a revised version of Pykspa with a reduced footprint, though we cannot be certain.
In order to pre-generate the selective algorithmically generated Pykspa domains, you can apply the Pykspa v2 code with the set of seeds provided below.
Pykspa is a widely spread computer worm that uses a DGA to communicate with its C2 servers. In this post, we've examined a new version of Pykspa that queries a small subset of the domain names generated by the DGA. It's possible this happens so that the threat actors can keep a quieter footprint. The most observed trademark of the new Pykspa version is that the generated domain names never end with the ".info" TLD. This stands in stark contrast to the original Pykspa code, in which 3/8ths of the generated names ended with ".info".
The domain names we sampled were not detected by any of the engines on VirusTotal, and they are widely queried across Southeast Asia. For other researchers and our partners in the security community, we've provided a set of seeds that can be used to pre-generate the set of Pykspa domains and a set of domain names so that your networks can be protected against them.
The set of 17,029 detected algorithmically generated domain names is available at https://pastebin.com/s9XtjEFW