
Search Engine Impersonation: The Wolf in Sheep's Clothing

All companies with a web presence want search engines to crawl their sites and index their content because it's the easiest way to drive traffic, improve visibility and increase business.

Because companies want search engines to crawl their sites regularly and index as much content as possible, they usually assume that all search engine requests are legitimate and don't pay much attention to them.
However, not all bots represent legitimate requests to crawl a company's site. Akamai has recently observed a growing number of requests that were not initiated by credible search engines. On average, Akamai sees around 0.65% of all requests coming from bots such as Googlebot or Bingbot. On a typical site seeing 160,000,000 requests, that is 1,040,000 requests coming from bots. Customers have very few options when it comes to determining how many of those one-million-plus requests are legitimate: trying to validate every crawler in real time with a reverse DNS lookup and a forward DNS lookup has a performance impact that could affect your search engine rankings.

How easy is it to impersonate a search engine crawler?

It's easy to crawl content while impersonating a search engine's user-agent. Most search engines have webmaster pages that describe their crawler user-agents.

Crawler: Googlebot (Google Web search)
User-agent: Googlebot
User-agent in HTTP(S) requests: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) or (rarely used): Googlebot/2.1


Once you have the user-agent string, you can leverage a browser plugin such as User-Agent Switcher for Chrome to impersonate Googlebot.
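
Even without a plugin, spoofing the header takes only a few lines of code. The sketch below is a minimal illustration (not taken from any real attack tool) that requests a placeholder URL while advertising the full Googlebot user-agent string from the table above:

import urllib.request

# Hypothetical target URL, used here purely for illustration.
url = "https://www.example.com/"

# Advertise the Googlebot user-agent string documented on Google's webmaster pages,
# along with a Google referrer, exactly as an impersonator would.
request = urllib.request.Request(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Referer": "http://www.google.com/",
    },
)

with urllib.request.urlopen(request) as response:
    # From the headers alone, the server cannot tell this request is not from Google.
    print(response.status, len(response.read()), "bytes")

Nothing in the headers identifies the real client; only the source IP address can give the impersonator away, which is why the DNS verification described later in this post matters.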

If you use the HOIC (High Orbit Ion Cannon) client, this happens by default: by selecting "GenericBoost.hoic" under the Turbo Booster section, you will impersonate Googlebot without even knowing it. Looking at the booster script in detail, you can see that by default GenericBoost.hoic performs both user-agent and referrer impersonation.

// populate list
[...]
useragents.Append "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
useragents.Append "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 5.1; .NET CLR 1.1.4322)"
useragents.Append "Googlebot/2.1 ( http://www.googlebot.com/bot.html) "
useragents.Append "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14"
[...]

// populate referer list
referers.Append "http://www.google.com/?q="+URL
referers.Append URL
referers.Append "http://www.google.com/"
referers.Append "http://www.yahoo.com/"

How big of a problem is this really?

To understand the size of this particular search engine impersonation issue, we analyzed 160 million client requests across multiple web sites on the Akamai platform and looked at those matching Googlebot and Bingbot user-agents:

  • Approximately 530,000 advertised themselves as Googlebot
  • Approximately 510,000 advertised themselves as Bingbot

How many requests were impersonating a search engine?

In order to understand how many of those requests were impersonators, we followed the webmaster best practices published by Google and Bing: a reverse DNS lookup on the client IP address, followed by a forward DNS lookup on the resulting hostname.

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
(Source: Google, https://support.google.com/webmasters/answer/80553)
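
This manual check can also be scripted. The following sketch is a minimal illustration (not Akamai production code) of the same reverse-then-forward verification using Python's standard socket module; the allowed hostname suffixes come from Google's and Bing's webmaster documentation:

import socket

def is_genuine_crawler(ip, allowed_suffixes=(".googlebot.com", ".google.com", ".search.msn.com")):
    """Reverse DNS lookup, hostname check, then forward DNS lookup back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)             # reverse DNS lookup
    except socket.herror:
        return False
    if not hostname.rstrip(".").endswith(allowed_suffixes):   # must belong to the search engine's domain
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]    # forward DNS lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                                  # hostname must resolve back to the original IP

print(is_genuine_crawler("66.249.66.1"))   # the Googlebot address from the example above

As noted earlier, running these lookups inline for every request is expensive, which is exactly the performance concern mentioned at the start of this post.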

The results of this verification showed that, of the requests advertising Googlebot and Bingbot user-agents:

  • 5% of the requests advertised as Googlebot were not from Google
  • 1.2% of the requests advertised as Bingbot were not from Bing

Putting things into perspective

Out of 160 million requests, roughly 32,500 came from bots or users impersonating these two search engine crawlers.
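
(That figure follows from the percentages above: roughly 5% of the ~530,000 Googlebot-advertised requests plus 1.2% of the ~510,000 Bingbot-advertised requests, or about 26,500 + 6,100 ≈ 32,600.)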

Even if the results vary by customer industry and traffic, the relatively low percentage of impersonators suggests that search engine impersonation is used less for volumetric attacks than for application layer attacks. Application layer attacks are more difficult to detect than volumetric attacks and often cause more damage. Cross-site scripting (XSS) attacks, for example, can lead to website defacement and brand damage, while SQL injection attacks can lead to data theft and regulatory fines.

The bottom line is that it only takes one of those 32,500 requests to infiltrate a site and steal or manipulate data. So while businesses tend to view bots as "friendly" to their sites, they need to be aware of how attackers use impersonation to access data. Search engine impersonators are a kind of wolf in sheep's clothing: they pose as the bots that help companies raise their search rankings, but actually steal data once they get past the firewall.

How can Akamai help?

The Akamai Professional Services team has been actively helping customers mitigate search engine impersonators in multiple ways, for example:

  • Providing real-time visibility in the Security Monitor section of the Akamai Luna Control Center, allowing customers to see the activity and take the appropriate actions, for instance defining custom rules to fit a particular customer's application or website, or turning an existing rule from "alert" to "deny" in order to block malicious requests outright.
  • Predefining rules to identify search engine impersonators and redirect them away from customer applications and into a "honeypot" that logs their activities, so Akamai can learn from their behaviors and better protect the customer going forward.

Contact Akamai Professional Services today to arrange a technical call to discuss your traffic protection strategy and how Akamai can help you evaluate potential search engine impersonator activity.

This is a post from Patrice Boffa, director of global service delivery, and Ribhu Shekhar, solutions architect at Akamai.
