
Identifying and mitigating unwanted bot traffic

All websites connected to the public Internet receive bot traffic on a daily basis. A recent study shows that bots drive 16% of Internet traffic in the US; in Singapore the figure reaches 56%. Should you be worried? Not necessarily. Not all bot traffic is bad, and some of it is even vital to the success of a website. Websites are also affected differently depending on the profile of the company, the value of its content, and the popularity of the site.

 

Defining bots

What are the different types of bots?

  • White bots (good), such as search engines (Google, Bing, Baidu), help drive more customers to the site and therefore increase revenue. They also help monitor site availability and performance (Akamai site analyzer, Keynote, Gomez) and proactively look for vulnerabilities (Whitehat, Qualys).
  • Black bots (bad) send additional traffic to the site that may impact its availability and integrity. Bad bot traffic can drive customers away from the site and negatively impact revenue and the website's reputation. Examples include hackers trying to bring down a site with a DDoS attack or exposing and exploiting vulnerabilities, and competitors or other actors scraping a site to harvest pricing information for financial gain.
  • Grey bots (neutral) don't necessarily help drive more customers to the site, nor do they specifically seem to cause any harm. Their identity and intent are more difficult to define; they usually present the characteristics of a bot but are generally non-aggressive. Such traffic only occasionally causes problems, typically due to a sudden increase in request rate.

 

Identifying bot traffic

Dealing with bot traffic can be challenging, and proactive measures should be taken to prevent any negative impact on the site. Monitoring bot activity is key. One thing that most bots have in common is that they only request base HTML pages, which usually contain the valuable information but are also more process-intensive for the web server to generate. Bots generally never request the embedded objects (images, JavaScript, Cascading Style Sheets) because the client doesn't need to render the full page. A simple way to spot this pattern in access logs is sketched below.
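To make this concrete, here is a minimal sketch (in Python, not an Akamai tool) that scans a web server access log in combined format and flags client IPs that request pages but never fetch any embedded objects. The log file name, regular expression, extension list, and threshold are assumptions for the example.

    # Flag client IPs that request pages but never fetch embedded objects.
    import re
    from collections import defaultdict

    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*"')
    EMBEDDED = ('.js', '.css', '.png', '.jpg', '.jpeg', '.gif', '.ico', '.svg', '.woff')

    page_hits = defaultdict(int)      # base page requests per client IP
    object_hits = defaultdict(int)    # embedded object requests per client IP

    with open('access.log') as log:
        for line in log:
            match = LOG_LINE.match(line)
            if not match:
                continue
            ip, path = match.groups()
            path = path.split('?', 1)[0].lower()
            if path.endswith(EMBEDDED):
                object_hits[ip] += 1
            else:
                page_hits[ip] += 1

    # Clients with many page requests and no embedded objects look like bots.
    for ip, pages in sorted(page_hits.items(), key=lambda kv: -kv[1]):
        if pages >= 50 and object_hits[ip] == 0:
            print(f'{ip}: {pages} page requests, 0 embedded objects -> likely bot')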

 

Now that we know how to find bot traffic, the next step is to identify the different types of bots.

  • White bot traffic is usually predictable. It has a specific header signature and comes from IPs belonging to the companies managing the bot. It is possible to control what these bots can request on the site through robots.txt or through the administration interface of the service managing the bot activity.
  • Black bots' header signatures vary widely, from exactly mimicking a genuine browser or search engine request to presenting several anomalies, such as missing headers or atypical headers in the request (a header check of this kind is sketched after this list). Black bots may also send requests at a higher rate.
  • Grey bot traffic can be more challenging to identify since it generally presents the same characteristics as black bot traffic.
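To illustrate the kind of header-signature check described above, the following sketch flags requests that are missing headers a real browser normally sends, or whose User-Agent contains a keyword commonly used by self-identifying bots. The header list and keywords are assumptions for the example, not an Akamai rule set.

    # Classify a request based on its header signature.
    BROWSER_HEADERS = ('user-agent', 'accept', 'accept-language', 'accept-encoding')
    BOT_KEYWORDS = ('bot', 'crawler', 'spider', 'scrapy', 'python-requests', 'curl')

    def classify_headers(headers):
        """Return a rough classification for a dict of request headers."""
        lower = {name.lower(): value for name, value in headers.items()}
        ua = lower.get('user-agent', '').lower()
        if any(keyword in ua for keyword in BOT_KEYWORDS):
            return 'self-identified bot'       # check against the robots.txt policy
        missing = [h for h in BROWSER_HEADERS if h not in lower]
        if missing:
            return 'suspicious: missing ' + ', '.join(missing)
        return 'browser-like'

    # A request missing Accept-Language and Accept-Encoding looks anomalous.
    print(classify_headers({'User-Agent': 'Mozilla/5.0', 'Accept': '*/*'}))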

 

In order to effectively identify bot activity, it is necessary to implement and deploy a set of rules that look at the traffic from different perspectives. Several features of the Kona Site Defender product can help:

  • The WAF application layer controls feature consists of the ModSecurity core rule set and the Akamai common rule set. Some of the rules are specifically designed to look for anomalies in the headers, or for known bot signatures in the User-Agent header value or in combinations of headers in the request.
  • The rules mentioned above can be complemented with several WAF custom rules to help identify specific header signatures.
  • The WAF adaptive rate controls feature can also be used to monitor excessive request rates from individual clients (the sketch after this list illustrates the idea behind such per-client rate checks).
  • Lastly, the User Validation Module (UVM) can be used to perform client-side validation in extreme situations when none of the "traditional" methods seem to help.
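The sketch below illustrates the idea behind a per-client rate check: count requests per client IP in a sliding time window and flag clients that exceed a threshold. This is only an illustration, not Kona Site Defender's implementation; the window size and limit are assumptions.

    # Flag clients that exceed a request-rate threshold in a sliding window.
    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10
    MAX_REQUESTS = 100                     # allowed requests per client per window

    request_times = defaultdict(deque)     # client IP -> timestamps of recent requests

    def over_rate_limit(client_ip, now=None):
        """Record one request and return True if the client exceeds the limit."""
        now = time.time() if now is None else now
        window = request_times[client_ip]
        window.append(now)
        while window and window[0] < now - WINDOW_SECONDS:
            window.popleft()
        return len(window) > MAX_REQUESTS

    # Example: the 101st request within the same window is flagged.
    for _ in range(101):
        flagged = over_rate_limit('203.0.113.7', now=1000.0)
    print(flagged)   # True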

 

Mitigating bot traffic

Once bot traffic is identified, the next step is to decide what to do with the black and grey bot traffic. You may decide to just monitor the traffic over time and only take action should the activity become too aggressive and represent a threat to the stability of the website. Or you may decide to take action as soon as the activity is identified, regardless of the volume of traffic generated. The type of action taken may vary depending on your business needs:

  • Deny the traffic: this is the default but least elegant solution; the client will receive an HTTP 4xx or 5xx response code. This gives the bot operator a clear indication that such activity is not allowed on the site and that they have been identified by some security service or device. Bot operators may then vary the format of their requests to see if they can stay under the radar.
  • Serve alternate content: the content served could vary from a generic "site unavailable" page to something that looks like a real response but contains only generic data. This strategy may slow down the bot operator and keep them in the dark as to why they cannot access the data they want.
  • Serve a cached / stale / static version of the content: this is the best strategy of all, but it is not always possible to implement; some content simply cannot be cached or stored as static data on an alternate origin because of compliance concerns or its dynamic nature. It could take the bot operator some time to realize the data they are getting is worthless, and an attacker running a DDoS against the site may also get discouraged and move on to a different target. A simple dispatch of these three actions is sketched after this list.
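As a simple illustration of how these three actions might be dispatched once a request has been classified as unwanted bot traffic (a sketch only; the action names, responses, and stale cache are assumptions, not an Akamai API):

    # Map a chosen mitigation action to an (HTTP status, body) response.
    STALE_CACHE = {'/pricing': '<html>Prices last refreshed some time ago...</html>'}

    def mitigate(path, action):
        if action == 'deny':
            # Least elegant: the bot operator immediately learns it was detected.
            return 403, 'Forbidden'
        if action == 'alternate':
            # Generic page, or a realistic-looking response with only generic data.
            return 200, '<html>Site temporarily unavailable</html>'
        if action == 'serve_stale':
            # Best option when possible: serve a cached/static copy so the origin
            # is untouched and the scraper receives stale, low-value data.
            return 200, STALE_CACHE.get(path, '<html>Site temporarily unavailable</html>')
        return None                        # monitor only: pass the request through

    print(mitigate('/pricing', 'serve_stale'))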


David Senecal is senior enterprise architect at Akamai.  Patrice Boffa is a director of global service delivery at Akamai.



2 Comments

Great summary on bot traffic! Here's some additional helpful detail on identifying and filtering traffic created by Keynote's measurement agents.

A Keynote real browser agent can be filtered using the keyword "KTXN", which looks like this in the user agent string:

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; KTXN B502737380A48550T1060097)

Note: the string which follows KTXN is dynamic. It contains a timestamp and unique identifier for the measurement being run by the agent.

A Keynote emulated browser agent can be filtered with the keyword "Keynote" and looks like this in a user agent string:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Keynote)
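For example, a log-processing script could exclude Keynote measurement agents by checking for those keywords in the User-Agent string (an illustration only, not a Keynote-provided tool):

    # Skip Keynote measurement agents when counting real-user traffic.
    def is_keynote_agent(user_agent):
        return 'KTXN' in user_agent or 'Keynote' in user_agent

    print(is_keynote_agent('Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; '
                           'Trident/5.0; KTXN B502737380A48550T1060097)'))                  # True
    print(is_keynote_agent('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Keynote)'))  # True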

Hey!

Thanks for sharing this post (and tips). Recently, been hit pretty bad with bot traffic and really pulling my hair because of that!

Appreciate it especially for the explanation!
~Reginald
