In this article we'll review how to handle known bot traffic.
As discussed in the first part, you may not be comfortable serving content to all legitimate bots for various reasons. But even when you're willing to serve content to known bots, several options are available. Just like for unknown bots, you'll have to decide on the response strategy that works best for you.
To help decide which response strategy to adopt, think about how much tolerance you have for serving stale content, and how much risk you are willing to accept that those "known bots" occasionally overwhelm your origin. Yes, some of those legitimate bots can sometimes be aggressive.
If you are running Bot Manager alongside Kona Site Defender (KSD) or the Akamai WAF solution, and you want to ensure that the rate control, network control and client reputation features configured in KSD / WAF never interfere with bot activity, you should consider using the "Allow" action. This is especially useful for the Web Search Engines and Online Advertising categories, where it helps preserve your site's SEO and ad ranking.
However, if you want some protection against bots that send requests at an excessive rate or that come from known bad IP addresses, you should consider the "Monitor" action. In this case, the reputation, rate control and network control rules defined in the KSD / WAF configuration still apply. This can help prevent known bot activity from making the web site unstable but, unlike the "Allow" action, it could affect the site's ranking when used with the Web Search Engines and Online Advertising categories.
In both of the above cases, the caching strategy defined in the content delivery configuration applies. Some of the content requested by bots may not be cached by default for various reasons. You always want to serve the most up-to-date content to your real users, but it may be acceptable to serve stale content to bots. If this is the case, the "Serve from cache" strategy may be the most appropriate. Adopting a long caching TTL (at least 1 day) for the content served to bots helps offload your origin, lets it focus on serving real user traffic, and improves overall site performance.
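The trade-offs above can be summarized as a small decision helper. This is only an illustrative sketch: the function name, its parameters and the priority given to "Serve from cache" over the other two actions are assumptions made for this example, not part of Bot Manager or its API. The action names are the ones used in this article.

```python
def suggest_action(stale_content_ok: bool, need_waf_controls: bool) -> str:
    """Suggest a response strategy for a known bot category you are willing to serve.

    Hypothetical helper for illustration only; it does not call any Akamai API.

    stale_content_ok:  serving cached, possibly stale, content to bots is acceptable.
    need_waf_controls: the rate control, network control and client reputation
                       rules configured in KSD / WAF should still apply to bots.
    """
    if stale_content_ok:
        # Assumption: when stale content is acceptable, offloading the origin
        # is treated as the first priority.
        return "Serve from cache"
    if need_waf_controls:
        # Keep KSD / WAF protections active, accepting a possible ranking impact.
        return "Monitor"
    # Nothing should interfere with the bot (for example, to protect SEO).
    return "Allow"


if __name__ == "__main__":
    print(suggest_action(stale_content_ok=False, need_waf_controls=False))
    print(suggest_action(stale_content_ok=False, need_waf_controls=True))
    print(suggest_action(stale_content_ok=True, need_waf_controls=True))
```

In practice you would make this choice per bot category rather than once for the whole site, since your tolerance for stale content may differ between, say, search engines and other known bots.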
Now, for the categories of known bots that you do not want to serve content to, it's OK to block them. It's very unlikely that they will morph, because these bots are built to fulfill a business process and work in the public interest. They may target a specific industry, but they are unlikely to target a specific company. Therefore, if you block them, they will not try to change their signature.
As an alternative to blocking, you may consider a friendlier approach and let the bot know that it is not welcome on your site. Most known bots follow the directives in robots.txt, a de facto standard used by most legitimate bots. Their level of understanding of the different directives may vary. For example, most would understand the "disallow" directive but not necessarily more recent ones like "allow" or "crawl-delay". Therefore, it is best to stick to the basic directives and mostly use "disallow" to avoid any misunderstanding. More details on the robots.txt standard can be found here:
Here's an example of a directive that should be added to robots.txt to tell a bot not to crawl a site (in this example ):
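As a minimal sketch, assume a bot that identifies itself with the user-agent token "ExampleBot". That name and the example.com URLs are purely hypothetical placeholders. The robots.txt text is the string at the top of the block; the Python around it only uses the standard library's urllib.robotparser to check that the rules behave as intended.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content: "ExampleBot" is a hypothetical user-agent token.
ROBOTS_TXT_BLOCK_SITE = """\
User-agent: ExampleBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/products/"
    # The named bot is asked to stay away from the whole site...
    print("ExampleBot:", is_allowed(ROBOTS_TXT_BLOCK_SITE, "ExampleBot", page))
    # ...while bots not named in the file are unaffected.
    print("OtherBot:", is_allowed(ROBOTS_TXT_BLOCK_SITE, "OtherBot", page))
```

Only the two lines inside the string belong in the robots.txt file itself. Keep in mind that this is a request, not an enforcement mechanism: it only works for bots that choose to honor robots.txt.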
To tell a known bot not to crawl a specific URL path:
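A path-level rule could look like the sketch below. Again, the "ExampleBot" user-agent token, the "/private/" path and the example.com URLs are hypothetical values chosen for illustration, and urllib.robotparser from the Python standard library is used only to verify the effect of the rule.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content: hypothetical bot name and hypothetical path.
ROBOTS_TXT_BLOCK_PATH = """\
User-agent: ExampleBot
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # URLs under the disallowed path are off limits for the named bot...
    print(is_allowed(ROBOTS_TXT_BLOCK_PATH, "ExampleBot",
                     "https://www.example.com/private/report.html"))
    # ...but the rest of the site remains crawlable.
    print(is_allowed(ROBOTS_TXT_BLOCK_PATH, "ExampleBot",
                     "https://www.example.com/products/"))
```

The "Disallow" value is matched as a path prefix, so the trailing slash restricts the rule to that directory and everything beneath it.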
For more information about the known bot categories, please review the Bot Manager user guide available in the Luna Control Center or talk to your account representative.
Read the first two parts of the series here: