The Akamai Blog

Bots, Crawlers Not Created Equal

A few months ago, Akamai Senior Enterprise Architect David Senecal wrote a post about ways to identify and mitigate unwanted bot traffic. Here at the Akamai Edge conference in Washington D.C., discussions around that continue -- specifically, how to squeeze the maximum usefulness out of bots and other Web crawlers.

Yesterday, I continued a discussion I've been having about that with Matt Ringel (@ringel on Twitter), an enterprise architect in Akamai's Professional Services team. (Check out Matt's recent post, "You Must Try, and Then You Must Ask.")

The first order of business was to throw cold water on the notion that all bots are the work of bad guys. 

"People think of bot armies descending on your site like locusts, killing your performance and wrecking your infrastructure," Matt said. "But in terms of commerce and the ability to do things like making price comparisons, some bots will give people faster access to your information, which is worthwhile in certain contexts."

To start down that road, let's break bots down into two categories:

  • The nasties that do nothing but weigh down your infrastructure (low usefulness, high load on resources).
  • Those that can be useful to your business if properly directed (high usefulness, with loads ranging from low to high).

Let's say you have a site that sells LED flashlights and you want potential customers to find you within seconds of a Google search. Price-comparison bots can help Google's own crawlers find you more quickly. Then Google can tell the user to "buy LED flashlights from these sites," including yours -- and, if you're lucky, with yours listed first.

For businesses, the question is how to get to "high usefulness, low load" as often as possible. That's where using an application programming interface (API) comes in handy. 

APIs are good for, among other things, setting up online partnerships with resource sharing. A business solution for mitigating the effect of high-load, high-usefulness crawlers is to offer the crawler's operator an API when the opportunity arises. For the partner, this is typically a far more efficient way to receive pricing data than crawling your website; for you, it replaces many page fetches with a few structured requests.
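
As a rough sketch of what such a partner API might look like -- all names here (PARTNER_KEYS, CATALOG, get_prices) are hypothetical, not an Akamai or customer API -- one authenticated, structured response can carry the same pricing data a crawler would otherwise assemble from dozens of product pages:

```python
# Hypothetical partner-facing pricing API. One structured response replaces
# the many page fetches a price-comparison crawler would otherwise make.
import json

PARTNER_KEYS = {"acme-compare": "k-123"}  # API keys issued per partner
CATALOG = {
    "LED-FL-100": {"name": "LED flashlight", "price_usd": 19.99, "in_stock": True},
    "LED-FL-200": {"name": "LED flashlight, hi-lumen", "price_usd": 34.99, "in_stock": False},
}

def get_prices(partner, api_key, skus):
    """Return pricing for the requested SKUs, or an error for bad credentials."""
    if PARTNER_KEYS.get(partner) != api_key:
        return {"error": "unauthorized"}
    return {sku: CATALOG[sku] for sku in skus if sku in CATALOG}

print(json.dumps(get_prices("acme-compare", "k-123", ["LED-FL-100"])))
```

The point of the design is the contract: the partner gets clean, versioned data, and you get one cheap request instead of a crawl.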

If a partnership isn't possible, periodically generating static versions of your site and directing bots there will lighten the load on your infrastructure. A bot won't interact with a dynamic website the way a user would, so there's no need to show it one.
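
The routing decision can be as simple as a user-agent check at the edge. A minimal sketch, assuming a nightly-regenerated snapshot directory -- the signature list and paths are illustrative, not a complete bot-detection scheme:

```python
# Route known crawler user agents to a pre-rendered static snapshot,
# leaving the dynamic site for human visitors. The UA substrings and
# backend names below are illustrative assumptions.
BOT_SIGNATURES = ("googlebot", "bingbot", "baiduspider", "yandexbot")

def pick_backend(user_agent: str) -> str:
    """Return the backend a request should be served from."""
    ua = user_agent.lower()
    if any(sig in ua for sig in BOT_SIGNATURES):
        return "/var/www/static-snapshot"  # regenerated periodically, e.g. nightly
    return "dynamic-app"
```

In practice you would serve the same content to bots and users (to stay within search engines' cloaking rules) -- just from cheap static files rather than from your application servers.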

An alternative technical solution is to set up network rate limits for aggressive bots, especially those that aren't very useful to you.
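
One common way to implement such a limit is a token bucket per client: each client may burst up to a fixed number of requests, then is held to a steady refill rate. A minimal sketch (the rate and capacity values are arbitrary examples):

```python
# Per-client rate limiting with a token bucket: each client may burst up to
# `capacity` requests, then is throttled to `rate` requests per second.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill tokens for elapsed time, then spend one if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # one bucket per client IP

def allow_request(ip: str) -> bool:
    bucket = buckets.setdefault(ip, TokenBucket(rate=2.0, capacity=5))
    return bucket.allow()
```

Real deployments usually enforce this at the edge or load balancer rather than in application code, but the mechanics are the same.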

Another way to slow down bots is browser testing -- planting a JavaScript "puzzle" the crawler needs to solve in order to proceed. If a bot isn't running a JavaScript engine, it won't be able to get through at all. Even if it has one -- some do -- the puzzle effectively rate-limits the bot by forcing it to spend more CPU per request.
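
One way to realize that idea is a small proof-of-work challenge: the server embeds a random token in the page, and the client's script must find a nonce whose hash meets a difficulty target before its next request is honored. The sketch below shows both sides in Python for brevity -- in a real deployment the solver would be the JavaScript shipped to the browser, and all names here are hypothetical:

```python
# Proof-of-work style browser check. A bot with no JS engine never answers;
# a bot that does run JS pays CPU for every request it makes.
import hashlib
import os

DIFFICULTY = "00"  # required leading hex zeros; raise to slow clients further

def issue_challenge() -> str:
    """Server side: generate a fresh random token to embed in the page."""
    return os.urandom(8).hex()

def verify(token: str, nonce: int) -> bool:
    """Server side: check that the client's nonce satisfies the target."""
    digest = hashlib.sha256(f"{token}:{nonce}".encode()).hexdigest()
    return digest.startswith(DIFFICULTY)

def solve(token: str) -> int:
    """What the client-side script would do, shown here for illustration."""
    nonce = 0
    while not verify(token, nonce):
        nonce += 1
    return nonce
```

At this difficulty a solution takes a few hundred hash attempts on average -- imperceptible to a human visitor, but a real cost multiplied across a crawler's thousands of requests.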

A more subtle way to foil web crawlers is to use a spider trap. Here's how it works: Since bots read pages and follow links for data, one way to get them hopelessly lost is by putting in a link that's invisible to the user -- white-on-white text, for example -- that the bot will most certainly see. That link, in turn, leads to dozens of pages with randomly generated data, all having dozens of their own links.
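
A sketch of a trap-page generator, assuming a hypothetical /trap/ URL path: each page is seeded by its ID, so it is cheap to serve and stable on revisits, yet every page links to a dozen more just like it.

```python
# Spider trap sketch: a link hidden from humans (white-on-white) leads to an
# endless family of generated pages, each linking to more of the same.
# The /trap/ path and styling are illustrative assumptions.
import random

HIDDEN_LINK = '<a href="/trap/0" style="color:#fff;background:#fff">.</a>'

def trap_page(page_id: int, links: int = 12) -> str:
    rng = random.Random(page_id)  # deterministic per page: cheap and cache-friendly
    hrefs = [f'<a href="/trap/{rng.randrange(10**9)}">item</a>' for _ in range(links)]
    filler = " ".join(str(rng.randrange(10**6)) for _ in range(50))
    return f"<html><body><p>{filler}</p>{''.join(hrefs)}</body></html>"
```

Remember to exclude the trap path in robots.txt so well-behaved crawlers like Googlebot never enter it -- only bots that ignore the rules get stuck.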

With variations of these techniques in place, the business is now in a much-improved position to sell products online, even in the presence of bots and other crawlers.

Catch the rest of my discussion with Matt next week in the next episode of the Akamai Security Podcast.

1 Comment


Thanks for sharing. Speaking of bad bots: is it recommended to block crawlers from small 'companies'? I noticed my site is being indexed by a lot of different bots, and I'm seriously wondering if I should open crawling only to Google, Baidu, Yahoo, and a few other reputable ones.

Hmm! What say you?