Bots, Crawlers Not Created Equally

A few months ago, Akamai Senior Enterprise Architect David Senecal wrote a post about ways to identify and mitigate unwanted bot traffic. Here at the Akamai Edge conference in Washington D.C., discussions around that continue -- specifically, how to squeeze the maximum usefulness out of bots and other Web crawlers.

Yesterday, I continued a discussion I've been having about that with Matt Ringel (@ringel on Twitter), an enterprise architect in Akamai's Professional Services team. (Check out Matt's recent post, "You Must Try, and Then You Must Ask.")

The first order of business was to throw cold water on the notion that all bots are the work of bad guys. 

"People think of bot armies descending on your site like locusts, killing your performance and wrecking your infrastructure," Matt said. "But in terms of commerce and the ability to do things like making price comparisons, some bots will give people faster access to your information, which is worthwhile in certain contexts."

To start down that road, let's break bots down into two categories:

  • The nasties that do nothing but weigh down your infrastructure (low usefulness, high load on resources).
  • Those that can be useful to your business if properly directed (high usefulness, with loads ranging from low to high).

Let's say you have a site that sells LED flashlights and you want potential customers to find you within seconds of a Google search. Price-comparison bots can help Google's own crawlers find you more quickly. Then Google can tell the user to "buy LED flashlights from these sites," including yours, and -- if you're lucky -- starting with yours.

For businesses, the question is how to get to "high usefulness, low load" as often as possible. That's where using an application programming interface (API) comes in handy. 

APIs are good for, among other things, setting up online partnerships built on resource sharing. One business-level way to mitigate the impact of high-load, high-usefulness crawlers is to offer an API to the party running them, if the opportunity arises. For that partner, pulling structured data from an API is typically far more efficient than crawling your website for pricing information.
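
To make that concrete, here's a rough sketch of what such a pricing API could look like, using nothing but Python's standard library. The product names, prices, and the /prices path are made-up examples, not anything from a real integration.

    # Minimal sketch of a pricing API, standard library only.
    # Product names, prices, and the /prices path are hypothetical.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PRICES = {
        "led-flashlight-small": 9.99,
        "led-flashlight-large": 19.99,
    }

    class PricingAPI(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/prices":
                body = json.dumps(PRICES).encode("utf-8")
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8000), PricingAPI).serve_forever()

One structured JSON request like that replaces hundreds of HTML page fetches, which is exactly the "high usefulness, low load" trade-off described above.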

If there's no way to form a partnership, periodically creating static versions of your site and directing bots to those copies will lighten the load on your infrastructure. A bot will not interact with a dynamic website the way a user would, so there is no need to show it one.
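
What might that look like in practice? Here's a small Python sketch of a snapshot job that writes a handful of dynamic pages out as flat HTML files; the hostname and paths are hypothetical, and the actual routing of bot user agents to those files would happen in your web server or CDN configuration.

    # Rough sketch of a periodic "static snapshot" job. The hostname and the
    # page paths below are hypothetical; substitute your own.
    import pathlib
    import urllib.request

    SITE = "https://www.example.com"
    PAGES = ["/", "/flashlights", "/flashlights/led"]
    SNAPSHOT_DIR = pathlib.Path("static_snapshot")

    def snapshot():
        for path in PAGES:
            html = urllib.request.urlopen(SITE + path).read()
            out_dir = SNAPSHOT_DIR / (path.strip("/") or "index")
            out_dir.mkdir(parents=True, exist_ok=True)
            (out_dir / "index.html").write_bytes(html)

    if __name__ == "__main__":
        # Run from cron (hourly, say); your web server or CDN then routes
        # known bot user agents to these files instead of the dynamic app.
        snapshot()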

An alternate technical solution is to set up network rate limits for aggressive bots, especially if they're not very useful to you. 
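
A rate limit can be as simple as a token bucket keyed on the client's IP address or user agent. Here's a minimal Python sketch of the idea; the numbers are purely illustrative.

    # Minimal token-bucket rate limiter keyed by client IP (or user agent).
    # The rate and burst numbers are illustrative, not recommendations.
    import time
    from collections import defaultdict

    RATE = 2.0    # tokens added per second
    BURST = 10.0  # maximum bucket size

    _buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow(client_id: str) -> bool:
        """Return True if this client's request should be served right now."""
        bucket = _buckets[client_id]
        now = time.monotonic()
        bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
        bucket["last"] = now
        if bucket["tokens"] >= 1.0:
            bucket["tokens"] -= 1.0
            return True
        return False  # drop the request, delay it, or return a 429

    # A client hammering the site burns through its burst allowance quickly.
    for i in range(15):
        print(i, allow("203.0.113.7"))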

Another way to slow down bots is through browser testing -- planting a JavaScript "puzzle" the crawler needs to solve in order to proceed. If a bot isn't running a JavaScript engine, it won't be able to get through. Even if it has one -- some do -- the puzzle effectively rate-limits the bot by forcing it to spend more CPU per request.
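
One common form of such a puzzle is a small proof-of-work: the server hands the browser a random nonce, and the page's JavaScript has to brute-force an answer whose hash meets a difficulty target. The browser-side loop is left out here, but a Python sketch of the server's half might look like this.

    # Server side of a JavaScript "puzzle" done as a small proof-of-work:
    # hand the browser a random nonce, and only accept the request once its
    # script finds an answer whose hash starts with enough zeros.
    import hashlib
    import os

    DIFFICULTY = 4  # leading hex zeros required; purely illustrative

    def issue_challenge() -> str:
        # Sent to the browser; the page's JavaScript brute-forces an answer.
        return os.urandom(8).hex()

    def verify(nonce: str, answer: str) -> bool:
        digest = hashlib.sha256((nonce + answer).encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    # A bot with no JavaScript engine never produces an answer at all; one
    # that does still burns CPU on every request, which acts as a rate limit.
    nonce = issue_challenge()
    print(verify(nonce, "not-a-real-answer"))  # almost certainly False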

A more subtle way to foil web crawlers is to use a spider trap. Here's how it works: Since bots read pages and follow links for data, one way to get them hopelessly lost is by putting in a link that's invisible to the user -- white-on-white text, for example -- that the bot will most certainly see. That link, in turn, leads to dozens of pages with randomly generated data, all having dozens of their own links.
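
Here's a toy version of that trap in Python -- one hidden link for the bot to stumble on, and a generator that produces as many pages of random text and onward links as the crawler cares to fetch. The /trap paths and the white-on-white styling are just for illustration.

    # Toy spider trap: a link invisible to users leads to endlessly generated
    # pages of random text, each full of links to more trap pages.
    import random
    import string

    HIDDEN_LINK = '<a href="/trap/0/0" style="color:#fff;background:#fff">.</a>'

    def trap_page(depth: int) -> str:
        words = " ".join(
            "".join(random.choices(string.ascii_lowercase, k=6)) for _ in range(50)
        )
        links = " ".join(
            f'<a href="/trap/{depth + 1}/{i}">more</a>' for i in range(20)
        )
        return f"<html><body><p>{words}</p>{links}</body></html>"

    # A crawler that follows the hidden link wanders through random pages
    # indefinitely, while a human visitor never sees it.
    print(trap_page(0)[:120])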

With variations of these techniques in place, the business is now in a much-improved position to sell products online, even in the presence of bots and other crawlers.

Catch the rest of my discussion with Matt next week in the next episode of the Akamai Security Podcast.

1 Comment

Hey!

Thanks for sharing. Speaking of bad bots: is it recommended to block crawlers from small 'companies'? I noticed my site is being indexed by a lot of different bots, and I'm seriously wondering if I should open the crawl only to Google, Baidu, Yahoo and a few other reputable ones.

Hmm! What say you?

Reginald
