Wisconsin Web Scraping


Monday 16 February 2015

The Trouble With Bots, Spiders and Scrapers

With the Q4 State of the Internet - Security Report due out later this month, we continue to preview sections of it.

Earlier this week we told you about a DDoS attack from a group claiming to be Lizard Squad. Today we look at how
third-party content bots and scrapers are becoming more prevalent as developers seek to gather, store, sort and present
a wealth of information available from other websites.

These metasearch and aggregation services typically use APIs to access data, but many now rely on screen-scraping to collect the information instead.
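
For illustration, here is a minimal Python sketch of the two approaches side by side; the endpoint path, the response shape and the span.price selector are hypothetical placeholders, not any particular site's API or markup.

```python
# A minimal sketch contrasting API access with screen-scraping.
# Requires the third-party `requests` and `beautifulsoup4` packages.
import requests
from bs4 import BeautifulSoup

# 1) API access: structured data behind an explicit, documented contract.
def fetch_via_api(base_url: str) -> list:
    resp = requests.get(f"{base_url}/api/v1/rooms", timeout=10)  # hypothetical endpoint
    resp.raise_for_status()
    return resp.json()["rooms"]  # structured records, no parsing needed

# 2) Screen-scraping: parse the HTML that was meant for human eyes.
def fetch_via_scraping(base_url: str) -> list:
    resp = requests.get(f"{base_url}/rooms", timeout=10)  # the public page
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assumes prices are rendered in elements like <span class="price">...</span>
    return [tag.get_text(strip=True) for tag in soup.select("span.price")]
```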

As the use of bots and scrapers continues to surge, the burden on web servers increases. While most bot behavior is harmless, poorly coded bots can hurt site performance and resemble DDoS attacks, or they may be part of a rival's competitive-intelligence program.

Understanding the different categories of third-party content bots, how they affect a website, and how to mitigate their impact is an important part of building a secure web presence.

Specifically, Akamai has seen bots and scrapers used for such purposes as:

•    Setting up fraudulent sites
•    Reuse of consumer price indices
•    Analysis of corporate financial statements
•    Metasearch engines
•    Search engines
•    Data mashups
•    Analysis of stock portfolios
•    Competitive intelligence
•    Location tracking

During 2014 Akamai observed a substantial increase in the number of bots and scrapers hitting the travel, hotel and hospitality sectors. The growth in scrapers targeting these sectors is likely driven by the rise of rapidly developed mobile apps that use scrapers as the fastest and easiest way to collect information from disparate websites.

Scrapers target hotel room-rate pages and airline pricing and schedules. In many cases Akamai investigated, scrapers and bots made several thousand requests per second, far in excess of what could be expected from a human using a web browser.
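
That rate gap is itself a useful signal. The sketch below assumes a simplified access-log format of one IP address and UNIX timestamp per line, and an illustrative 100-requests-per-second ceiling; it simply flags clients whose per-second request rate no human browser could sustain.

```python
# A minimal sketch of flagging clients whose request rate is far beyond
# human browsing speed. The log format and the threshold are assumptions.
from collections import defaultdict

HUMAN_RATE_CEILING = 100  # requests per second; tune to your own traffic

def flag_suspected_bots(log_lines):
    per_ip_per_second = defaultdict(int)
    for line in log_lines:
        ip, timestamp = line.split()[:2]        # e.g. "203.0.113.7 1423526400"
        per_ip_per_second[(ip, int(float(timestamp)))] += 1

    suspects = set()
    for (ip, _second), count in per_ip_per_second.items():
        if count > HUMAN_RATE_CEILING:
            suspects.add(ip)                    # no human clicks this fast
    return suspects
```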

An interesting development in the use of headless browsers is the advent of companies that offer scraping as a service, such as PhantomJs Cloud. These sites make it easy for users to scrape content and have it delivered, lowering the barrier to entry and making it easier for unskilled individuals to scrape content while hiding behind a service.
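
As a rough illustration of what headless-browser scraping looks like, the sketch below uses Playwright's Python API as a stand-in; PhantomJs Cloud itself exposes a hosted API that is not shown here, and the URL is a placeholder.

```python
# A minimal headless-browser sketch. Requires `pip install playwright`
# followed by `playwright install chromium`. The target URL is hypothetical.
from playwright.sync_api import sync_playwright

def scrape_rendered_page(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let client-side JS finish rendering
        html = page.content()                     # the fully rendered DOM, not raw HTML
        browser.close()
    return html

# html = scrape_rendered_page("https://example.com/rooms")
```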

For each type of bot, there is a corresponding mitigation strategy.

The key to mitigating aggressive, undesirable bots is to reduce their efficiency. In most cases, highly aggressive bots are only helpful to their controllers if they can scrape a lot of content very quickly. Reducing a bot's efficiency through rate controls, tar pits or spider traps drives bot-herders elsewhere for the data they need.
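
A minimal sketch of what that might look like in practice is shown below: a per-client token bucket for rate control, a tar-pit delay for clients that exhaust it, and a hidden spider-trap path that only automated crawlers ever follow. The thresholds, delay and trap path are illustrative assumptions, not any specific product's settings.

```python
# A sketch of "reducing the bot's efficiency": token-bucket rate control,
# a tar-pit delay, and a spider-trap path. All values are illustrative.
import time
from collections import defaultdict

RATE = 5            # tokens refilled per second, per client
BURST = 20          # bucket capacity
TARPIT_DELAY = 10   # seconds to stall clients that exhaust their bucket
TRAP_PATH = "/.hidden-catalog/"   # linked only from markup humans never see

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})
_trapped = set()

def handle_request(client_ip: str, path: str) -> str:
    # Spider trap: real users never follow the hidden link, so anyone
    # requesting it is almost certainly an automated crawler.
    if path.startswith(TRAP_PATH) or client_ip in _trapped:
        _trapped.add(client_ip)
        time.sleep(TARPIT_DELAY)     # waste the bot's time, not your bandwidth
        return "200 OK (decoy content)"

    bucket = _buckets[client_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now

    if bucket["tokens"] < 1:
        time.sleep(TARPIT_DELAY)     # tar pit: slow the client rather than block it
        return "429 Too Many Requests"

    bucket["tokens"] -= 1
    return "200 OK"
```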

Aggressive but desirable bots are a slightly different problem. These bots adversely impact operations, but they bring a benefit to the organization. Therefore, it is impractical to block them fully. Rate controls with a high threshold, or a user-prioritization application (UPA) product, are a good way to minimize the impact of a bot. This permits the bot access to the site until the number of requests reaches a set threshold, at which point the bot is blocked or sent to a waiting room. In the meantime, legitimate users are able to access the site normally.
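
The sketch below illustrates that idea with an assumed per-minute threshold and a hypothetical /waiting-room URL: requests are served normally until a client exceeds the cap within the window, at which point that client alone is redirected to the waiting room while everyone else is unaffected.

```python
# A sketch of a high-threshold control for desirable but aggressive bots.
# The window, threshold and waiting-room URL are illustrative assumptions.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 600                       # generous cap: normal users never come close
WAITING_ROOM_URL = "/waiting-room"    # hypothetical holding page

_history = defaultdict(deque)         # client -> timestamps of recent requests

def route_request(client_id: str) -> str:
    now = time.monotonic()
    window = _history[client_id]
    window.append(now)
    # Drop timestamps that have aged out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()

    if len(window) > THRESHOLD:
        # Over the threshold: park this client without affecting other users.
        return f"302 Redirect -> {WAITING_ROOM_URL}"
    return "200 OK (served normally)"
```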

Source: https://blogs.akamai.com/2015/01/performance-mitigation-bots-spiders-and-scrapers.html
