Copying content from the Web can be both a good and a bad thing. I first wrote that line back in 2014, way before the era of AI that weaponized scraping. Scraping was once a good thing because data scientists and journalists used scrapers to track trends and uncover government abuses. One abuse no one anticipated back then was the massive CTRL-ALT-DEL of government web data, something the Data Rescue Project is now trying to counteract.
In those early days of the twenty-teens, what we mostly worried about was that someone had copied your Web content and was hosting it as their own elsewhere online. This happened with LinkedIn in 2014, when someone picked up thousands of personal profiles to use for their own recruiting purposes. That is a scary thought, indeed. I have had my own content copied by spurious sites numerous times too.
And lest you think this is difficult to do, there are numerous automated scraping tools that make it easy for anyone to collect content from anywhere, including BrightData and ScrapeBox. I won’t get into whether it is ethical to use these on a site whose content you don’t own. Some of these attack sites are very clever in how they go about their scraping, using massive numbers of ever-changing IP address blocks to obtain their content and evade firewalls.
When I wrote about this topic in 2014, we were mostly concerned with how the web search engines would scrape your pages to index, categorize, and rank them. I even did a video screencast review of a product called ScrapeDefender, which was clearly ahead of its time, so far ahead that it was eventually discontinued.
That seems so quaint now that we have gigantic AI-based hoovering, which is just the latest example of Scraping Gone Bad. To help out, a number of security vendors have arisen to offer protection against bad scraping, including Imperva’s Bot Protection and Cloudflare’s Bot Management.
The problem with AI-fueled scraping and training bots is severalfold. First, they operate at tremendous scale, something on the order of a well-crafted DDoS attack. This means that you may need to up your protection service to handle this level of traffic.
Second, these scraping bots don’t all operate the same way, which means defenders have to develop different mechanisms to try to stop them. Some site operators may be using something more homegrown, such as custom firewall rules that screen on geolocation or IP address. That could be trouble, especially if you don’t set your rules properly and end up creating all sorts of false-positive headaches when your protective measures block out legitimate human visitors. To add insult to injury, most bots ignore the dictates of a robots.txt file, a common preventative measure.
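To make this concrete, here is a minimal Python sketch of the kind of homegrown screening rule I mean, assuming a hand-maintained blocklist of crawler user-agent strings and IP ranges (the specific entries below are illustrative placeholders, not a vetted list). A rule like this only catches bots that announce themselves, and keeping those lists current by hand is exactly where the false-positive headaches come from.

```python
import ipaddress

# Hypothetical blocklist entries -- maintaining lists like these by hand is
# exactly where false positives creep in.
BLOCKED_UA_SUBSTRINGS = ["GPTBot", "CCBot", "Bytespider"]
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder (TEST-NET-3 range)

def looks_like_scraper(user_agent: str, client_ip: str) -> bool:
    """Return True when a request matches a blocked user-agent string or IP range."""
    ua = user_agent.lower()
    if any(marker.lower() in ua for marker in BLOCKED_UA_SUBSTRINGS):
        return True
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

if __name__ == "__main__":
    # A bot that announces itself, an IP-range match, and an ordinary visitor.
    print(looks_like_scraper("Mozilla/5.0 (compatible; GPTBot/1.0)", "198.51.100.7"))  # True
    print(looks_like_scraper("Mozilla/5.0 (Windows NT 10.0)", "203.0.113.42"))         # True
    print(looks_like_scraper("Mozilla/5.0 (Macintosh)", "192.0.2.10"))                 # False
```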
Third, even if these bots are scraping your site at lower traffic levels, they may be harder to detect as long as your site is still delivering pages to its visitors. This is because some major analytics platforms screen bot traffic out of their counting algorithms by default, so the scraping never shows up in your dashboards and getting an accurate picture of who, or what, is visiting your site becomes difficult. In some cases the bot traffic simply blends in with the background noise of the human-created internet.
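One rough way to see that gap is to compare your raw server logs with what the analytics dashboard reports. The Python sketch below assumes an Apache- or nginx-style combined-format log at a hypothetical path and simply tallies requests whose User-Agent self-identifies as a bot; anything spoofing a browser string lands in the “other” bucket, which is exactly the blending-in problem.

```python
import re
from collections import Counter

# Matches the tail of an Apache/nginx "combined" log line:
# ... "GET /page HTTP/1.1" 200 1234 "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"\s*$')
BOT_MARKERS = ("bot", "crawler", "spider", "scrapy")  # crude, illustrative heuristic

def tally(log_path: str) -> Counter:
    """Count requests from self-identified bots versus everything else."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = LOG_LINE.search(line)
            if not match:
                continue
            ua = match.group("ua").lower()
            counts["bot" if any(m in ua for m in BOT_MARKERS) else "other"] += 1
    return counts

if __name__ == "__main__":
    totals = tally("access.log")  # hypothetical log path
    total = sum(totals.values()) or 1
    print(f"self-identified bots: {totals['bot']} ({100 * totals['bot'] / total:.1f}%)")
    print(f"everything else:      {totals['other']}")
```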
Next, there is the woeful mess of trying to stop the bots through legal action. Few jurisdictions have put anything in place to go after the miscreants, even when you can identify the source and figure out which laws, if any, govern them.
Finally, there could be an added cost to using some of these tools. For example, AWS has its own Web Application Firewall that can stop bot traffic, but it charges extra for this service. That means you are paying Amazon twice: once to receive the bot traffic, and again to stop it. And the scraping itself can hike up your hosting fees, because the bots are often hitting less-visited pages that can’t be served from cache, which drives up usage charges from the hosting provider. Wikipedia found that 65% of its most expensive traffic was coming from bots scraping these less-visited pages.
These issues and others were part of a new report which highlights just how hard it is to defend against web scraping at this enormous scale for a very particular audience: the folks who maintain the digital collections of various online museums, libraries, archives, and galleries. This group has worked hard to expand their audience across the world, often with limited resources and with what the report describes as idiosyncratic technical architectures. The report surveyed 43 institutions, describes the extent of the problem, and outlines some of the countermeasures taken (most of which aren’t effective).
The report poses this question: “It is in the long-term interest of the entities swarming them with bots to find a sustainable way to access the data they are so hungry for. Will that long-term interest spur action before the online collections collapse under the weight of increased traffic?” This is somewhat ironic: popularity has its price.