Stop Web Scraping Now!

Copying content from the Web can be both a good and a bad thing. I first wrote that line back in 2014, way before the era of AI that weaponized scraping. It was once a good thing because scrapers were used by data scientists and journalists to track trends and uncover government abuses. One abuse that wasn’t thought of was the massive CTRL-ALT-DEL of government web data, something that the Data Rescue Project is trying to counteract.

In those early days of the twenty-teens, what we mostly worried about was that someone had copied your Web content and was hosting it as their own elsewhere online. This happened with LinkedIn in 2014, where someone picked up thousands of personal profiles to use for their own recruiting purposes. That is a scary thought, indeed. I have had my own content copied by spurious sites numerous times too.

And lest you think this is difficult to do, there are numerous automated scraping tools that make it easy for anyone to collect content from anywhere, including BrightData and ScrapeBox. I won’t get into whether it is ethical to use these on a site whose content you don’t own. Some of these attack sites are very clever in how they go about their scraping, using massive numbers of ever-changing IP address blocks to obtain their content and avoid firewalls.

When I wrote about this topic in 2014, we were mostly concerned with how the web search engines would scrape your pages to index and categorize and rank things. I even did a video screencast review of a product called ScrapeDefender, which was clearly ahead of its time, so ahead that it was eventually discontinued.

That seems so quaint now that we have the gigantic AI-based hoovering, which is just the latest example of Scraping Gone Bad. To help out, a number of security vendors have arisen to offer protection against bad scraping, including Imperva’s Bot Protection and CloudFlare’s Bot Management. After I wrote this blog, CloudFlare began blocking all AI-based scraping by default.

The problem with AI-fueled scraping and training bots is several fold. First, they operate at tremendous scale, something on the order of a well-crafted DDoS attack. This means that you may need to up your protection service to handle this level of traffic.

Second, these scraping bots don’t all operate the same way. That means defenders have to develop different mechanisms to try to stop them. Some site operators may be using something more homegrown, such as custom firewall rules that screen on geolocations or IP addresses. That could be trouble, especially if you don’t set your rules properly and then create all sorts of false positive headaches when your protective measures block out the intended human visitors. To add insult to injury, most bots ignore the dictates in a robots.txt file, a common preventative measure.
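To make the false positive problem concrete, here is a minimal sketch of a homegrown IP block rule in Python. The networks and visitor addresses are hypothetical (they come from the ranges reserved for documentation), and a real firewall would express this in its own rule language rather than in application code.

```python
import ipaddress

# Hypothetical block list, using address ranges reserved for documentation.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_blocked(client_ip, networks=BLOCKED_NETWORKS):
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)

# A narrow rule catches the (hypothetical) scraper and spares the human visitor.
scraper_ip = "198.51.100.23"
human_ip = "198.51.7.20"
narrow_rule = [ipaddress.ip_network("198.51.100.0/24")]

# A rule that is too broad blocks both: the false positive described above.
broad_rule = [ipaddress.ip_network("198.51.0.0/16")]

print(is_blocked(scraper_ip, narrow_rule), is_blocked(human_ip, narrow_rule))
print(is_blocked(scraper_ip, broad_rule), is_blocked(human_ip, broad_rule))
```

The only difference between the two rules is the prefix length, which is exactly the kind of small configuration slip that ends up locking out real readers.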

Third, even if these bots are scraping your site at lower traffic levels, they may be harder to detect if your site is still delivering pages to its visitors. This is because some major analytics platforms screen bot traffic out of their counting algorithms by default, making analysis and accurate reporting of human visitors difficult. In some cases the bot traffic blends in with the background noise of the human-created internet.

Next, there is the woeful mess of legal action to try to stop the bots. Few jurisdictions have put anything in place to go after the miscreants, even when you can identify who the source is and whether they are governed by law.

Finally, there could be an added cost to using some of these tools. For example, AWS has its own Web App Firewall that can stop bot traffic, but they charge to implement this service. That means you are paying Amazon twice: once to receive the bot traffic, and again to stop it. And the scraping can hike up your hosting fees, because often the bots are looking at less-visited pages, pages that can’t be served from cache, or because the bot traffic increases usage costs from the hosting provider. Wikipedia found that 65% of its most expensive traffic was coming from bots scraping less-visited pages.

These issues and others were part of a new report that highlights just how hard it is to defend against web scraping at this enormous scale for a very particular audience: the folks who maintain the digital collections of various online museums, libraries, archives, and galleries. This group has worked hard to expand their audience across the world, and has often done so with limited resources, building their sites on what the report calls idiosyncratic and technical architectures. The report surveyed 43 institutions, describes the extent of the problem, and outlines some of the countermeasures taken (most of which aren’t effective).

The report poses this question: “It is in the long-term interest of the entities swarming them with bots to find a sustainable way to access the data they are so hungry for. Will that long-term interest spur action before the online collections collapse under the weight of increased traffic?” This is somewhat ironic: popularity has its price.

CSOonline: CNAPP buyer’s guide: Top cloud-native app protection platforms compared

It is time to re-examine my review of cloud native protection products, commonly known as CNAPP. The category has expanded to include more devsecops coverage, such as API and supply chain security, and more posture management tools for tracking data and SaaS apps.

The category is also under scrutiny because the CNAPP vendor landscape has shifted, most notably around Wiz. It was recently purchased by Google, which will maintain it as a separate division. Check Point Software has formed a strategic partnership with Wiz, has discontinued selling its own CloudGuard CNAPP, and will migrate its customers to Wiz. Lacework has been purchased by Fortinet and is now called Lacework Fortinet FortiCNAPP. Palo Alto Networks has rebranded and reconstituted its CNAPP offering as part of its Cortex Cloud product line.

My review for CSOonline has been updated to include 11 CNAPP vendors. 

Can Movable Type become a useful AI writer’s tool?

Once upon a time, when blogs were just beginning to become A Thing, the company to watch was Six Apart. They have blogging software called Movable Type. Then the world shifted to WordPress, and soon there were other blogging platforms that turned Movable Type into the Asa Hutchinson of that particular market. (What? They are still around? Yes, and they account for about one percent of all blogs.)

Well, Asa no more, because the company has fully embraced AI in a way that even Sports Illustrated (they recently fired their human writers) would envy. If you have never written a book, you can have a ready-made custom outline in a few minutes. All it takes is a prompt and a click. You don’t even have to have a fully-formed idea, understand the nature of research (either pre- or post-internet), or even know how to write word one. (There are other examples on their website if you want to check them out.)

MovableType’s AI creates “10 chapters spanning 150+ pages, and a whopping 35k+ words” (or so they say) of… basically gibberish. They of course characterize it somewhat differently, saying its AI output is “highly specific & well researched content.” It isn’t: there are no citations or links to the content. The output looks like a solid book-like product with chapters and sub-heads but is mostly vacuous drivel. The company claims it comes tuned to match your writing style, but again, I couldn’t find any evidence of that. And while “each chapter opens with a story designed to keep your readers engaged,” my interest waned after page 15 or so.

Perhaps this will appeal to some of you, especially those of you who haven’t yet written your own roman à clef, or who are looking to turn your online bon mots into the next blockbuster book. But I don’t think so. Writing a book is hard work, and while it is not growing crops or working in a factory, you do have to know what you are doing. The labor involved helps you create a better book, and the process of editing your own work is a learned skill. I don’t think AI can provide any shortcuts, other than to produce something subpar.

I have written three books the old-fashioned way: by typing every word into Word. Two of them got published; one got shelved as the market for OS/2 moved into the cellar from the time of the book proposal. I got tired of rewriting it (several times!) for the next big movie moment of IBM’s beleaguered OS that never happened. The two published books never made much money for anyone. But I did learn how to write a non-fiction book, and more importantly, write an outline that was more of a roadmap and a strategy and structure document. This is not something that you can train AI to do, at least not yet.

When I read a book, I cherish the virtual bond between me and the author, whether I read my go-to mystery fiction or a how-to business epic. I want to bathe in the afterglow of what the author is telling me, through characters, plot points, anecdotes, and stories. That is inherently human, and something that the current AI models can’t (yet) do. While MovableType’s AI is an interesting experiment, I think it is a misplaced one.

CNN Underscored: Best mobile payment apps reviewed

Mobile payment apps can be a convenient way to send and receive money using your smartphone or smartwatch. Paying for items this way has never been easier, thanks to the availability of numerous mobile payment apps, better payment terminal infrastructure, and wider support for Bluetooth/near-field communication (NFC) contactless credit cards by American issuers. The coronavirus pandemic has also helped to make contactless “everything” more compelling. I tested out five different mobile payment apps: Apple Pay, Google Pay, Samsung Pay, Venmo (by PayPal) and Cash App (by Block, formerly Square) recently, and wrote my review for CNN/Underscored here.

CNN Underscored: Best cloud personal storage apps

It used to be that 1 TB of storage was a lot, but now this amount of storage is quite common to find on even the least expensive laptops. Over the years, a number of cloud-based storage vendors have begun to support the TB era and now many of them offer monthly storage plans for a reasonable price. We tested five different cloud-based storage apps—Apple iCloud+, Box, Dropbox, Google One, and Microsoft OneDrive—to see which one is the best cloud-based storage app for you. OneDrive comes out on top and it was easier to install on Macs than on some of our Windows PCs that had additional browser-based security that blocked the desktop client downloads.

You can read my full review here.

CSOonline: How to choose the best VPN for security and privacy

Enterprise choices for virtual private networks (VPNs) used to be so simple. You had to choose between two protocols and a small number of suppliers. Those days are gone. Thanks to the pandemic, we have more remote workers than ever, and they need more sophisticated protection. And as the war in Ukraine continues, more people are turning to VPNs to get around blocks imposed by Russia and other authoritarian governments.

A VPN is still useful and perhaps essential to a modern mostly remote workplace. In this post for CSO, I describe these scenarios, what security researchers have found about how VPNs leak data or have other privacy issues, and what you should look for if you intend to deploy them across your enterprise.

CNN: The best VPNs for 2022

CNN had me review a bunch of VPN services for their Underscored site. I looked at 11 different products. I don’t have to tell you why you should use a VPN. But no product can 100% handle the trade-off among three parameters: anonymity, or the ability to move online without anyone knowing who you are; privacy, or the ability to keep your own data to yourself; and security, or to prevent your computers and phones and other gear from being compromised by a criminal. You can’t do all three completely well unless you go back to pen and paper and the Pony Express. Using a VPN will help with all three aspects, and some are better than others at balancing all three.

My two favorites were Mullvad.net and IVPN.net. Both use a novel idea to ensure that they don’t know anything about you — when you download their software, you are assigned a random string of characters that you use to identify yourself. No email necessary. If you don’t want to use your credit card, you can pay via alt-coins too. Consider this a “single-factor” authentication. That means no password is required once you have entered your code, it is unlikely that anyone can guess this code or find it on the dark web (unless you reuse it, which you shouldn’t), and there is little chance anyone could connect it back to you even if they did manage to get a hold of the code in a breach.
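As a rough illustration of why such a code is hard to guess, here is a sketch of how a random account code could be generated. The 16-digit numeric format and the function name are my assumptions for the example, not a description of either vendor's actual implementation.

```python
import secrets
import string

def new_account_code(length=16):
    """Build a random numeric code from a cryptographically secure source.

    The 16-digit length is an assumption made for this illustration.
    """
    return "".join(secrets.choice(string.digits) for _ in range(length))

code = new_account_code()
print(code)

# With 16 random digits there are 10**16 possible codes to guess from.
print(10 ** 16)
```

Because the code is the only credential, the whole scheme depends on it being long and truly random, which is why the sketch draws from the secrets module instead of the ordinary random module.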

Neither vendor has the largest server network (that title is shared by Hotspot Shield, Private Internet Access, ExpressVPN and CyberGhost). But each of these is owned by a corporate entity that plays fast and loose with your private data (Aura and Kape Technologies). If you want to spend more time understanding the privacy issues, check out Yael Grauer’s excellent analysis for Consumer Reports Digital Lab here.

Not on my recommended list is the VPN that I have been using for the past several years — ProtonVPN (shown above). I am of two minds here. On the plus side, I have a fond spot in my nerd heart for Proton, the Swiss company that was an early proponent of encrypted email. But the VPN product is slower, more expensive, harder to use and more of an “OG” VPN that requires emails and credit cards to subscribe. Yael’s report also mentions some privacy difficulties with the service, as well as those well-advertised services mentioned above that have leaked data or aren’t as transparent as they claim to be.

If you leave home, you need to run some kind of VPN. Period.

CNN Underscored: Review of the best USB-C charging blocks

With USB-C finally more-or-less standard across phones, tablets and laptops, and fewer and fewer manufacturers including chargers in the box with their products, a myriad of charging blocks have become available that promise to get your batteries topped up as quickly as possible.

To find the best USB-C charger for your needs, we tested 15 models from respected manufacturers, whether you need to charge a phone, a laptop, or a bagful of accessories. My top pick was the PowerPort Atom III Slim: it has a single USB-C port and is rated at 45W (there are older versions still on the market that are rated at 30W, so make sure you are getting the higher capacity unit). We liked the smaller-footprint design, which combines a slimmer unit (5/8” thick) with a folding power prong. These make fitting it behind furniture (or carrying it in your travel bag) easier.

You can read my review of these chargers here for CNN’s Underscored site.

CSOonline: 9 cloud and on-premises email security suites compared

Email remains the soft underbelly of enterprise security because it is the most tempting target for hackers. They just need one victim to succumb to a phishing lure to enter your network. Phishing (in all its forms) is just one of many attacks that can leverage a poorly protected email infrastructure. Others include account takeovers (due to reused passwords), business email compromises, payment fraud, specialized mobile malware, and spam messages that contain hidden malware or poisoned web links. That places a heavy burden on any email security solution.

I have been testing and writing about these products for decades and in this roundup I touch on some of the latest integrations and innovations with nine security suites:

  • Abnormal Security’s Integrated Cloud Email Security
  • Area 1’s Horizon
  • Barracuda Email Protection
  • Cisco Secure Email
  • FireEye Email Security
  • Voltage SecureMail
  • Mimecast Email Security
  • Trustifi
  • Zix Secure Cloud Email Security Suite

As seems to be the usual operating procedure, figuring out the pricing for the numerous configurations can be vexing, with one vendor (FireEye) not providing pricing, and several other vendors declining to participate entirely.

You can read my full roundup for CSOonline here.

CSOonline: Homomorphic encryption tools find their niche

Organizations are starting to take an interest in homomorphic encryption, which allows computation to be performed directly on encrypted data without requiring access to a secret key. While the technology isn’t new (it has been around for more than a decade), many of its implementations are, and most of the vendors are either startups or have only had products sold within the past few years. While it’s difficult to obtain precise pricing, most of these tools aren’t going to be cheap: Expect to spend at least six figures and sign multi-year contracts to get started.
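To show what computing directly on encrypted data means, here is a toy sketch of the Paillier cryptosystem, a classic partially homomorphic scheme that supports addition only. This is my own illustration and not what any of the reviewed vendors ship: the primes are tiny and chosen for readability, so it offers no real security, and the function names are hypothetical.

```python
import math
import secrets

# Toy key material: tiny primes for readability, NOT for security.
P, Q = 1789, 1861
N = P * Q                 # public modulus
N_SQ = N * N
G = N + 1                 # standard simple choice of generator
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(p-1, q-1)
MU = pow(LAM, -1, N)      # private: modular inverse (Python 3.8+)

def encrypt(m):
    """Encrypt an integer 0 <= m < N using only the public values N and G."""
    while True:
        r = secrets.randbelow(N - 1) + 1   # random value in 1..N-1
        if math.gcd(r, N) == 1:
            break
    return (pow(G, m, N_SQ) * pow(r, N, N_SQ)) % N_SQ

def decrypt(c):
    """Recover the plaintext using the private values LAM and MU."""
    x = pow(c, LAM, N_SQ)
    return ((x - 1) // N) * MU % N

def add_encrypted(c1, c2):
    """Add two plaintexts by multiplying their ciphertexts; no key needed."""
    return (c1 * c2) % N_SQ

# The party doing the arithmetic never sees 20, 22, or the private key.
total = add_encrypted(encrypt(20), encrypt(22))
print(decrypt(total))  # prints 42
```

The point to notice is that add_encrypted touches only ciphertexts and the public modulus. Fully homomorphic schemes extend the same idea to arbitrary computation, which is a large part of why the commercial tools are expensive.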

I review the early products in this market for CSOonline, describe some of the typical use cases, and provide some suggestions on how to evaluate them for enterprise uses.