A new report from Cloudflare makes serious allegations about Perplexity's web crawling practices. According to the report, Perplexity's crawlers engage in stealth crawling, disguising their identities to bypass the restrictions websites set in their robots.txt files and firewalls. The behavior raises significant concerns about data scraping ethics and compliance with web standards.
The robots.txt file tells web crawlers which parts of a website they are permitted to access. Perplexity's official crawlers, PerplexityBot and Perplexity-User, are supposed to adhere to these rules. However, Cloudflare's investigation found that even when those bots were explicitly disallowed in a site's robots.txt file, Perplexity still managed to extract content from new, unindexed websites. This held true even for sites with Web Application Firewall (WAF) rules specifically configured to block web crawlers.
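For illustration, a site that wanted to block both of Perplexity's declared crawlers would publish rules like the following in its robots.txt (a sketch using the standard directive syntax; the bot names are the ones Cloudflare cites):

```
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```

Compliance with these rules is voluntary, which is the crux of Cloudflare's complaint: robots.txt only works if crawlers honestly identify themselves.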
According to Cloudflare’s analysis, when its designated bots are blocked by robots.txt, Perplexity circumvents these measures by using “a generic browser intended to impersonate Google Chrome on macOS.” Cloudflare’s testing also showed that the undeclared crawler rotates through IP addresses outside Perplexity’s official range to evade firewall rules. Beyond that, the crawler switches between multiple autonomous system numbers (ASNs), unique identifiers for groups of IP addresses managed by a single entity, and was observed doing so “across tens of thousands of domains and millions of requests per day.”
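To make the impersonation concrete, here is a minimal sketch of why a disguised User-Agent defeats name-based filtering. The header strings are illustrative examples, not captured traffic, and the filter function is a hypothetical stand-in for a robots.txt or WAF rule keyed to the bot's name:

```python
# Illustrative only: a declared crawler request vs. a disguised one,
# per Cloudflare's description. Header values are examples.

# A well-behaved crawler identifies itself in its User-Agent header:
declared = {
    "User-Agent": "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
}

# A stealth crawler instead presents a generic browser User-Agent,
# here impersonating Google Chrome on macOS:
disguised = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36"
}

# A hypothetical filter keyed to the bot's name matches the first
# request but is blind to the second:
def looks_like_perplexity(headers):
    return "PerplexityBot" in headers.get("User-Agent", "")

print(looks_like_perplexity(declared))   # True
print(looks_like_perplexity(disguised))  # False
```

This is why Cloudflare says it had to build behavioral detection rather than rely on self-reported identity.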
In response to the findings, Engadget has reached out to Perplexity seeking their perspective on Cloudflare’s claims. This article will be updated should we receive any feedback from the company.
Access to current, accurate information from across the web is essential for companies training AI models, and especially for services like Perplexity that aim to replace traditional search engines. This is not the first time Perplexity has been accused of bypassing crawling restrictions. In 2024, several websites reported that Perplexity continued to access their content despite explicit prohibitions in their robots.txt files. At the time, the company blamed third-party web crawlers it had employed. Perplexity subsequently struck partnerships with various publishers to share revenue from advertisements placed alongside their content, seemingly as a corrective measure for past infractions.
Efforts to prevent companies from scraping content from the internet will likely remain a persistent challenge, akin to a game of whack-a-mole. In the interim, Cloudflare has taken steps to exclude Perplexity’s bots from its verified bot list and has implemented mechanisms to identify and block Perplexity’s stealth crawler from gaining access to its customers’ valuable content.









