Created on 2025-08-19 12:38
Published on 2025-09-01 10:30
If you run production websites or platforms, you’re probably stuck between two loud forces. On one side, publishers and platform owners are tightening the gates with bot mitigation, CAPTCHAs, and “no-AI” robots rules to protect revenue, performance, and legal rights. On the other, researchers, journalists, and data teams argue that scraping public pages enables transparency, accessibility, and even scientific progress. In 2025 the tension spiked again after Cloudflare publicly accused Perplexity of evading robots rules with stealth crawlers, while many reputable bots and AI vendors emphasized honoring robots.txt and opt-outs. For SRE and DevOps leaders, this is not an abstract ethics squabble; it’s a live reliability, latency, and customer-experience problem with legal edges sharp enough to cut. (The Cloudflare Blog, TechRadar, Business Insider)
Publishers see automated scraping as an existential drain on infrastructure and intellectual property. Bandwidth gets eaten by non-user traffic, origin servers collapse under uncontrolled spiders, and paywalled content leaks into aggregators and AI models with no license or revenue share. The law, particularly in Europe, gives some levers: the EU Database Directive grants a sui generis right to stop “extraction or re-utilisation” of a qualitatively or quantitatively substantial part of a database, and courts have said that even when database IP doesn’t bite, site owners can still prohibit scraping via contract terms (see Ryanair v. PR Aviation). In practical terms: if you force users to accept click-through terms that ban scraping, you have real footing—especially in the EU. (Open Future, Curia, Pinsent Masons)
Publishers also argue that robots rules, bot defenses, and verified-bot allowlists are now table stakes. Cloudflare offers “Super Bot Fight Mode,” a one-click “block AI bots” setting, and Turnstile, a privacy-preserving CAPTCHA alternative that reduces friction for humans while throttling scrapers. These measures promise performance protection with less UX pain, and the verified-bot program clarifies which crawlers deserve a green light. From a reliability angle, that means fewer random spikes and more predictable capacity planning—if you tune them carefully. (Cloudflare Docs, The Cloudflare Blog, WIRED)
Finally, the legal mood music has changed around AI training. The New York Times’ lawsuit against OpenAI and Microsoft crystallized publisher concerns about mass ingestion of paywalled news. In parallel, the EU’s text-and-data-mining (TDM) regime allows opt-outs for commercial mining, and a 2024 German decision suggested research bodies like LAION can rely on a research exception while commercial miners face tighter limits. This patchwork encourages publishers to harden defenses and to declare explicit opt-outs—both technically and in terms. (McKool Smith, Enterprise Ireland, Morrison Foerster)
Opponents of strict anti-crawler measures point to the social value of scraping. Common Crawl has provided a free corpus of hundreds of billions of pages for over a decade, fueling academic research and public-interest projects. Statistical agencies (like Eurostat) rely on structured web scraping to compute official indices such as consumer prices. And data journalism—from investigations into pharma payments to public records—has historically depended on scraping when APIs don’t exist or are artificially constrained. These are not “rogue bots”; they’re cornerstone activities for transparency, reproducibility, and accessibility. (commoncrawl.org, European Commission, ProPublica)
The legal picture is also more nuanced than “don’t scrape.” In the U.S., the Ninth Circuit in hiQ v. LinkedIn held that scraping publicly available web pages likely does not violate the CFAA merely because a platform disapproves; public pages are, legally speaking, “public.” Meanwhile, Van Buren v. United States narrowed “exceeds authorized access,” and a D.C. district court in Sandvig signaled that violating terms of service alone shouldn’t trigger criminal CFAA liability for public-interest research. Together, these cases have emboldened researchers to argue that carefully conducted scraping of public data—done respectfully—serves speech and scientific interests. (cdn.ca9.uscourts.gov, Supreme Court, Electronic Frontier Foundation)
But there are guardrails. U.S. cases like Facebook v. Power Ventures found that continuing to access a site after a cease-and-desist and technical IP blocks can violate the CFAA. And in the EU, scraping can trip over the Database Directive or data-protection law (GDPR), as seen in repeated enforcement actions against Clearview AI’s massive face-image harvesting. “Public” doesn’t mean “permissionless” everywhere, and personal data adds a separate compliance layer. (cdn.ca9.uscourts.gov, European Data Protection Board)
Robots rules are vital but often misunderstood. In 2022, the Robots Exclusion Protocol became RFC 9309, which plainly states that robots.txt is not an access-control or authorization mechanism; it’s a request for crawler behavior. Well-behaved bots honor it; malicious or stealthy ones won’t. This matters to SREs because relying on robots for “security” is a brittle design; you still need authentication, rate controls, and network policies. And if you misconfigure robots, you can accidentally deindex critical pages and crater search traffic overnight. (IETF Datatracker, Search Engine Journal)
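The advisory nature of robots.txt is easy to see from tooling: a parser such as Python’s standard-library `urllib.robotparser` only reports what the policy requests, and nothing in the protocol stops a non-compliant client from fetching anyway. A minimal sketch (the bot names and paths here are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Parse a policy in-memory; parse() accepts any lines, no network fetch needed.
policy = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(policy.splitlines())

# The parser reports what the policy asks for; honoring it is the client's
# choice -- nothing here stops a stealth crawler from ignoring the answer.
print(rp.can_fetch("GPTBot", "https://example.com/article"))   # opted out
print(rp.can_fetch("OtherBot", "https://example.com/article")) # allowed
```

That asymmetry is the whole point: compliance lives in the client, which is why the paragraph above insists on authentication and rate controls for anything that actually needs protection.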
The modern twist is AI-specific crawling. Many vendors, including OpenAI, document how to signal “don’t use my content for training” via robots or meta tags and say they respect those signals. Yet allegations like the 2025 Cloudflare–Perplexity dispute show that some actors may ignore or bypass rules, rotating IPs and spoofing user agents. For SRE and security teams, that’s a cue to treat robots as an etiquette layer—not a control plane—and to instrument hard controls without punishing humans. (OpenAI Platform, The Cloudflare Blog)
Overly aggressive bot defenses can backfire. Challenging every session inflates latency, degrades accessibility, and can accidentally block “good bots” such as Googlebot, nuking discoverability and top-of-funnel acquisition. Cloudflare’s own docs and community threads acknowledge the need to allow verified bots and to tune Super Bot Fight Mode to avoid false positives. On the other hand, permissive settings invite resource exhaustion and content exfiltration. That’s a classic SRE dial: where to place the slider between resilience and openness, and how to measure the cost of each click. (Cloudflare Docs, Cloudflare Community)
CAPTCHAs, long the default, are also an accessibility and experience tax. W3C has flagged CAPTCHA inaccessibility concerns for years, and most of us have watched users rage-quit after the third “find the crosswalks” round. Privacy-preserving challenges like Cloudflare Turnstile and standards like Privacy Pass / Private Access Tokens promise fewer puzzles and better privacy while still filtering obvious automation. Translating that into site reliability: fewer human-friction incidents, lower abandonment, and a measurable drop in “challenge time” as part of your latency SLO. (W3C, WIRED, IETF Datatracker)
Pro-defense advocates say the web’s economics and legal frameworks depend on consent and control; if models and aggregators can vacuum up content without permission, creators won’t be paid and smaller sites will be DoS’ed by bots. They see Cloudflare’s AI-bot blocking and strict crawler verification as necessary infrastructure, and they point to EU law and recent AI litigation as validation. (The Cloudflare Blog, McKool Smith)
Open-access advocates counter that scraping is often the only way to study the internet itself. They emphasize the public nature of many sources, the fair-use precedent for indexing and search (think Google Books), and the societal benefits of reproducible research corpora like Common Crawl. They also warn that making the web un-scrapable breaks archiving, accessibility, and oversight—especially when APIs or bulk downloads are not provided. (copyright.gov, commoncrawl.org)
Here’s the human pattern I keep seeing: teams flip a “maximum protection” toggle after an ugly scraping incident, pat themselves on the back, then spend months chasing down false positives and “why did SEO crater?” pings. The opposite happens too: an “open by default” stance that treats robots.txt as security and only rings alarms when origin CPU flatlines. Neither posture is adult reliability. The SRE move is to treat crawler access as a first-class production workload with SLOs, runbooks, and feedback loops—not as a side quest for the SEO team.
Start by naming the outcomes that matter. If your business depends on search, “Good-bot success rate” is as real an SLO as availability. If you’re a publisher, establish a “Training Opt-Out Enforcement SLO” measuring how quickly your stack enforces new AI opt-out rules across CDNs and origins. If you run a public-interest site, define a “Researcher Access SLO” ensuring that verified academic partners can fetch at useful rates without being challenged to death. The moment you measure these, you’ll see the trade-offs in alert fatigue, latency, and conversion more clearly than any policy memo.
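A “good-bot success rate” SLO can be computed straight from access logs. A minimal sketch, assuming records already parsed into dicts (the field names `agent`, `status`, and `challenged` are illustrative, not a real log schema):

```python
from typing import Iterable

def good_bot_success_rate(records: Iterable[dict], verified_bots: set) -> float:
    """Fraction of verified-bot requests served cleanly (2xx/3xx, no challenge)."""
    total = ok = 0
    for r in records:
        if r["agent"] in verified_bots:
            total += 1
            if r["status"] < 400 and not r.get("challenged", False):
                ok += 1
    # No verified-bot traffic at all counts as meeting the SLO.
    return ok / total if total else 1.0

logs = [
    {"agent": "Googlebot", "status": 200, "challenged": False},
    {"agent": "Googlebot", "status": 403, "challenged": True},   # false positive
    {"agent": "curl/8.0",  "status": 403, "challenged": True},   # outside the SLO
]
print(good_bot_success_rate(logs, {"Googlebot", "Bingbot"}))  # 0.5
```

Once this number sits on a dashboard next to availability, a WAF change that halves it stops being invisible.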
First, publish a clear, layered access policy and make it machine-readable. Robots.txt remains the lingua franca, and since it’s formally documented as RFC 9309, good actors can integrate reliably. If you wish to opt out of AI training where supported, declare it explicitly and mirror the stance in your human-readable terms. For truly sensitive paths, don’t rely on robots—put content behind authentication or paywalls, because robots rules are not access control. Then socialize this policy with your marketing, legal, and data partnerships teams so you’re not fighting an internal civil war. (IETF Datatracker)
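In robots.txt terms, a layered policy might look like the sketch below. GPTBot and CCBot are real published crawler tokens, but which agents you opt out is your policy call, and anything genuinely sensitive still needs authentication behind this:

```
# Opt specific AI training crawlers out entirely
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: crawl, but stay out of internal paths
User-agent: *
Disallow: /internal/

Sitemap: https://example.com/sitemap.xml
```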
Second, create a two-track access path: APIs and verified scraping. Provide stable, rate-limited APIs or bulk feeds for partners and researchers, even if limited. Where APIs aren’t feasible, establish a verification process for “good bots” and research crawlers with tokens, contact emails, and published constraints. Cloudflare’s verified-bots framework is one path; if you run your own controls, maintain a living allowlist tied to telemetry. This blend preserves openness without leaving your origin at the mercy of anonymous swarms. (Cloudflare Docs)
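At its simplest, a verified-scraping tier is a token-keyed rate limit: known research tokens get a generous bucket, anonymous clients get a stingy one. A minimal token-bucket sketch (the token strings and rates are placeholders, not a real product API):

```python
import time

class TokenBucket:
    """Per-client token bucket: refills at `rate` requests/s up to `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical allowlist: a verified research token earns a higher ceiling.
buckets = {
    "research-token-123": TokenBucket(rate=10.0, burst=20),  # verified researcher
    None: TokenBucket(rate=1.0, burst=5),                    # anonymous default
}

def admit(token) -> bool:
    # Unknown tokens fall back to the shared anonymous bucket.
    return buckets.get(token, buckets[None]).allow()
```

In production this state would live in the edge or a shared store rather than process memory, but the shape is the same: identity in, rate tier out.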
Third, replace punitive CAPTCHAs with privacy-preserving challenges and Privacy Pass tokens wherever possible. Turnstile-style checks and Private Access Tokens reduce friction dramatically, especially on mobile and assistive technologies, and they align with WCAG guidance. Measure human challenge time as part of user-facing latency SLOs, and aim to keep it near zero. When challenges must appear, prioritize accessible modalities and treat challenge-failure spikes as incidents to be blamelessly debugged. (WIRED, IETF Datatracker, W3C)
Fourth, canary and observe bot rules like any high-risk change. Roll out bot-management rules behind feature flags, watch error budgets for “good-bot” SLOs and for human traffic conversion, and publish dashboards that correlate rule changes with crawl rates, indexation coverage, and organic traffic. If you see verified bots getting throttled or “Blocked by robots” spikes in Search Console, treat it as a Sev-2 with a defined rollback path. A WAF toggle that silently tanks Googlebot is not a security win; it’s a revenue degradation. (Rank Math)
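The rollback trigger described above reduces to comparing good-bot success before and after the flag flips. A deliberately small sketch with hypothetical numbers and threshold:

```python
def should_roll_back(baseline_success: float, canary_success: float,
                     max_drop: float = 0.02) -> bool:
    """Roll back the bot rule if the canary's good-bot success rate falls more
    than max_drop below the pre-change baseline (e.g. Googlebot suddenly
    hitting challenges)."""
    return (baseline_success - canary_success) > max_drop

# Before the WAF rule: 99.5% of verified-bot requests succeeded.
# After: 91% -- an 8.5-point drop, far past a 2-point budget.
assert should_roll_back(0.995, 0.910) is True
assert should_roll_back(0.995, 0.992) is False
```

Wiring this check into the deploy pipeline is what turns “why did SEO crater?” from a months-later mystery into a Sev-2 with an automatic rollback.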
Fifth, align with the law but plan for ambiguity. In the U.S., draw the line at cease-and-desist plus technical blocks—cases like Power Ventures show that evading them can cross into “without authorization.” In the EU, factor the Database Directive and TDM exceptions into your policy: research exceptions exist, commercial miners can be opted out, and GDPR applies to personal data regardless. Publish a “research access” page that sets expectations and gives legitimate researchers a way in without gamesmanship. This lowers the odds that your security team blocks a Nobel laureate by accident. (cdn.ca9.uscourts.gov, Enterprise Ireland, Open Future)
A mid-sized European publisher saw origin CPU spike nightly at 02:00. The instinct was to slam the door with “Block AI bots” and generic challenges for everyone. Instead, they took an SRE route: they carved out verified search and accessibility bots, created an email-verified researcher token that eased throttles, and shifted to Turnstile for interactive flows. They also published a clear AI-training opt-out and aligned it with Cloudflare’s bot settings. The result wasn’t perfect harmony—stealth scrapers still appear—but organic traffic recovered, newsroom tools stayed snappy, and the angry inbox cooled. What changed most was posture: they moved from whack-a-bot to operating a crawler service with SLOs.
Are we measuring “good-bot success rate” and “human challenge time” with the same seriousness as uptime, or are we flying blind on our crawler experience?
If robots.txt is not authorization, which specific controls—auth, rate limits, token-based allowlists—backstop our most sensitive content today? (IETF Datatracker)
Where do we draw the line between public-interest scraping and abuse, and how do researchers, journalists, and accessibility tools request elevated access without being treated like attackers? (European Commission)
If legal ambiguity is inevitable, what’s our declared default for AI training: licensed, opted-out, or “research-only,” and how fast can we propagate a new stance across CDN and origin? (Enterprise Ireland)
Scraping isn’t simply theft or virtue; it’s a capability. It can underwrite accountability and accessibility, or it can overwhelm origin servers and siphon value from creators. Anti-crawler defenses aren’t simply locks; they’re production features that can guard reliability and rights—or quietly choke discovery and accessibility if misapplied. The SRE posture is to make these trade-offs explicit, measurable, and reversible. Build for consent where it matters, openness where it helps, and reliability everywhere, because the web works best when it’s both usable and examinable. Your future incidents won’t ask whether you “believed” in scraping; they’ll ask how well you designed for it.
Cloudflare Blog, “Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives,” https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
TechRadar Pro, “Cloudflare says Perplexity is breaking a major online AI scraping rule,” https://www.techradar.com/pro/cloudflare-says-perplexity-is-breaking-a-major-online-ai-scraping-rule
Business Insider, “An AI data trap catches Perplexity impersonating Google,” https://www.businessinsider.com/ai-data-trap-catches-perplexity-impersonating-google-cloudflare-2025-8
IETF, RFC 9309, “Robots Exclusion Protocol,” https://datatracker.ietf.org/doc/rfc9309/
Cloudflare Docs, “Super Bot Fight Mode — Get started,” https://developers.cloudflare.com/bots/get-started/super-bot-fight-mode/
Cloudflare Blog, “Declaring your AIndependence: block AI bots with a single click,” https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/
Cloudflare Docs, “Verified bots policy,” https://developers.cloudflare.com/bots/concepts/bot/verified-bots/policy/
Cloudflare, Turnstile product page, https://www.cloudflare.com/application-services/products/turnstile/
Wired, “Cloudflare Takes a Stab at a Captcha That Doesn’t Suck,” https://www.wired.com/story/cloudflare-captcha-turnstile/
IETF, RFC 9577, “The Privacy Pass HTTP Authentication Scheme,” https://datatracker.ietf.org/doc/rfc9577/
W3C WAI, “Inaccessibility of CAPTCHA,” https://www.w3.org/WAI/intro/captcha
Eurostat, “Guidelines on web scraping for HICP,” https://ec.europa.eu/eurostat/documents/272892/12032198/Guidelines-web-scraping-HICP-11-2020.pdf
Common Crawl, About, https://commoncrawl.org/
hiQ Labs, Inc. v. LinkedIn Corp. (9th Cir. 2022), opinion, https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf
Facebook, Inc. v. Power Ventures, Inc. (9th Cir. 2016), opinion, https://cdn.ca9.uscourts.gov/datastore/opinions/2016/07/12/13-17102.pdf
Van Buren v. United States (U.S. Supreme Court 2021), opinion, https://www.supremecourt.gov/opinions/20pdf/19-783_k53l.pdf
EFF, “Federal Judge Rules It Is Not a Crime to Violate a Website’s Terms of Service,” https://www.eff.org/deeplinks/2020/04/federal-judge-rules-it-not-crime-violate-websites-terms-service
U.S. Copyright Office, Authors Guild v. Google summary, https://www.copyright.gov/fair-use/summaries/authorsguild-google-2dcir2015.pdf
EDPB, “French SA fines Clearview AI EUR 20 million,” https://www.edpb.europa.eu/news/national-news/2022/french-sa-fines-clearview-ai-eur-20-million_en
EDPB, “Facial recognition: Italian SA fines Clearview AI EUR 20 million,” https://www.edpb.europa.eu/news/national-news/2022/facial-recognition-italian-sa-fines-clearview-ai-eur-20-million_en
Court of Justice of the EU, Ryanair Ltd v PR Aviation BV (C-30/14), judgment, https://curia.europa.eu/juris/document/document.jsf?docid=161388&doclang=EN
Reed Smith, “Text and data mining in EU: a tale of two exceptions,” https://www.reedsmith.com/en/perspectives/ai-in-entertainment-and-media/2024/02/text-and-data-mining-in-eu
Morrison Foerster, “To scrape or not to scrape: first court decision on the EU copyright exception for TDM in Germany,” https://www.mofo.com/resources/insights/241004-to-scrape-or-not-to-scrape-first-court-decision
OpenAI, “Overview of OpenAI crawlers,” https://platform.openai.com/docs/bots
W3C, “Web Content Accessibility Guidelines (WCAG) 2.1,” https://www.w3.org/TR/WCAG21/
Wired, “How to Stop Your Data From Being Used to Train AI,” https://www.wired.com/story/how-to-stop-your-data-from-being-used-to-train-ai
#SRE #SiteReliability #DevOps #WebScraping #Bots #RobotsTxt #Privacy #Copyright #GDPR #TDM #Accessibility #DataJournalism #Observability #APIs #Cloudflare #CAPTCHA