Should I block AI training bots?

Probably not, unless you fall into the carve-out cases (paywall, internal docs, member-only, pre-publication, legally privileged). Blocking training bots like `GPTBot`, `ClaudeBot`, `Google-Extended`, and `Applebot-Extended` keeps your content out of future model training, but does nothing about content already in model weights, doesn't stop AI-powered search citation, and costs you the visibility most sites are actually trying to gain. If your concern is content licensing or revenue protection, `robots.txt` is the wrong tool; auth, IP blocking, or a WAF is.

Will blocking `GPTBot` remove my content from ChatGPT?

No. Past training is a one-way door, so anything already in OpenAI's models stays there. ChatGPT's search layer also reads Bing search results, so if Bing can crawl your site (default unless you've blocked Bingbot), your content can still appear in ChatGPT answers. Blocking `GPTBot` only affects future training crawls.

Does `robots.txt` actually stop AI from scraping?

Only for compliant bots. `robots.txt` is a search signal, not access control. Compliant crawlers (OpenAI, Anthropic, Google, Apple, Meta, Common Crawl) respect `Disallow:` directives. Non-compliant scrapers do not. The HTTP server still serves the page to anyone with the URL, including the scraper that just ignored your `Disallow:`.

Should I block `Bytespider` (ByteDance)?

You can list it, but ByteDance has not documented compliance behavior. Listing it is honor-system enforcement on a vendor whose honor is undocumented. If blocking Bytespider matters to you, server-side WAF or IP-based rules are the only enforceable path.

What about `llms.txt`?

It's an emerging adjacent signal, not yet standardized or widely respected. A handful of sites have started publishing `llms.txt` as an LLM-readable companion to `robots.txt`. None of the 30 sites in this survey had one. Worth tracking, not worth deploying as protection today.

Why do some people still recommend blocking AI crawlers?

The steelman is real: AI crawlers can be a meaningful bandwidth burden on small or static sites, compliant blocking does keep your content out of *future* model training (the past is gone either way), and publishers with paid licensing leverage have a commercial reason to disallow free crawls. Outside those cases, the cost of blocking (lost AI Overview citation surface, lost ChatGPT search visibility, lost live-retrieval surface from `ChatGPT-User` and `Claude-User`) outweighs the theoretical upside on visibility you wanted in the first place.

What about AI Labyrinths and tarpits?

A third option that's neither block nor allow: feed misbehaving crawlers fake or generated content to waste their resources instead of refusing them at the door. Cloudflare's AI Labyrinth is the most prominent productized example. Out of scope for this post. Worth knowing it exists if you're a publisher with serious bandwidth pain from non-compliant scrapers, not relevant for most sites where the question is whether to block compliant ones at all.

How do I tell if AI crawlers are actually hitting my site?

Server logs are the source of truth. Filter by the User-Agent strings of the bots in the section above (`GPTBot`, `ClaudeBot`, `PerplexityBot`, `Google-Extended`, `Applebot-Extended`, `meta-externalagent`, etc.) and look at request volume by bot per day. Most hosting platforms expose raw logs; tools like Screaming Frog's log file analyzer, GoAccess, or your CDN's analytics surface the same data faster. Whether you're being crawled is a separate question from whether you should block it. This post is about the second question.

Should I Block AI Bots? (Probably Not)

Technically you can block compliant AI bots through robots.txt, and you can't block the ones that ignore it. Strategically, you probably shouldn't. I audited 30 sites (methodology): every major news publisher, every major AI lab, the SEO industry's biggest names, the dev tools and reference sites you'd expect, and a couple wildcards. The pattern split sharper than I expected. Two groups tried to block: news publishers (8 of 8) and Reddit (1 of 1, for licensing reasons their public policy lays out). Every other category (AI labs, SEO authorities, dev tools, brands, reference sites, independents) allows everything.

The rest of this post walks through what robots.txt actually is, what it can and can't do, what those 30 configurations reveal, and what to actually deploy on yours.

What does `robots.txt` actually do?

robots.txt is a voluntary signal to crawlers about which parts of your site they should and shouldn't fetch. It's specified in RFC 9309, which formalized the 1994 Robots Exclusion Protocol. The spec is short. The behavior is simple. The thing people miss is what it isn't.

It isn't access control.

When you write Disallow: /private/, three things are true:

A compliant crawler will see the directive and choose not to fetch /private/.
A non-compliant crawler will ignore the directive and fetch it anyway.
The HTTP server will serve /private/ to anyone who has the URL: humans, scrapers, and the crawler that just ignored your Disallow.

That third point is where most "I disallowed it but Google still showed it" and "I blocked GPTBot and ChatGPT still cited me" confusion comes from. robots.txt is a search signal. It tells compliant crawlers how to treat URLs in their indexes. It does nothing about access.

Two different goals get conflated here:

Keep these pages out of organic search results. Disallow: works for this. Compliant search bots respect it; that's what they were built for. This is what Cloudflare's robots.txt is doing when it disallows 25+ locale variants of /searchresults and /lp. They don't care if a human with the URL sees those pages. They just don't want them in Google's index.
Hide these pages so unauthorized people can't see them. robots.txt does nothing for this. The page is still served to anyone with the URL. If you actually need this: authentication, IP blocking, or a web application firewall. Not robots.txt.

The rest of this post is about the first goal, with one section near the end on the second.

How does `robots.txt` work (and what can't it do)?

The technical answer to can you block AI crawlers is: yes, for the bots that have agreed to respect robots.txt. The honest answer is that those are the bots that need blocking least.

Here's the split. AI vendors run three kinds of crawlers: training bots that pull content for model training, search bots that index for AI-powered search, and live-retrieval bots that fetch a page when a user explicitly asks for it. Most major vendors (OpenAI, Anthropic, Google, Apple, Meta, Perplexity, Common Crawl, Amazon) document compliant user-agents that respect robots.txt, and a Disallow: / for any of them keeps the bot out by stated policy. Full list with vendor docs in Methodology. That's the can.

What you can't reach through robots.txt:

Bytespider (ByteDance) doesn't document compliance behavior, so listing it is hopeful, not enforceable.
PerplexityBot is honor-system enforcement on a vendor whose honor was already in question; Cloudflare documented stealth crawlers from Perplexity in August 2025.
Past training and indirect routes: anything already in model weights is a one-way door, and ChatGPT can still cite you via Bing search results regardless of how you set GPTBot or OAI-SearchBot.
Live-retrieval requests: blocking ChatGPT-User and Claude-User kills the live citation surface that's the entire reason to allow AI bots in the first place.
Archive.org caches: the Internet Archive doesn't respect robots.txt retroactively, and downstream model trainers consume archive.org content.

Each site got a two-layer audit: declared robots.txt plus actual server response, since the two often diverge. Full methodology at the bottom of the post.

Are other sites blocking AI crawlers?

I audited 30 sites: every major news publisher (WSJ, Bloomberg, Reuters, NYT, FT, BBC, Washington Post, Guardian), every major AI lab (Anthropic, Cloudflare, OpenAI, Perplexity, Mistral, Cohere), the SEO industry's biggest names (Moz, Ahrefs, Search Engine Journal, Search Engine Land, Yoast), the dev tools and reference sites you'd expect (Stripe, Vercel, GitHub, Wikipedia, Hugging Face), two major brands (Apple, Microsoft), Stack Overflow, Reddit, a thoughtful independent (Simon Willison), and my own site.

Two groups tried to block AI: news publishers (8 of 8) and Reddit (1 of 1, for licensing reasons their public policy lays out). Every other category (AI labs, SEO authorities, dev tools, brands, reference sites, independents) allows everything. If you're not a publisher and not Reddit, the field has already answered this question.

Category	Sites	Blocking AI?
News publishers	WSJ, Bloomberg, Reuters, NYT, FT, BBC, Washington Post, Guardian	Yes, 8 of 8
Reddit	reddit.com	Yes (licensing play)
AI labs	Anthropic, Cloudflare, OpenAI, Perplexity, Mistral, Cohere	No
SEO authorities	Moz, Ahrefs, Search Engine Journal, Search Engine Land, Yoast	No
Dev tool / SaaS	Stripe, Vercel, GitHub	No
Major brand	Apple, Microsoft	No
Reference / community	Wikipedia, Stack Overflow	No (Stack Overflow has no `robots.txt` at all)
AI infrastructure	Hugging Face	No
Independent	Simon Willison	No (explicit `ChatGPT-User` opt-in)
njseo	author	No

The publisher pattern is striking when you read the files back-to-back. They look like minor variations of the same document: same disallowed paths, same bot lists, same legacy bot names from the early AI-crawling debate that have been propagating across publisher templates for years. Whatever they're doing, they're mostly doing it in unison and copying from each other. None of them got there independently.

Reddit is the only site in the survey where the block is unambiguously deliberate. One-line Disallow: / for everyone, with a comment block pointing to their public content policy. The intent is commercial: they want OpenAI and Google to license access through their existing paid arrangements, not crawl free.

The remaining 21 sites have no AI-specific blocks worth flagging. The two clearest stated positions in this group come from people who actually thought about it: Cloudflare publishes an explicit Content-Signal: ai-train=yes, search=yes, ai-input=yes directive in their robots.txt (the company that sells AI bot blocking opts in on their own marketing site), and Yoast wrote a philosophy post (Nov 2019, predating the AI debate) arguing for minimal robots.txt restrictions because brute-force blocking causes more problems than it solves.

The point isn't that any of these sites is right or wrong. The point is that the field has clustered into two answers (block if you're a publisher with a licensing strategy, allow otherwise), and the rest of this post is about which cluster you're in and why the technical argument matters more than the social-proof argument either way.

What should my `robots.txt` look like?

If you've followed the argument, the implementation is short.

Default to allowing AI crawlers. Then disallow the categories of URL that have no business being in a search index, for reasons that have nothing to do with AI.

The "should disallow" categories, drawn from the survey plus standard SEO hygiene:

Paid landing pages. Cloudflare's robots.txt disallows /lp and 25 locale variants. Paid traffic destinations don't belong in organic search. They create thin pages that compete with your real content for ranking.
On-site search result pages. ?s=*, /search/, /searchresults. Infinite URL space. No unique content. Standard SEO move pre-dating AI by 15 years.
Locale subdomain mirrors. Only relevant if you have them. Cloudflare blocks dozens of <locale>.www.cloudflare.com/ paths to prevent duplicate content across alternate-locale subdomains.
Operational paths. /api/, /_next/data/ for Next.js, /wp-admin/admin-ajax.php for WordPress, /cdn-cgi/bm/cv/ and /cdn-cgi/challenge-platform/ for Cloudflare-protected sites. Anything that's infrastructure, not content.
URL parameter explosions. /*?s=*, /*?categories=*, sort and filter params on faceted listings. Each parameter combination is a separate URL; together they create infinite duplicate content.

These have always been the right disallows. They aren't AI policy. They're crawl-budget hygiene that compliant search bots, including AI search bots, have always benefited from.

For njseo, the file looks like this:

# njseo's AI posture: explicit opt-in to AI training, search, and live retrieval.

User-agent: *
Disallow: /api/
Disallow: /_next/data/
Allow: /_next/static/
Allow: /_next/image

Sitemap: https://newjerseyseofirm.com/sitemap.xml

That's it. No per-bot AI rules. No Disallow: / aimed at GPTBot or ClaudeBot. Just the operational hygiene any Next.js site needs and a sitemap declaration.

If you're behind Cloudflare, add Disallow: /cdn-cgi/. If you have on-site search, add Disallow: /?s=* (or whatever your search URL pattern is). If you have an /lp/ directory for paid campaigns, add it.

For anything I haven't named, the question is the same: do I want this URL appearing in any search index? If no, disallow. If yes, leave it open.

The remaining question is when blocking AI specifically is the right call. That's the next section.

When is blocking AI warranted?

There are real cases where you want to keep specific content out of AI training and citation. Five categories I'd take seriously:

Hard-paywall content where the business model depends on subscribers paying for access
Internal documentation, staging environments, beta tooling
Member-only or paid-community content
Pre-publication drafts and embargoed material
Legally privileged content: HIPAA, attorney-client, contractual confidentiality

For any of these, robots.txt is the wrong tool. The reason is the central distinction this post opened with: robots.txt is a search signal, not access control. A Disallow: will keep your medical-records page out of Google's index. It will also serve that page to anyone with the URL, including the non-compliant scraper that's about to ignore your Disallow: anyway.

If you actually need to keep content out of AI training and citation:

Authentication. Login wall. The page is not served to anyone without credentials.
IP blocking. Network-level restriction for staging environments and internal tooling.
WAF or bot management. Cloudflare and others sell selective bot-aware blocking that operates on the actual HTTP request, not on a voluntary signal in a text file.

These are the tools that match the goal. robots.txt was never built for this.

The reason "Probably Not" is in the title of this post and not "No" is exactly this carve-out. There are sites where the right answer is to keep AI out, and there are SEO clients I've worked with where the right answer included exactly this kind of separation. Just not by way of robots.txt.

FAQ

Probably not. I audited 30 sites (every major news publisher, every major AI lab, the SEO industry's biggest names), and the pattern is clear: only news publishers with paid licensing strategies (8 of 8) and Reddit (for the same reason) blocked AI. Every AI lab, every dev tool, every reference site, every SEO authority allowed everything. If you're not a publisher and you're not Reddit, the field has already answered this question.

Can you block AI crawlers? Yes, the compliant ones. Should you? Probably not, unless you fall into the carve-out cases above, and even then robots.txt is the wrong tool for them.

Twenty of the twenty-two non-publisher sites I audited landed at the same answer with no caveats. So did all six AI labs whose bots you'd be blocking. The only category that tried to block was news publishers, and they're mostly doing it from a broken template.

Once robots.txt is settled, the next question is whether you've made it easy for AI to actually use what you've allowed. PDFs that aren't crawlable, content that only renders in JavaScript, schema that says nothing about who you are or what you do. Those are the subtler ways sites end up invisible to AI even with everything wide open. That's a separate conversation about AI readiness.

If you'd rather not work through any of this on your own, book a free 30-minute consultation and we'll walk through your robots.txt and the rest of your AI surface together.

Methodology

Eighteen years of audits taught me that what a site's robots.txt declares and what it actually serves often diverge. So this survey is a two-layer pass over 30 sites, audited in May 2026:

Layer 1: declared robots.txt. What each site publishes at /robots.txt.
Layer 2: actual server behavior. What the WAF or firewall above the file actually enforces.

Most studies of AI-crawler posture measure scale (Cloudflare's network-wide bot reports, Originality.ai's live dashboard tracking the top 1,000 sites, the Data Provenance Initiative's Consent in Crisis across 14,000 domains) and answer how many sites are blocking. This post asks a different question: which categories of site have thought through their posture, and what should yours look like. The 30 sites are a deliberate spread across every category that has skin in this game (publishers, AI labs, SEO authorities, dev tools, brands, reference sites, independents) so the cluster patterns can be read directly.

Compliant AI crawlers (May 2026). The major vendors document user-agents that respect robots.txt. A Disallow: / for any of these keeps them out, by stated policy.

Vendor	Training	Search	Live retrieval
OpenAI	`GPTBot`	`OAI-SearchBot`	`ChatGPT-User`*
Anthropic	`ClaudeBot`	`Claude-SearchBot`	`Claude-User`
Google	`Google-Extended`	(uses Googlebot)	—
Apple	`Applebot-Extended`	—	—
Meta	`meta-externalagent`	—	—
Perplexity	—	`PerplexityBot`	—
Common Crawl	`CCBot`	—	—
Amazon	`Amazonbot`	—	—

*OpenAI notes ChatGPT-User is user-initiated and robots.txt rules may not apply.

Vendor docs verified May 2026: OpenAI, Anthropic.

Limitations. This is a single-day snapshot, not a longitudinal study; some sites (Cloudflare, especially) update robots.txt and bot-management rules frequently. Domains are US/EU-centric and English-language. Enterprise SaaS, e-commerce, and non-English publishers aren't represented. The two-layer pass detects WAF-vs-robots.txt divergence but doesn't measure crawler compliance directly. Bots that ignore robots.txt would only be caught by log analysis, which a third-party survey can't do.

Corrections. Spot an error in the survey or a misread of any site's posture? Send a note and I'll re-audit and update.

ai-botsai-crawlersrobots-txtai-searchaeogeofield-notes

WRITTEN BY

Eric Murtha

SEO & Answer Engine Optimization Specialist

I'm an independent SEO and answer engine optimization specialist based in Morris County. I help small businesses rank in Google, and now in ChatGPT, Perplexity, and Google's AI overviews. No agency overhead. No junior account managers. Just focused, expert work.