Not all AI bots do the same job. Some feed training data, some feed AI answers that cite you, some fetch pages on a user's behalf. Block the wrong one and you disappear from the answers.

Should you block AI crawlers? For most brands, the answer is no, at least not the bots that put you in AI answers. Blocking OAI-SearchBot or PerplexityBot removes you from ChatGPT search results and Perplexity citations; whether to block training crawlers like GPTBot is a separate decision with much lower stakes. The mistake almost every "block AI bots" tutorial makes is treating them as one category.

What does each AI bot actually do?

There are three distinct jobs, and the trade-offs differ for each:

Training crawlers collect content for future model training. Blocking them has no effect on today's AI answers.
Search/citation crawlers build the indexes AI assistants search when answering live questions — with links back to you. Blocking them removes you from those answers.
On-demand fetchers retrieve a specific page because a user asked an assistant to look at it. Blocking them breaks the experience for a human who is actively interested in you.

Bot	Operator	Job	If you block it
GPTBot	OpenAI	Training	Content excluded from future OpenAI training
OAI-SearchBot	OpenAI	Search index	You drop out of ChatGPT search results and citations
ChatGPT-User	OpenAI	On-demand fetch	ChatGPT can't open your pages when a user asks
ClaudeBot	Anthropic	Training-oriented crawl	Content excluded from Anthropic crawling
Claude-User / Claude-SearchBot	Anthropic	On-demand fetch / search	Claude can't fetch or surface your pages
PerplexityBot	Perplexity	Search index	You disappear from Perplexity answers
Perplexity-User	Perplexity	On-demand fetch	Perplexity can't open your pages for a user
Google-Extended	Google	Training control token	Content excluded from Gemini training. Does not remove you from Google Search or AI Overviews — those follow normal Googlebot indexing
CCBot	Common Crawl	Open dataset	Content excluded from a corpus many labs train on
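As a rough illustration, the table above can be turned into a lookup for tagging requests in your access logs by job. This is a minimal sketch: `BOT_JOBS` and `classify_user_agent` are hypothetical names, and the match is a plain case-insensitive substring check on the User-Agent header, so it labels what a request claims to be, not what it is.

```python
from typing import Optional, Tuple

# Hypothetical lookup transcribed from the table above: token -> (operator, job).
# Google-Extended is left out on purpose: it is a robots.txt control token,
# not a crawler that sends its own User-Agent string.
BOT_JOBS = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "search index"),
    "ChatGPT-User": ("OpenAI", "on-demand fetch"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-User": ("Anthropic", "on-demand fetch"),
    "Claude-SearchBot": ("Anthropic", "search index"),
    "PerplexityBot": ("Perplexity", "search index"),
    "Perplexity-User": ("Perplexity", "on-demand fetch"),
    "CCBot": ("Common Crawl", "open dataset"),
}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str, str]]:
    """Return (token, operator, job) if the header names a known AI bot, else None."""
    ua = user_agent.lower()
    for token, (operator, job) in BOT_JOBS.items():
        if token.lower() in ua:
            return token, operator, job
    return None
```

Because the header is trivially spoofable, a label from this function is only a starting point for deciding how to treat a request.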

What is the actual trade-off?

Blocking training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) is a values-and-licensing decision. The cost is long-term: future models may know less about you, and models increasingly answer from what they already know. The benefit is control over how your content is used.

Blocking search and on-demand bots (OAI-SearchBot, PerplexityBot, ChatGPT-User and their peers) is a visibility decision, and the cost is immediate. These are the crawlers behind generative engine optimization: they fetch your pages so assistants can cite and link them in live answers. Block them and buyers asking "best X for Y" simply get answers built from your competitors' pages.

What should your robots.txt look like?

Posture 1 — open (recommended default). Let everything in; steer with a sitemap and an llms.txt file:

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

Posture 2 — selective: no training, yes citations. The common middle ground for publishers:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /
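One way to sanity-check a selective file like this before deploying it is to parse it with Python's standard-library `urllib.robotparser` and ask which bots it admits. The sketch below assumes the Posture 2 rules exactly as written above; `access_report` is a hypothetical helper, and Python's parser only approximates how each crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The Posture 2 rules from above, inlined so the check needs no network access.
POSTURE_2 = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def access_report(robots_txt, bots, url="https://example.com/"):
    """Map each bot token to True/False: may it fetch `url` under these rules?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


if __name__ == "__main__":
    bots = ["GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot",
            "PerplexityBot", "ChatGPT-User", "Claude-User"]
    for bot, allowed in access_report(POSTURE_2, bots).items():
        print(f"{bot:16} {'allowed' if allowed else 'blocked'}")
```

Bots that the file does not name, such as Claude-User here, come back as allowed because there is no `User-agent: *` group to restrict them. The same check run against your live file is a cheap way to catch an accidental block.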

Posture 3 — closed. Add Disallow: / blocks for the search crawlers and on-demand fetchers (OAI-SearchBot, PerplexityBot, ChatGPT-User and their peers) too. Only defensible for paywalled or licensed-content businesses that have decided AI answers are a channel they can live without, and even then, expect zero AI-answer presence.

How do you know a bot is real?

robots.txt is a request, not a firewall, and plenty of scrapers spoof AI user agents to borrow their reputation. The standard check is forward-confirmed reverse DNS (FCrDNS): reverse-resolve the requesting IP to a hostname, confirm the hostname belongs to the operator's published domain, then forward-resolve that hostname and confirm it maps back to the same IP. OpenAI, Anthropic, Google, and Perplexity all publish their IP ranges or verification domains for exactly this purpose; where an operator publishes only IP ranges, check the requesting IP against those ranges instead. Legible's truth layer runs FCrDNS verification on crawler traffic as part of its machine-access checks, so "we're being crawled by GPTBot" claims are tested rather than assumed.

If you rate-limit or block by user-agent string alone, you'll punish real assistants and let impostors through.
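The three FCrDNS steps described above can be sketched with Python's `socket` module. Treat this as a minimal illustration rather than a production verifier: `verify_fcrdns` is a hypothetical helper, the domain suffixes you pass in must come from each operator's own documentation, and the resolver functions are injectable so the logic can be exercised without live DNS.

```python
import socket


def _reverse_lookup(ip):
    """Step 1: reverse-resolve an IP address to its primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    """Step 3: forward-resolve a hostname to the set of addresses it maps to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def verify_fcrdns(ip, allowed_suffixes,
                  reverse_lookup=_reverse_lookup,
                  forward_lookup=_forward_lookup):
    """Return True only if `ip` passes forward-confirmed reverse DNS.

    `allowed_suffixes` are the operator's published verification domains
    (an assumption here: supply them from the operator's documentation).
    """
    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # Step 2: the hostname must equal, or be a subdomain of, a published domain.
    # Matching on "." + suffix rejects look-alikes such as "bots.example.com.evil.net".
    suffixes = [s.strip(".").lower() for s in allowed_suffixes]
    if not any(hostname == s or hostname.endswith("." + s) for s in suffixes):
        return False

    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False

    # Plain string comparison is enough for IPv4; for IPv6, normalize both
    # sides with the ipaddress module before comparing.
    return ip in addresses
```

In practice you would cache results per IP, since two DNS lookups per request is too slow for a hot path.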

Our recommendation

Stay open to search, citation, and on-demand fetchers — that traffic is a buyer or an answer that names you. Decide training access separately, on licensing grounds, knowing it changes nothing about this quarter's visibility. And whichever posture you choose, write it down deliberately: an accidental Disallow: / for OAI-SearchBot is the quietest way to vanish from the fastest-growing answer surface.

Want to see which crawlers can currently reach you, and what they find when they do? Run a free Legible report — machine access is one of the eight dimensions it scores.

Should you block AI crawlers? GPTBot, ClaudeBot, and PerplexityBot explained
