Guide · Jul 5, 2026 · 4 min read

Should you block AI crawlers? GPTBot, ClaudeBot, and PerplexityBot explained

Not all AI bots do the same job. Some feed training data, some feed AI answers that cite you, some fetch pages on a user's behalf. Block the wrong one and you disappear from the answers.

For most brands, the answer is no — at least not the bots that put you in AI answers. Blocking OAI-SearchBot or PerplexityBot removes you from ChatGPT search results and Perplexity citations; whether to block training crawlers like GPTBot is a separate decision with much lower stakes. The mistake almost every "block AI bots" tutorial makes is treating them as one category.

What does each AI bot actually do?

There are three distinct jobs, and the trade-offs differ for each:

| Bot | Operator | Job | If you block it |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training | Content excluded from future OpenAI training |
| OAI-SearchBot | OpenAI | Search index | You drop out of ChatGPT search results and citations |
| ChatGPT-User | OpenAI | On-demand fetch | ChatGPT can't open your pages when a user asks |
| ClaudeBot | Anthropic | Training-oriented crawl | Content excluded from Anthropic crawling |
| Claude-User / Claude-SearchBot | Anthropic | On-demand fetch / search | Claude can't fetch or surface your pages |
| PerplexityBot | Perplexity | Search index | You disappear from Perplexity answers |
| Perplexity-User | Perplexity | On-demand fetch | Perplexity can't open your pages for a user |
| Google-Extended | Google | Training control token | Content excluded from Gemini training. Does not remove you from Google Search or AI Overviews; those follow normal Googlebot indexing |
| CCBot | Common Crawl | Open dataset | Content excluded from a corpus many labs train on |
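
If you want to see which of these jobs is actually hitting your site, one option is to bucket access-log user agents by the categories in the table above. The sketch below is a minimal illustration, not a vetted classifier: the `BOT_JOBS` mapping and the `classify_user_agent` helper are names made up for this example, CCBot is filed under "training" because the article groups it with the training bots, and the result only reflects what a request claims to be.

```python
from typing import Optional

# Job categories taken from the table above.
# Google-Extended is left out on purpose: it is a robots.txt control token,
# not a crawler with its own user-agent string, so it never appears in logs.
BOT_JOBS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "on-demand",
    "ClaudeBot": "training",
    "Claude-User": "on-demand",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "Perplexity-User": "on-demand",
    "CCBot": "training",  # open dataset; grouped with training bots here
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the claimed job for a user-agent string, or None if unknown.

    This is a case-insensitive substring match on the bot token. It says
    nothing about whether the request is genuine.
    """
    ua = user_agent.lower()
    for token, job in BOT_JOBS.items():
        if token.lower() in ua:
            return job
    return None
```

Calling `classify_user_agent` on each log line's user-agent field and counting the results gives a rough split between training, search, and on-demand traffic. Treat it as a first pass, since the user-agent string is trivially spoofed.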

What is the actual trade-off?

Blocking training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) is a values-and-licensing decision. The cost is long-term: future models may know less about you, and models increasingly answer from what they already know. The benefit is control over how your content is used.

Blocking search and on-demand bots (OAI-SearchBot, PerplexityBot, ChatGPT-User and their peers) is a visibility decision, and the cost is immediate. These are the crawlers behind generative engine optimization: they fetch your pages so assistants can cite and link them in live answers. Block them and buyers asking "best X for Y" simply get answers built from your competitors' pages.

What should your robots.txt look like?

Posture 1 — open (recommended default). Let everything in; steer with a sitemap and an llms.txt file:

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

Posture 2 — selective: no training, yes citations. The common middle ground for publishers:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /
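
Before deploying a selective file like this, it is worth checking that it does what you intend. The sketch below feeds the Posture 2 rules to Python's standard-library `urllib.robotparser` and reports which bots may fetch a page. The `check_access` helper and the `/pricing` URL are assumptions for illustration, and Python's parser is only one interpretation of robots.txt; each crawler applies its own matching rules.

```python
from typing import Dict, Iterable
from urllib.robotparser import RobotFileParser

# The Posture 2 rules from above, abbreviated to the groups being tested.
POSTURE_2 = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def check_access(robots_txt: str, bots: Iterable[str],
                 url: str = "https://example.com/pricing") -> Dict[str, bool]:
    """Map each bot token to True if the rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}
```

With these rules, `check_access(POSTURE_2, ["GPTBot", "OAI-SearchBot"])` reports GPTBot as blocked and OAI-SearchBot as allowed. Running the same check against your live file is a cheap way to catch an accidental block.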

Posture 3 — closed. Add Disallow: / blocks for the search bots and the on-demand fetchers (ChatGPT-User, Perplexity-User and their peers) too. This is only defensible for paywalled or licensed-content businesses that have decided AI answers are a channel they can live without, and even then, expect zero AI-answer presence.
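
Because the three postures differ only in which bots get a Disallow group, the file can be generated from two lists rather than edited by hand. This is a rough sketch under that assumption: `render_robots_txt` is a hypothetical helper, and the bot lists simply reuse the names from the Posture 2 example, so extend them with whichever peers you also want to cover.

```python
from typing import Iterable

# Bot tokens as used in the Posture 2 example above.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
SEARCH_AND_FETCH_BOTS = ["OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]


def render_robots_txt(blocked: Iterable[str],
                      allowed: Iterable[str] = ()) -> str:
    """Build one robots.txt group per bot: Disallow for blocked, Allow for allowed."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in blocked]
    groups += [f"User-agent: {bot}\nAllow: /" for bot in allowed]
    return "\n\n".join(groups) + "\n"


# Posture 2: block training bots, explicitly allow search and on-demand bots.
posture_2 = render_robots_txt(TRAINING_BOTS, SEARCH_AND_FETCH_BOTS)

# Posture 3: block everything in both lists.
posture_3 = render_robots_txt(TRAINING_BOTS + SEARCH_AND_FETCH_BOTS)
```

Keeping the lists in version control also gives you the written record of which posture you chose.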

How do you know a bot is real?

robots.txt is a request, not a firewall — and plenty of scrapers spoof AI user agents to borrow their reputation. The standard check is forward-confirmed reverse DNS (FCrDNS): reverse-resolve the requesting IP to a hostname, confirm the hostname belongs to the operator's published domain, then forward-resolve it back to the same IP. OpenAI, Anthropic, Google, and Perplexity all publish their IP ranges or verification domains for exactly this purpose. Legible's truth layer runs FCrDNS verification on crawler traffic as part of its machine-access checks, so "we're being crawled by GPTBot" claims are tested rather than assumed.

If you rate-limit or block by user-agent string alone, you'll punish real assistants and let impostors through.
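
The FCrDNS steps described above can be sketched in a few lines of Python using only the standard library. Treat this as an illustration rather than a production verifier: the function names are invented for this example, the allowed domains and IP ranges are placeholders you would replace with whatever each operator actually publishes, and operators that publish only IP ranges need the range check instead of the DNS round trip.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def reverse_lookup(ip: str) -> str:
    """Reverse-resolve an IP to its primary hostname (PTR record)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Forward-resolve a hostname to all of its IP addresses."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def hostname_matches(hostname: str, allowed_domains: Iterable[str]) -> bool:
    """True if hostname equals, or is a subdomain of, an allowed domain."""
    host = hostname.rstrip(".").lower()
    return any(host == d.lower() or host.endswith("." + d.lower())
               for d in allowed_domains)


def verify_fcrdns(ip: str, allowed_domains: Iterable[str],
                  reverse: Callable[[str], str] = reverse_lookup,
                  forward: Callable[[str], List[str]] = forward_lookup) -> bool:
    """Reverse-resolve, check the domain, then forward-resolve back to the IP."""
    try:
        claimed = ipaddress.ip_address(ip)
        hostname = reverse(ip)
        if not hostname_matches(hostname, allowed_domains):
            return False
        resolved = {ipaddress.ip_address(addr) for addr in forward(hostname)}
    except (OSError, ValueError):
        # DNS failure or an unparseable address: treat as unverified.
        return False
    return claimed in resolved


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """Alternative check for operators that publish IP ranges instead."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)
```

The resolver functions are passed in as parameters so the logic can be tested without network access. The suffix check requires a leading dot, so a hostname like `bots.example.net.evil.example` does not pass as a subdomain of `bots.example.net`.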

Our recommendation

Stay open to search, citation, and on-demand fetchers — that traffic is a buyer or an answer that names you. Decide training access separately, on licensing grounds, knowing it changes nothing about this quarter's visibility. And whichever posture you choose, write it down deliberately: an accidental Disallow: / for OAI-SearchBot is the quietest way to vanish from the fastest-growing answer surface.

Want to see which crawlers can currently reach you, and what they find when they do? Run a free Legible report — machine access is one of the eight dimensions it scores.
