For most brands, the answer is no, at least not for the bots that put you in AI answers. Blocking OAI-SearchBot or PerplexityBot removes you from ChatGPT search results and Perplexity citations; whether to block training crawlers like GPTBot is a separate decision with much lower stakes. The mistake almost every "block AI bots" tutorial makes is treating them as one category.
What does each AI bot actually do?
There are three distinct jobs, and the trade-offs differ for each:
- Training crawlers collect content for future model training. Blocking them has no effect on today's AI answers.
- Search/citation crawlers build the indexes AI assistants search when answering live questions — with links back to you. Blocking them removes you from those answers.
- On-demand fetchers retrieve a specific page because a user asked an assistant to look at it. Blocking them breaks the experience for a human who is actively interested in you.
| Bot | Operator | Job | If you block it |
|---|---|---|---|
| GPTBot | OpenAI | Training | Content excluded from future OpenAI training |
| OAI-SearchBot | OpenAI | Search index | You drop out of ChatGPT search results and citations |
| ChatGPT-User | OpenAI | On-demand fetch | ChatGPT can't open your pages when a user asks |
| ClaudeBot | Anthropic | Training-oriented crawl | Content excluded from future Anthropic training |
| Claude-User / Claude-SearchBot | Anthropic | On-demand fetch / search | Claude can't fetch or surface your pages |
| PerplexityBot | Perplexity | Search index | You disappear from Perplexity answers |
| Perplexity-User | Perplexity | On-demand fetch | Perplexity can't open your pages for a user |
| Google-Extended | Google | Training control token | Content excluded from Gemini training. Does not remove you from Google Search or AI Overviews; those follow normal Googlebot indexing |
| CCBot | Common Crawl | Open dataset | Content excluded from a corpus many labs train on |
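To apply this split to your own server logs, the table can be turned into a small lookup. The following is a minimal Python sketch, not an official list: the tokens come from the table above, `BOT_JOBS` and `classify_user_agent` are hypothetical names, and the substring match assumes each crawler includes its token in the User-Agent header. Google-Extended is left out because it is a robots.txt control token, not a user-agent string that appears in requests.

```python
# Map each crawler token from the table to the job it performs.
# CCBot is grouped with training here because its corpus feeds training.
BOT_JOBS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "on-demand",
    "Claude-User": "on-demand",
    "Perplexity-User": "on-demand",
}


def classify_user_agent(user_agent):
    """Return the job of the first known AI bot token found, or None."""
    ua = user_agent.lower()
    for token, job in BOT_JOBS.items():
        if token.lower() in ua:
            return job
    return None
```

A match on the header alone only tells you what a request claims to be; it says nothing about whether the claim is genuine.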
What is the actual trade-off?
Blocking training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) is a values-and-licensing decision. The cost is long-term: future models may know less about you, and models increasingly answer from what they already know. The benefit is control over how your content is used.
Blocking search and on-demand bots (OAI-SearchBot, PerplexityBot, ChatGPT-User and their peers) is a visibility decision, and the cost is immediate. These are the crawlers behind generative engine optimization: they fetch your pages so assistants can cite and link them in live answers. Block them and buyers asking "best X for Y" simply get answers built from your competitors' pages.
What should your robots.txt look like?
Posture 1 — open (recommended default). Let everything in; steer with a sitemap and an llms.txt file:
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Posture 2 — selective: no training, yes citations. The common middle ground for publishers:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: ChatGPT-User
Allow: /
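Before deploying a file like this, it is worth checking that it parses the way you intend. Here is a minimal sketch using Python's standard-library `urllib.robotparser`, assuming the Posture 2 rules above and a hypothetical `https://example.com/pricing` URL. Real crawlers may handle edge cases differently from this parser, so treat it as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# The Posture 2 rules, with blank lines between groups for readability.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""

BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
        "OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]


def allowed_bots(robots_txt, bots, url):
    """Return {bot: True/False} for whether each bot may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# Hypothetical URL, used only to exercise the rules.
result = allowed_bots(ROBOTS_TXT, BOTS, "https://example.com/pricing")
for bot, ok in result.items():
    print(f"{bot:16} {'allowed' if ok else 'blocked'}")
```

Running the same check against your live file after every deploy is a cheap way to catch an accidental block on a search or on-demand bot.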
Posture 3 — closed. Add Disallow: / blocks for the search and on-demand fetch bots too (OAI-SearchBot, PerplexityBot, ChatGPT-User and their peers). Only defensible for paywalled or licensed-content businesses that have decided AI answers are a channel they can live without, and even then, expect zero AI-answer presence.
How do you know a bot is real?
robots.txt is a request, not a firewall, and plenty of scrapers spoof AI user agents to borrow their reputation. A common check is forward-confirmed reverse DNS (FCrDNS): reverse-resolve the requesting IP to a hostname, confirm the hostname belongs to the operator's published domain, then forward-resolve it back to the same IP. Google documents this check for its crawlers, and OpenAI and Perplexity publish IP range lists you can match against instead; check each operator's current documentation for the method it supports. Legible's truth layer runs FCrDNS verification on crawler traffic as part of its machine-access checks, so "we're being crawled by GPTBot" claims are tested rather than assumed.
If you rate-limit or block by user-agent string alone, you'll punish real assistants and let impostors through.
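The FCrDNS steps described above can be sketched in a few lines of Python with the standard `socket` module. This is an illustration under stated assumptions, not any vendor's reference implementation: `is_verified_crawler` is a hypothetical helper, the `googlebot.com` and `google.com` suffixes should be checked against Google's current documentation, and the IPs and hostnames in the demo are made-up values from a documentation address range. The resolvers are injectable so the logic can be exercised without network access.

```python
import socket


def reverse_lookup(ip):
    """PTR lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """A/AAAA lookup: hostname -> set of IP address strings."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(ip, allowed_domains,
                        reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for a requesting IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror / socket.gaierror
        return False
    # The hostname must be the domain itself or a true subdomain of it,
    # so "googlebot.com.evil.example" does not pass.
    if not any(hostname == d or hostname.endswith("." + d)
               for d in allowed_domains):
        return False
    try:
        # Compares address strings as-is; IPv6 may need normalizing first.
        return ip in forward(hostname)
    except OSError:
        return False


# Offline demo with fake DNS data (all values are illustrative).
GOOGLE_DOMAINS = ("googlebot.com", "google.com")  # assumption: verify in docs
fake_ptr = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "192.0.2.66": "googlebot.com.evil.example",
    "192.0.2.77": "crawl-192-0-2-10.googlebot.com",  # spoofed PTR record
}
fake_a = {"crawl-192-0-2-10.googlebot.com": {"192.0.2.10"}}


def check(ip):
    return is_verified_crawler(ip, GOOGLE_DOMAINS,
                               reverse=lambda i: fake_ptr[i],
                               forward=lambda h: fake_a.get(h, set()))
```

In production you would call `is_verified_crawler(ip, domains)` with the default resolvers and cache the result per IP, since two DNS lookups on every request add latency.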
Our recommendation
Stay open to search, citation, and on-demand fetchers — that traffic is a buyer or an answer that names you. Decide training access separately, on licensing grounds, knowing it changes nothing about this quarter's visibility. And whichever posture you choose, write it down deliberately: an accidental Disallow: / for OAI-SearchBot is the quietest way to vanish from the fastest-growing answer surface.
Want to see which crawlers can currently reach you, and what they find when they do? Run a free Legible report — machine access is one of the eight dimensions it scores.