Lesli.com -- AI Visibility & SEO

Robots.txt and AI Crawlers: What to Allow, What to Block

By Lesli Rose · April 12, 2026 · 10 min read

There are at least ten AI crawlers hitting your website right now. Some power AI search. Some scrape your content for model training. Your robots.txt file controls which ones get in. Most businesses either block them all by accident or let everything through without understanding the tradeoffs. Here is exactly how to handle each one.

What Robots.txt Actually Does for AI

Your robots.txt file lives at yourdomain.com/robots.txt. It is a plain text file that tells crawlers what they can and cannot access on your site. For decades, the only crawler most people cared about was Googlebot. That changed in 2023 when OpenAI, Anthropic, and others launched AI crawlers that read your site for different reasons.

The critical thing to understand: not all AI crawlers do the same thing. Some crawl your site so their AI can recommend you in real-time search results. Others crawl your site to train their models on your content. These are fundamentally different use cases, and your robots.txt should treat them differently.
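To make the per-crawler idea concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The two rule groups and the example.com URL are illustrative assumptions, not rules taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one search crawler allowed, one training crawler blocked.
RULES = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def can_crawl(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `rules` permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.com/services"  # placeholder URL
for bot in ("OAI-SearchBot", "GPTBot"):
    verdict = "allowed" if can_crawl(RULES, bot, page) else "blocked"
    print(f"{bot}: {verdict}")
```

Each crawler is matched against its own group, so the same file can give two bots from the same company opposite answers.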

Every AI Crawler You Need to Know

Here are the main AI crawlers active as of early 2026, what they do, and whether you should allow them.

Search-Oriented Crawlers (Allow These)

OAI-SearchBot -- OpenAI's search crawler. This is how ChatGPT cites your site in real-time answers. Blocking this means ChatGPT cannot reference your pages when users search. Allow it.

ChatGPT-User -- The crawler ChatGPT uses when a user explicitly asks it to browse a URL. If someone pastes your link into ChatGPT, this crawler fetches the page. Allow it.

Claude-SearchBot -- Anthropic's search crawler for Claude. Powers real-time web search in Claude. Allow it.

PerplexityBot -- Perplexity's crawler. Perplexity is one of the fastest-growing AI search engines, and it cites sources directly. Allow it.

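Google-Extended -- Not a separate crawler but a robots.txt control token. It governs whether content Googlebot has already crawled can be used to train Gemini models and to ground answers in the Gemini apps. It does not affect regular Google Search, and AI Overviews follow your Googlebot rules instead. Allow it if you want to be eligible for Gemini answers.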

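Applebot-Extended -- A control token rather than a crawler. It tells Apple whether pages already fetched by Applebot may be used to train the foundation models behind Apple Intelligence. Applebot itself, which powers Siri and Spotlight results, is governed separately. Allow it.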

Training-Only Crawlers (Block or Allow Based on Your Business)

GPTBot -- OpenAI's training crawler. Gathers content for model training, not for real-time search. Most lead-generation businesses can safely block this one while still allowing OAI-SearchBot.

ClaudeBot -- Anthropic's general crawler used for training data. You can block this while allowing Claude-SearchBot if you only want search citations without model training.

Bytespider -- ByteDance's crawler (TikTok's parent company). Used for training. Most businesses block this one because ByteDance does not offer a consumer AI search product that recommends businesses.

CCBot -- Common Crawl's bot. Builds an open dataset used by many AI companies for training. Block if you want to limit training data exposure. Allow if you want maximum reach.

The Difference Between Search and Training

This is the distinction most people miss. When OAI-SearchBot crawls your site, it is reading your content so ChatGPT can cite you as a source in a conversation happening right now. That is direct visibility -- someone asks a question, ChatGPT reads your page, and recommends your business with a link.

When GPTBot crawls your site, it is collecting content for OpenAI's next model training run. Your content gets mixed into the training data. The model might "know" about your business in a general sense, but it is not citing you directly. There is no link. There is no attribution.

For most businesses, the search crawlers are the ones that matter. They are the ones that drive real recommendations with real citations. The training crawlers are a judgment call -- they increase the chance an AI "knows" about your business in its base knowledge, but the tradeoff is your content being used without attribution.

The Copy-Paste Robots.txt Template

Here is the robots.txt I recommend for most lead-generation businesses. It allows all search-oriented AI crawlers while blocking training-only crawlers and protecting sensitive directories.

# === AI Search Crawlers (ALLOW) ===
# A crawler that matches a named group ignores the "User-agent: *" group,
# so the protected paths are repeated in each group below.

User-agent: OAI-SearchBot
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: ChatGPT-User
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: Claude-SearchBot
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: Google-Extended
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: Applebot-Extended
Allow: /
Disallow: /admin/
Disallow: /private/

# === Training Crawlers (BLOCK) ===

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

# === Standard Crawlers ===

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

Sitemap: https://yourdomain.com/sitemap.xml

If you are comfortable with training crawlers accessing your content (and there is an argument for maximum exposure), you can change the training crawlers section to Allow: / instead. The important thing is making that decision intentionally, not by accident.
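If you would rather generate the file than edit it by hand, the following Python sketch builds a robots.txt in the same spirit as the template, with one switch for the training decision. The function name, its parameters, and the two agent lists are my own illustrative choices. The sketch repeats the protected paths in every group on purpose, because a crawler that matches a named group does not fall back to the "User-agent: *" rules.

```python
# Agents the article recommends allowing, and the training-only crawlers.
ALLOW_AGENTS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
]
TRAINING_AGENTS = ["GPTBot", "ClaudeBot", "Bytespider", "CCBot"]


def build_robots_txt(
    block_training: bool = True,
    private_paths: tuple[str, ...] = ("/admin/", "/private/"),
    sitemap_url: str = "https://yourdomain.com/sitemap.xml",
) -> str:
    """Return robots.txt text; all defaults are placeholders to adapt."""
    lines: list[str] = []

    def open_group(agent: str) -> None:
        # Allow the site but keep the private paths off limits.
        lines.append(f"User-agent: {agent}")
        lines.append("Allow: /")
        lines.extend(f"Disallow: {path}" for path in private_paths)
        lines.append("")

    def closed_group(agent: str) -> None:
        # Block the whole site for this agent.
        lines.extend([f"User-agent: {agent}", "Disallow: /", ""])

    for agent in ALLOW_AGENTS:
        open_group(agent)
    for agent in TRAINING_AGENTS:
        (closed_group if block_training else open_group)(agent)
    open_group("*")
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


print(build_robots_txt(block_training=True))
```

Calling build_robots_txt(block_training=False) produces the maximum-exposure variant, which keeps the training decision in one visible place instead of scattered across ten groups.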

Common Mistakes I See in Every Audit

I have run dozens of AI visibility audits. The robots.txt mistakes fall into predictable patterns.

Blocking everything with a wildcard. Some sites have "User-agent: * / Disallow: /" which blocks every crawler, including AI search bots. This is the nuclear option and it is almost never what the site owner intended.

WordPress plugin defaults. Security plugins such as Wordfence, Sucuri, and All In One Security can block "unknown bots", sometimes with their default settings. AI crawlers are still new enough that they get flagged as unknown. The block happens at the firewall level, so it does not even show up in robots.txt -- making it invisible to most site owners.

Wix and Squarespace defaults. These platforms generate robots.txt automatically. Their defaults may not include AI crawler directives at all. If you are on a managed platform, check whether you can even edit your robots.txt -- and if you cannot, that is a problem.

Niche builders that ignore AI entirely. I recently audited a breeder website built on Breederoo that had no robots.txt at all. No robots.txt means everything is allowed by default -- which sounds fine until you realize the site also had no sitemap, no schema, and no crawl path for AI to follow. Open doors do not help if the house is empty.

Treating all AI crawlers the same. Blocking GPTBot and OAI-SearchBot together because they are both from OpenAI. These are different crawlers with different purposes. You can block one and allow the other.
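The first and last mistakes are easy to reproduce. This Python sketch uses the standard-library urllib.robotparser to show that a wildcard block catches every crawler, and that a named group overrides the wildcard for that one crawler only. Both rule sets are hypothetical and exist only to demonstrate the behavior, not as a recommended configuration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Check one user agent against robots.txt text (placeholder URL)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# Mistake: the wildcard blocks every crawler, AI search bots included.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""

# A named group takes precedence over "*" for that crawler alone.
SEARCH_EXEMPT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /
"""

for label, rules in (("wildcard only", BLOCK_EVERYTHING),
                     ("with named group", SEARCH_EXEMPT)):
    for bot in ("OAI-SearchBot", "GPTBot"):
        print(label, bot, is_allowed(rules, bot))
```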

Real Audit Findings

Last month I audited a veterinary clinic that was ranking well in Google but completely invisible to ChatGPT. Their Wordfence plugin had a "block AI bots" setting enabled. The owner had no idea the setting existed -- it was turned on during a security hardening session by their web developer. Five minutes to fix. Months of lost AI visibility.

Another audit -- a SaaS company with 200 pages of documentation. Their robots.txt blocked GPTBot. Reasonable choice for a training crawler. But they had also blocked OAI-SearchBot and ChatGPT-User in the same rule, which meant ChatGPT could not reference their documentation when users asked about their software category. Their competitors who allowed the search crawlers showed up in every ChatGPT recommendation query.

The pattern is the same every time. The site owner did not know about the block, or they blocked all AI crawlers without understanding the search vs training distinction. The fix is always simple. The cost of not fixing it compounds every day.

How to Check Yours Right Now

Open a browser. Go to yourdomain.com/robots.txt. Look for any mention of GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, PerplexityBot, Google-Extended, Applebot-Extended, Bytespider, or CCBot. If any search-oriented crawlers have "Disallow: /" after them, fix it immediately.
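The same check can be scripted. Below is a rough Python sketch that takes the text of a robots.txt file and reports which of the crawlers named in this article may fetch the homepage. The function names and the sample file are assumptions made for illustration. Note that urllib.robotparser applies rules in file order, which can differ from the most-specific-match behavior of the major crawlers, so treat the result as a first pass and not a final verdict.

```python
from urllib.robotparser import RobotFileParser

SEARCH_AGENTS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
]
TRAINING_AGENTS = ["GPTBot", "ClaudeBot", "Bytespider", "CCBot"]


def audit_robots_txt(robots_txt: str,
                     test_url: str = "https://yourdomain.com/") -> dict[str, bool]:
    """Map each agent to True (allowed) or False (blocked) for test_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, test_url)
            for agent in SEARCH_AGENTS + TRAINING_AGENTS}


def blocked_search_agents(robots_txt: str) -> list[str]:
    """Return the search-oriented agents that the file blocks."""
    report = audit_robots_txt(robots_txt)
    return [agent for agent in SEARCH_AGENTS if not report[agent]]


# Hypothetical file: a search crawler blocked in the same rule as GPTBot.
SAMPLE = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
Disallow: /
"""

for agent, allowed in audit_robots_txt(SAMPLE).items():
    print(f"{agent}: {'allowed' if allowed else 'BLOCKED'}")
print("Fix these:", blocked_search_agents(SAMPLE))
```

To audit a live site instead of pasted text, RobotFileParser also offers set_url() and read(), which download the file before you call can_fetch().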

Then check your security plugins and CDN settings. Cloudflare, Wordfence, Sucuri -- all of them can block AI crawlers at the firewall level without touching your robots.txt. Look for bot management, firewall rules, or AI bot settings. If you find blanket blocks, whitelist the search-oriented crawlers at minimum.
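A firewall block will not appear in robots.txt, but you can sometimes spot one by requesting the same page with two different User-Agent headers and comparing the responses. The Python sketch below is a heuristic under several assumptions: the baseline user-agent string is made up, the bare crawler name stands in for the longer strings real crawlers send, and the list of "blocked" status codes is my own guess at common firewall responses. Some firewalls also verify crawler IP addresses, so a request from your own machine may be treated differently from the real bot.

```python
import urllib.error
import urllib.request

# Hypothetical baseline identity for the comparison request.
BASELINE_UA = "Mozilla/5.0 (compatible; visibility-check)"

# Status codes treated as a likely firewall block (an assumption).
BLOCK_CODES = (401, 403, 406, 429, 503)


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url when sent with user_agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses raise; the code is what we want.
        return err.code


def looks_firewalled(baseline_status: int, bot_status: int) -> bool:
    """True when the baseline request succeeds but the bot request is refused."""
    return baseline_status < 400 and bot_status in BLOCK_CODES


# Example usage (makes real network requests, so it is left commented out):
# baseline = fetch_status("https://yourdomain.com/", BASELINE_UA)
# as_bot = fetch_status("https://yourdomain.com/", "OAI-SearchBot")
# print(looks_firewalled(baseline, as_bot))
```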

If you do not see any AI crawlers mentioned in your robots.txt at all, that is usually fine. Robots.txt works on an allow-by-default basis, so a crawler with no group of its own falls back to the "User-agent: *" rules, and if those do not block it, it is allowed. The problem is only when crawlers are blocked explicitly, by a wildcard "Disallow: /", or at the firewall level.

Frequently Asked Questions

Should I block AI crawlers that train models on my content?

It depends on your business model. If your content is the product, blocking training crawlers makes sense. If your site generates leads, allow search crawlers and make a separate decision about training crawlers. The key is distinguishing between the two types.

What is the difference between GPTBot and OAI-SearchBot?

GPTBot collects content for model training. OAI-SearchBot reads your pages so ChatGPT can cite you in real-time search results. You can block one and allow the other. Most businesses should allow OAI-SearchBot regardless of their stance on GPTBot.

Does my website builder automatically block AI crawlers?

Some do. WordPress security plugins can block unknown bots at the firewall level. Wix and Squarespace generate their own robots.txt. Niche builders often ignore AI crawlers entirely. Always check yourdomain.com/robots.txt directly and review your security plugin settings.

How often should I update my robots.txt for AI crawlers?

Review quarterly. New AI crawlers launch regularly and existing ones change user-agent strings. Set a calendar reminder to check for new crawlers every three months and update your directives.
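One way to make the quarterly review concrete is to scan your server access log for crawler user agents. The Python sketch below assumes the common "combined" log format, where the user agent is the last quoted field on each line. The sample lines, including the ExampleNewBot name, are invented for illustration. Google-Extended and Applebot-Extended are left out because they are robots.txt control tokens and are not sent as request user agents.

```python
import re
from collections import Counter

# Crawlers from this article that identify themselves in requests.
KNOWN_AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "PerplexityBot", "Bytespider", "CCBot",
]

# In combined log format the user agent is the last double-quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')
BOT_HINTS = ("bot", "crawler", "spider")


def scan_log(lines):
    """Return (known AI crawler hits, unrecognized bot-like user agents)."""
    known, unknown = Counter(), Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match is None:
            continue
        user_agent = match.group(1)
        lowered = user_agent.lower()
        matched = [a for a in KNOWN_AI_AGENTS if a.lower() in lowered]
        if matched:
            known.update(matched)
        elif any(hint in lowered for hint in BOT_HINTS):
            unknown[user_agent] += 1
    return known, unknown


# Invented sample lines with simplified user-agent strings.
SAMPLE = [
    '203.0.113.5 - - [12/Apr/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [12/Apr/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.7 - - [12/Apr/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleNewBot/0.1"',
    '203.0.113.8 - - [12/Apr/2026:10:00:12 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

known_hits, unknown_hits = scan_log(SAMPLE)
print("Known AI crawlers:", dict(known_hits))
print("Review these:", dict(unknown_hits))
```

The second counter will also pick up ordinary search engine bots, so read it as a list to review and not as a list of new AI crawlers.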

Not Sure What Your Robots.txt Is Doing?

I'll check your robots.txt, security plugins, and CDN settings for AI crawler blocks -- and tell you exactly what to fix. AI Visibility Action Plan, no commitment.
