One thing that still surprises me is how often robots.txt audits stop at a few common bots.
Googlebot.
Bingbot.
…and maybe one or two others.
But the crawler ecosystem has expanded a lot.
Today, websites are being crawled not just by traditional search engines, but also by:
AI training crawlers
AI search indexing bots
Live user-triggered retrieval bots
Dataset collectors
SEO tool crawlers
Social preview crawlers
And each one of these can be addressed in robots.txt through its own User-agent rule.
Before getting into the crawlers themselves, it helps to understand what robots.txt actually controls and what it doesn’t.
What robots.txt Can (and Cannot) Control
This section is for beginners; if you're a pro, feel free to skip to the next section.
robots.txt is a crawl directive file, not a security layer.
It tells crawlers what they should or shouldn’t access, but it ultimately relies on the crawler respecting those rules.
Most major search engines and AI companies follow robots.txt.
But technically, it’s still advisory rather than enforceable.
These are the most common directives you’ll see:
| Directive | What it Does | Support |
|---|---|---|
| User-agent | Defines which crawler the rule applies to | Supported by all major crawlers |
| Allow | Allows crawling of a specific path | Supported by Google and others |
| Disallow | Blocks crawling of a specific path | Supported by all compliant crawlers |
| Sitemap | Points crawlers to XML sitemaps | Supported by Google and Bing |
| Crawl-delay | Suggests a delay between crawl requests | Supported by Bing/Yandex (ignored by Google) |
| Host | Declares the preferred domain | Mainly supported by Yandex |
In practice, most of the control SEOs use comes down to Allow and Disallow.
For example:
```
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
```
In this case, the site blocks OpenAI's AI training crawler but still allows traditional search indexing via Googlebot.
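Under the hood, a well-behaved crawler parses these rules before fetching anything. Here's a minimal sketch using Python's standard-library robots.txt parser to check how the example above is interpreted (the URL is just a placeholder):

```python
# Sketch: how a compliant crawler interprets the example rules above,
# using Python's built-in robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/page"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/page"))  # True
```

Note that a bot with no matching group (and no `User-agent: *` fallback) is allowed by default, which is why explicit rules per crawler matter.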
That said, since robots.txt is advisory, some bots or scrapers may ignore it, and stronger restrictions would require server or firewall-level controls.
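To illustrate the difference between advisory and enforced controls, here's a minimal WSGI middleware sketch (the blocklist and names are hypothetical, not from any particular product) that actively rejects requests by user agent; real deployments usually do this at the CDN or firewall layer instead:

```python
# Illustrative server-side enforcement: unlike robots.txt, this rejects
# matching requests outright instead of relying on crawler goodwill.
BLOCKED_AGENTS = ("GPTBot", "CCBot")  # hypothetical example blocklist

def block_bots(app):
    """Wrap a WSGI app; respond 403 Forbidden to blocked user agents."""
    def wrapper(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot.lower() in ua.lower() for bot in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapper
```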
Major Crawlers SEOs Should Know
Traditional Search Engine Crawlers
| Company | Crawler | What it Does | SEO Relevance |
|---|---|---|---|
| Google | Googlebot | Main crawler indexing pages for Google Search. | Determines how pages appear in Google search results. |
| Google | Googlebot-Mobile | Crawls pages using mobile-first indexing. | Influences how Google evaluates mobile usability. |
| Google | Googlebot-Image | Crawls images on websites. | Enables visibility in Google Image Search. |
| Google | Googlebot-Video | Crawls video content. | Allows videos to appear in video search results. |
| Google | Googlebot-News | Crawls news content. | Determines eligibility for Google News and Top Stories. |
| Google | Google-Extended | Controls whether content can be used for AI training. | Allows sites to opt out of Google AI training usage. |
| Microsoft | Bingbot | Crawls pages for Bing search. | Important because Bing powers Microsoft Copilot and other AI features. |
| Apple | Applebot | Crawls pages for Apple services. | Helps content appear in Siri and Spotlight search. |
AI Platform Crawlers
| Company | Crawler | What it Does | SEO / AI SEO Relevance |
|---|---|---|---|
| OpenAI | GPTBot | Collects data used for AI model training. | Determines whether site content contributes to AI training datasets. |
| OpenAI | OAI-SearchBot | Indexes pages for ChatGPT search answers. | Enables sites to be cited inside AI responses. |
| OpenAI | ChatGPT-User | Fetches pages during live queries. | Retrieves real-time information when ChatGPT answers questions. |
| Anthropic | ClaudeBot | Crawls the web for Claude training datasets. | Determines whether site content contributes to model training. |
| Anthropic | Claude-SearchBot | Indexes pages for Claude responses. | Allows AI answers to reference websites. |
| Anthropic | Claude-User | Fetches pages during live queries. | Retrieves real-time content for Claude's answers. |
| Perplexity | PerplexityBot | Crawls pages for Perplexity AI search. | Enables pages to appear as sources in AI answers. |
| Perplexity | Perplexity-User | Retrieves pages during live queries. | Fetches live information when answering queries. |
AI Dataset Crawlers
| Organization | Crawler | What it Does | SEO Relevance |
|---|---|---|---|
| Common Crawl | CCBot | Crawls the web to build large open datasets. | These datasets are often used to train AI models. |
| Diffbot | Diffbot | Extracts structured data from websites. | Builds knowledge graphs used by AI systems. |
| AI2 | AI2Bot | Collects web content for research datasets. | Used in academic AI training projects. |
| xAI | xAI-Bot | Crawls web content for Grok AI training. | Determines whether site data enters xAI training datasets. |
SEO Tool Crawlers
| Company | Crawler | What it Does | SEO Relevance |
|---|---|---|---|
| Ahrefs | AhrefsBot | Crawls websites to collect backlink data. | Used for backlink analysis and competitor research. |
| Semrush | SemrushBot | Crawls sites to gather keyword and ranking data. | Supports SEO research tools. |
| Majestic | MJ12Bot | Crawls pages to build backlink intelligence datasets. | Used for link analysis. |
| Moz | DotBot | Crawls websites for link metrics. | Powers domain authority and link metrics. |
Top Crawlers SEOs Should Actually Watch
While there are many crawlers out there, a few tend to matter the most today.
| Crawler | Why It Matters |
|---|---|
| Googlebot | Still the primary crawler for organic search visibility. |
| Bingbot | Important because Bing powers Microsoft Copilot and other AI systems. |
| OAI-SearchBot | Relevant for visibility inside ChatGPT search results. |
| Claude-SearchBot | Determines whether Claude can reference your site. |
| PerplexityBot | Important for AI search engines focused on cited answers. |
| GPTBot | Controls whether content is used for AI model training. |
| Google-Extended | Allows sites to opt out of Google AI training usage. |
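A practical way to decide which of these actually matter for your site is to check your own access logs. Here's a rough sketch, assuming the common combined log format where the user agent is the last quoted field (the sample lines are invented for illustration):

```python
# Sketch: tallying hits from notable crawlers in a combined-format access log.
import re
from collections import Counter

WATCHED = ["Googlebot", "Bingbot", "GPTBot", "OAI-SearchBot",
           "ClaudeBot", "PerplexityBot", "CCBot", "AhrefsBot"]

def crawler_counts(log_lines):
    """Count hits per watched crawler, matching on the user-agent field."""
    counts = Counter()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)  # request, referrer, user agent
        ua = quoted[-1] if quoted else ""
        for bot in WATCHED:
            if bot.lower() in ua.lower():
                counts[bot] += 1
    return counts

# Invented sample lines for illustration:
sample = [
    '1.2.3.4 - - [10/Jan/2025] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '5.6.7.8 - - [10/Jan/2025] "GET /docs HTTP/1.1" 200 128 "-" "GPTBot/1.0"',
]
print(dict(crawler_counts(sample)))  # {'Googlebot': 1, 'GPTBot': 1}
```

If a crawler never shows up in your logs, rules for it are mostly future-proofing; if it hits daily, the decision is live.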
Industry-Specific Crawler Focus
Different industries may naturally benefit from different crawler ecosystems. Below are a few crawlers grouped by industry for easier reference.
Note: Treat this as a starting point, not a rulebook. Sit down with your team, review the use cases for each crawler, and implement what makes sense for your setup.
E-commerce
Focus on:
Googlebot
Googlebot-Image
Bingbot
Product discovery still happens largely through search results, shopping feeds, and image search.
SaaS & Technical Products
Focus on:
Googlebot
OAI-SearchBot
Claude-SearchBot
PerplexityBot
Technical documentation and guides are often referenced inside AI-generated answers.
Media & Publishing
Focus on:
Googlebot-News
Googlebot
Bingbot
News content frequently appears in Top Stories and AI summaries.
Local Businesses
Focus on:
Googlebot
Bingbot
Applebot
Local discovery increasingly involves search engines and voice assistants.
Educational & Research Websites
Focus on:
Googlebot
CCBot
GPTBot
ClaudeBot
Research and knowledge content often feed AI datasets and training material.
Bonus: Cloudflare Settings
If you use Cloudflare, this section is for you.
Cloudflare introduced Content Signals to give website owners more control over how crawlers use their content beyond simple crawling permissions.
Unlike traditional robots.txt rules, Content Signals specify how content may be used after it is accessed.
Note: Each crawler's role is set by its respective company, but from the infrastructure layer you can control usage as well.
Example:

```
User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=no
Allow: /
```

| Signal | What it Controls | SEO Benefit |
|---|---|---|
| search | Allows crawlers to index pages and show links/excerpts in search results | Lets sites remain visible in search engines while controlling AI usage |
| ai-input | Allows content to be used as input for AI-generated answers (e.g., RAG systems, AI summaries) | Enables visibility in AI answer engines like ChatGPT and Perplexity |
| ai-train | Allows content to be used for training AI models | Lets sites opt out of training datasets while keeping search visibility |
Example use case: allow search but block training.

```
Content-Signal: search=yes, ai-train=no
```

This allows your pages to appear in search results but prevents them from being used to train AI models.
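If you manage several properties, the signal line is easy to generate programmatically. A small sketch (the policy values mirror the table above; this is plain string assembly, not an official Cloudflare API):

```python
# Sketch: assembling a robots.txt snippet that combines a crawl rule
# with a Cloudflare-style Content-Signal line from a policy dict.
policy = {
    "search": "yes",    # stay visible in search engines
    "ai-input": "yes",  # allow use in AI answers (RAG, citations)
    "ai-train": "no",   # opt out of model training
}

signal = ", ".join(f"{k}={v}" for k, v in policy.items())
robots_txt = "\n".join([
    "User-agent: *",
    f"Content-Signal: {signal}",
    "Allow: /",
])
print(robots_txt)
```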
In addition, Cloudflare provides several tools that go beyond basic robots.txt controls to help manage how AI and SEO crawlers interact with a website. Through AI Crawl Control, site owners can monitor which AI bots are accessing their content and choose to allow or block them.
Managed robots.txt helps automatically maintain crawler rules, while Bot Management identifies verified bots and allows actions like allowing, blocking, or monitoring them.
Cloudflare also offers robots.txt analytics to track crawler behaviour and detect violations. It provides options to block certain AI crawlers by default to prevent unauthorized scraping, and is experimenting with Pay-Per-Crawl, which could allow websites to monetize AI crawler access in the future.
So if you are using Cloudflare, take full advantage of their bot/crawler management features.
Conclusion
robots.txt itself hasn’t changed much.
But the crawler ecosystem around it has changed a lot.
What used to be a small technical file for search engines is slowly becoming a very important directive layer for how your content participates in search, AI answers, and training datasets.
Which is why it's worth looking beyond just a couple of bots the next time you audit. Going forward, robots.txt will be an increasingly important part of your (AI) SEO strategy.
