One thing that still surprises me is how often robots.txt audits stop at just a few common bots.

Googlebot.
Bingbot.

…and maybe one or two others.

But the crawler ecosystem has expanded a lot.

Today, websites are being crawled not just by traditional search engines, but also by:

  • AI training crawlers

  • AI search indexing bots

  • Live user-triggered retrieval bots

  • Dataset collectors

  • SEO tool crawlers

  • Social preview crawlers

And each one of these interacts with your site's robots.txt through its own User-agent rule.

Before getting into the crawlers themselves, it helps to understand what robots.txt actually controls and what it doesn’t.

What robots.txt Can (and Cannot) Control

This section is for beginners; if you're a pro, feel free to skip to the next section.

robots.txt is a crawl directive file, not a security layer.

It tells crawlers what they should or shouldn’t access, but it ultimately relies on the crawler respecting those rules.

Most major search engines and AI companies follow robots.txt.
But technically, it’s still advisory rather than enforceable.

These are the most common directives you’ll see:

| Directive | What it Does | Support |
| --- | --- | --- |
| User-agent | Defines which crawler the rule applies to | Supported by all major crawlers |
| Allow | Allows crawling of a specific path | Supported by Google and others |
| Disallow | Blocks the crawling of a specific path | Supported by all crawlers |
| Sitemap | Points crawlers to XML sitemaps | Supported by Google and Bing |
| Crawl-delay | Suggests a delay between crawl requests | Supported by Bing/Yandex (ignored by Google) |
| Host | Declares the preferred domain | Mainly supported by Yandex |

In practice, most of the control SEOs use comes down to Allow and Disallow.

For example:

User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

In this case, the site blocks OpenAI's GPTBot training crawler but still allows traditional search indexing via Googlebot.

That said, since robots.txt is advisory, some bots or scrapers may ignore it, and stronger restrictions require server- or firewall-level controls.
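You can sanity-check rules like these programmatically before deploying them. Python's standard-library urllib.robotparser evaluates a robots.txt against a given user agent; here is a minimal sketch using the example rules above (example.com is just a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above. In production you would instead call
# parser.set_url("https://example.com/robots.txt") and parser.read().
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/page"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/page"))  # True
```

This is an easy way to catch a typo in a User-agent token before it silently blocks (or admits) the wrong crawler.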

Major Crawlers SEOs Should Know

Traditional Search Engine Crawlers

| Company | Crawler | What it Does | SEO Relevance |
| --- | --- | --- | --- |
| Google | Googlebot | Main crawler indexing pages for Google Search. | Determines how pages appear in Google search results. |
| Google | Googlebot-Mobile | Crawls pages using mobile-first indexing. | Influences how Google evaluates mobile usability. |
| Google | Googlebot-Image | Crawls images on websites. | Enables visibility in Google Image Search. |
| Google | Googlebot-Video | Crawls video content. | Allows videos to appear in video search results. |
| Google | Googlebot-News | Crawls news content. | Determines eligibility for Google News and Top Stories. |
| Google | Google-Extended | Controls whether content can be used for AI training. | Allows sites to opt out of Google AI training usage. |
| Microsoft | Bingbot | Crawls pages for Bing search. | Important because Bing powers Microsoft Copilot and other AI features. |
| Apple | Applebot | Crawls pages for Apple services. | Helps content appear in Siri and Spotlight search. |
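One caveat when auditing these crawlers in your logs: any scraper can put "Googlebot" in its User-Agent string. Google's documented verification method is a reverse DNS lookup on the requesting IP (the hostname should end in googlebot.com or google.com), followed by a forward lookup to confirm it resolves back to the same IP. A rough sketch, where the helper names are mine:

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def looks_like_google_hostname(hostname: str) -> bool:
    """Check whether a reverse-DNS hostname sits in Google's crawl domains."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the suffix, then forward-confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except OSError:
        return False
    if not looks_like_google_hostname(hostname):
        return False
    try:
        # Forward lookup must resolve back to the original IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False

# Example hostnames as they might appear in access logs:
print(looks_like_google_hostname("crawl-66-249-66-1.googlebot.com"))  # True
print(looks_like_google_hostname("fake-googlebot.example.com"))       # False
```

Most major crawlers (Bingbot, Applebot, GPTBot) publish analogous verification methods, either via reverse DNS or published IP ranges.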

AI Platform Crawlers

| Company | Crawler | What it Does | SEO / AI SEO Relevance |
| --- | --- | --- | --- |
| OpenAI | GPTBot | Collects data used for AI model training. | Determines whether site content contributes to AI training datasets. |
| OpenAI | OAI-SearchBot | Indexes pages for ChatGPT search answers. | Enables sites to be cited inside AI responses. |
| OpenAI | ChatGPT-User | Fetches pages during live queries. | Retrieves real-time information when ChatGPT answers questions. |
| Anthropic | ClaudeBot | Crawls the web for Claude training datasets. | Determines whether site content contributes to model training. |
| Anthropic | Claude-SearchBot | Indexes pages for Claude responses. | Allows AI answers to reference websites. |
| Anthropic | Claude-User | Fetches pages during live queries. | Retrieves real-time content for Claude's answers. |
| Perplexity | PerplexityBot | Crawls pages for Perplexity AI search. | Enables pages to appear as sources in AI answers. |
| Perplexity | Perplexity-User | Retrieves pages during live queries. | Fetches live information when answering queries. |

AI Dataset Crawlers

| Organization | Crawler | What it Does | SEO Relevance |
| --- | --- | --- | --- |
| Common Crawl | CCBot | Crawls the web to build large open datasets. | These datasets are often used to train AI models. |
| Diffbot | Diffbot | Extracts structured data from websites. | Builds knowledge graphs used by AI systems. |
| AI2 | AI2Bot | Collects web content for research datasets. | Used in academic AI training projects. |
| xAI | xAI-Bot | Crawls web content for Grok AI training. | Determines whether site data enters xAI training datasets. |

SEO Tool Crawlers

| Company | Crawler | What it Does | SEO Relevance |
| --- | --- | --- | --- |
| Ahrefs | AhrefsBot | Crawls websites to collect backlink data. | Used for backlink analysis and competitor research. |
| Semrush | SemrushBot | Crawls sites to gather keyword and ranking data. | Supports SEO research tools. |
| Majestic | MJ12Bot | Crawls pages to build backlink intelligence datasets. | Used for link analysis. |
| Moz | DotBot | Crawls websites for link metrics. | Powers domain authority and link metrics. |
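These tool crawlers don't drive any visibility of their own, so some sites throttle or block the heavier ones to save crawl budget. A sketch of what that might look like (whether it makes sense depends on your own tooling: blocking AhrefsBot, for example, also hides your site from your own Ahrefs reports):

```txt
User-agent: MJ12Bot
Disallow: /

User-agent: AhrefsBot
Crawl-delay: 10
```

Ahrefs and Majestic document that their bots honor robots.txt, so rules like these are generally respected.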

Top Crawlers SEOs Should Actually Watch

While there are many crawlers out there, a few tend to matter the most today.

Crawler

Why It Matters

Googlebot

Still the primary crawler for organic search visibility.

Bingbot

Important because Bing powers Microsoft Copilot and other AI systems.

OAI-SearchBot

Relevant for visibility inside ChatGPT search results.

Claude-SearchBot

Determines whether Claude can reference your site.

PerplexityBot

Important for AI search engines focused on cited answers.

GPTBot

Controls whether content is used for AI model training.

Google-Extended

Allows sites to opt out of Google AI training usage.
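Putting this list together, a robots.txt that keeps search and AI-answer visibility while opting out of model training might look like the sketch below. Treat it as a starting point rather than a recommendation; adjust it to your own content policy:

```txt
# Opt out of AI model training
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Keep search and AI-answer visibility
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```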

Industry-Specific Crawler Focus

Different industries may naturally benefit from different crawler ecosystems. Below are a few crawlers grouped by industry for easier reference.

Note: These are only suggestions. Sit down with your team, review the use cases for each crawler, and implement what makes sense for your setup.

E-commerce

Focus on:

  • Googlebot

  • Googlebot-Image

  • Bingbot

Product discovery still happens largely through search results, shopping feeds, and image search.

SaaS & Technical Products

Focus on:

  • Googlebot

  • OAI-SearchBot

  • Claude-SearchBot

  • PerplexityBot

Technical documentation and guides are often referenced inside AI-generated answers.

Media & Publishing

Focus on:

  • Googlebot-News

  • Googlebot

  • Bingbot

News content frequently appears in Top Stories and AI summaries.

Local Businesses

Focus on:

  • Googlebot

  • Bingbot

  • Applebot

Local discovery increasingly involves search engines and voice assistants.

Educational & Research Websites

Focus on:

  • Googlebot

  • CCBot

  • GPTBot

  • ClaudeBot

Research and knowledge content often feed AI datasets and training material.

Bonus: Cloudflare Settings

If you use Cloudflare, this section is for you.

Cloudflare introduced Content Signals to give website owners more control over how crawlers use their content beyond simple crawling permissions.

Unlike traditional robots.txt rules, Content Signals specify how content may be used after it is accessed.

Note: Each crawler's role is set by its respective company, but from the infrastructure layer you can control usage as well.

Example:

User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=no
Allow: /

| Signal | What it Controls | SEO Benefit |
| --- | --- | --- |
| search | Allows crawlers to index pages and show links/excerpts in search results | Lets sites remain visible in search engines while controlling AI usage |
| ai-input | Allows content to be used as input for AI-generated answers (e.g., RAG systems, AI summaries) | Enables visibility in AI answer engines like ChatGPT and Perplexity |
| ai-train | Allows content to be used for training AI models | Lets SEOs opt out of training datasets while keeping search visibility |

Example use case:

Allow search but block training:

Content-Signal: search=yes, ai-train=no

This allows your pages to appear in search results but prevents them from being used to train AI models.
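Content signals sit alongside ordinary directives, so one file can express both crawl permissions and usage policy. A sketch of how the two layers might combine:

```txt
# Usage policy for all crawlers that understand content signals
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /

# Belt-and-braces: also block a training crawler outright
User-agent: GPTBot
Disallow: /
```

Keep in mind that Content-Signal is a Cloudflare-introduced convention, not part of the robots.txt standard; crawlers that don't recognize it will simply ignore the line.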

In addition, Cloudflare provides several tools that go beyond basic robots.txt controls to help manage how AI and SEO crawlers interact with a website. Through AI Crawl Control, site owners can monitor which AI bots are accessing their content and choose to allow or block them.

Managed robots.txt helps automatically maintain crawler rules, while Bot Management identifies verified bots and allows actions like allowing, blocking, or monitoring them.

Cloudflare also offers robots.txt analytics to track crawler behaviour and detect violations. It provides options to block certain AI crawlers by default to prevent unauthorized scraping, and is experimenting with Pay-Per-Crawl, which could allow websites to monetize AI crawler access in the future.

So if you are using Cloudflare, take full advantage of their bot/crawler management features.

Conclusion

robots.txt itself hasn’t changed much.

But the crawler ecosystem around it has changed a lot.

What used to be a small technical file for search engines is slowly becoming a very important directive layer for how your content participates in search, AI answers, and training datasets.

Which is why it’s probably worth looking beyond just a couple of bots the next time you audit. Because from now on, it will become a very important part of your (AI) SEO strategy.
