One thing that still surprises me is how often robots.txt audits stop at just a few common bots.

Googlebot.
Bingbot.

…and maybe one or two others.

But the crawler ecosystem has expanded a lot.

Today, websites are being crawled not just by traditional search engines, but also by:

  • AI training crawlers

  • AI search indexing bots

  • Live user-triggered retrieval bots

  • Dataset collectors

  • SEO tool crawlers

  • Social preview crawlers

And each one of these interacts with your site's robots.txt through its own User-agent rule.

Before getting into the crawlers themselves, it helps to understand what robots.txt actually controls and what it doesn’t.

What robots.txt Can (and Cannot) Control

This section is for beginners; if you're a pro, feel free to skip to the next section.

robots.txt is a crawl directive file, not a security layer.

It tells crawlers what they should or shouldn’t access, but it ultimately relies on the crawler respecting those rules.

Most major search engines and AI companies follow robots.txt.
But technically, it’s still advisory rather than enforceable.

These are the most common directives you’ll see:

| Directive | What it Does | Support |
| --- | --- | --- |
| User-agent | Defines which crawler the rule applies to | Supported by all major crawlers |
| Allow | Allows crawling of a specific path | Supported by Google and others |
| Disallow | Blocks the crawling of a specific path | Supported by all crawlers |
| Sitemap | Points crawlers to XML sitemaps | Supported by Google and Bing |
| Crawl-delay | Suggests a delay between crawl requests | Supported by Bing/Yandex (ignored by Google) |
| Host | Declares the preferred domain | Mainly supported by Yandex |

In practice, most of the control SEOs use comes down to Allow and Disallow.

For example:

User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

In this case, the site blocks OpenAI's GPTBot training crawler but still allows traditional search indexing via Googlebot.

That said, since robots.txt is advisory, some bots or scrapers may ignore it, and stronger restrictions require server- or firewall-level controls.
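You can sanity-check rules like these programmatically before deploying them. Python's standard-library urllib.robotparser evaluates a robots.txt against a given user agent; here is a minimal sketch using the example rules above (example.com is just a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above. In production you would instead call
# parser.set_url("https://example.com/robots.txt") and parser.read().
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/page"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/page"))  # True
```

This is an easy way to catch a typo in a User-agent token before it silently blocks (or admits) the wrong crawler.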

Major Crawlers SEOs Should Know

Traditional Search Engine Crawlers

| Company | Crawler | What it Does | SEO Relevance |
| --- | --- | --- | --- |
| Google | Googlebot | Main crawler indexing pages for Google Search. | Determines how pages appear in Google search results. |
| Google | Googlebot-Mobile | Crawls pages using mobile-first indexing. | Influences how Google evaluates mobile usability. |
| Google | Googlebot-Image | Crawls images on websites. | Enables visibility in Google Image Search. |
| Google | Googlebot-Video | Crawls video content. | Allows videos to appear in video search results. |
| Google | Googlebot-News | Crawls news content. | Determines eligibility for Google News and Top Stories. |
| Google | Google-Extended | Controls whether content can be used for AI training. | Allows sites to opt out of Google AI training usage. |
| Microsoft | Bingbot | Crawls pages for Bing search. | Important because Bing powers Microsoft Copilot and other AI features. |
| Apple | Applebot | Crawls pages for Apple services. | Helps content appear in Siri and Spotlight search. |
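One caveat when auditing these crawlers in your logs: any scraper can put "Googlebot" in its User-Agent string. Google's documented verification method is a reverse DNS lookup on the requesting IP (the hostname should end in googlebot.com or google.com), followed by a forward lookup to confirm it resolves back to the same IP. A rough sketch, where the helper names are mine:

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def looks_like_google_hostname(hostname: str) -> bool:
    """Check whether a reverse-DNS hostname sits in Google's crawl domains."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the suffix, then forward-confirm."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except OSError:
        return False
    if not looks_like_google_hostname(hostname):
        return False
    try:
        # Forward lookup must resolve back to the original IP.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False

# Example hostnames as they might appear in access logs:
print(looks_like_google_hostname("crawl-66-249-66-1.googlebot.com"))  # True
print(looks_like_google_hostname("fake-googlebot.example.com"))       # False
```

Most major crawlers (Bingbot, Applebot, GPTBot) publish analogous verification methods, either via reverse DNS or published IP ranges.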

AI Platform Crawlers

| Company | Crawler | What it Does | SEO / AI SEO Relevance |
| --- | --- | --- | --- |
| OpenAI | GPTBot | Collects data used for AI model training. | Determines whether site content contributes to AI training datasets. |
| OpenAI | OAI-SearchBot | Indexes pages for ChatGPT search answers. | Enables sites to be cited inside AI responses. |
| OpenAI | ChatGPT-User | Fetches pages during live queries. | Retrieves real-time information when ChatGPT answers questions. |
| Anthropic | ClaudeBot | Crawls the web for Claude training datasets. | Determines whether site content contributes to model training. |
| Anthropic | Claude-SearchBot | Indexes pages for Claude responses. | Allows AI answers to reference websites. |
| Anthropic | Claude-User | Fetches pages during live queries. | Retrieves real-time content for Claude's answers. |
| Perplexity | PerplexityBot | Crawls pages for Perplexity AI search. | Enables pages to appear as sources in AI answers. |
| Perplexity | Perplexity-User | Retrieves pages during live queries. | Fetches live information when answering queries. |

AI Dataset Crawlers

| Organization | Crawler | What it Does | SEO Relevance |
| --- | --- | --- | --- |
| Common Crawl | CCBot | Crawls the web to build large open datasets. | These datasets are often used to train AI models. |
| Diffbot | Diffbot | Extracts structured data from websites. | Builds knowledge graphs used by AI systems. |
| AI2 | AI2Bot | Collects web content for research datasets. | Used in academic AI training projects. |
| xAI | xAI-Bot | Crawls web content for Grok AI training. | Determines whether site data enters xAI training datasets. |

SEO Tool Crawlers

| Company | Crawler | What it Does | SEO Relevance |
| --- | --- | --- | --- |
| Ahrefs | AhrefsBot | Crawls websites to collect backlink data. | Used for backlink analysis and competitor research. |
| Semrush | SemrushBot | Crawls sites to gather keyword and ranking data. | Supports SEO research tools. |
| Majestic | MJ12Bot | Crawls pages to build backlink intelligence datasets. | Used for link analysis. |
| Moz | DotBot | Crawls websites for link metrics. | Powers domain authority and link metrics. |
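These tool crawlers don't drive any visibility of their own, so some sites throttle or block the heavier ones to save crawl budget. A sketch of what that might look like (whether it makes sense depends on your own tooling: blocking AhrefsBot, for example, also hides your site from your own Ahrefs reports):

```txt
User-agent: MJ12Bot
Disallow: /

User-agent: AhrefsBot
Crawl-delay: 10
```

Ahrefs and Majestic document that their bots honor robots.txt, so rules like these are generally respected.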

Top Crawlers SEOs Should Actually Watch

While there are many crawlers out there, a few tend to matter the most today.

Crawler

Why It Matters

Googlebot

Still the primary crawler for organic search visibility.

Bingbot

Important because Bing powers Microsoft Copilot and other AI systems.

OAI-SearchBot

Relevant for visibility inside ChatGPT search results.

Claude-SearchBot

Determines whether Claude can reference your site.

PerplexityBot

Important for AI search engines focused on cited answers.

GPTBot

Controls whether content is used for AI model training.

Google-Extended

Allows sites to opt out of Google AI training usage.
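Putting this list together, a robots.txt that keeps search and AI-answer visibility while opting out of model training might look like the sketch below. Treat it as a starting point rather than a recommendation; adjust it to your own content policy:

```txt
# Opt out of AI model training
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Keep search and AI-answer visibility
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```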

Industry-Specific Crawler Focus

Different industries may naturally benefit from different crawler ecosystems. Below are a few crawlers grouped by industry for easier reference.

Note: These are only suggestions. Sit down with your team, review the use cases for each crawler, and implement what makes sense for your setup.

E-commerce

Focus on:

  • Googlebot

  • Googlebot-Image

  • Bingbot

Product discovery still happens largely through search results, shopping feeds, and image search.

SaaS & Technical Products

Focus on:

  • Googlebot

  • OAI-SearchBot

  • Claude-SearchBot

  • PerplexityBot

Technical documentation and guides are often referenced inside AI-generated answers.

Media & Publishing

Focus on:

  • Googlebot-News

  • Googlebot

  • Bingbot

News content frequently appears in Top Stories and AI summaries.

Local Businesses

Focus on:

  • Googlebot

  • Bingbot

  • Applebot

Local discovery increasingly involves search engines and voice assistants.

Educational & Research Websites

Focus on:

  • Googlebot

  • CCBot

  • GPTBot

  • ClaudeBot

Research and knowledge content often feed AI datasets and training material.

Bonus: Cloudflare Settings

If you use Cloudflare, this section is for you.

Cloudflare introduced Content Signals to give website owners more control over how crawlers use their content beyond simple crawling permissions.

Unlike traditional robots.txt rules, Content Signals specify how content may be used after it is accessed.

Note: Each crawler's role is set by its respective company, but from the infrastructure layer you can control usage as well.

Example:

User-agent: *
Content-Signal: search=yes, ai-train=no, ai-input=no
Allow: /

| Signal | What it Controls | SEO Benefit |
| --- | --- | --- |
| search | Allows crawlers to index pages and show links/excerpts in search results | Lets sites remain visible in search engines while controlling AI usage |
| ai-input | Allows content to be used as input for AI-generated answers (e.g., RAG systems, AI summaries) | Enables visibility in AI answer engines like ChatGPT and Perplexity |
| ai-train | Allows content to be used for training AI models | Lets SEOs opt out of training datasets while keeping search visibility |

Example use case:

Allow search but block training:

Content-Signal: search=yes, ai-train=no

This allows your pages to appear in search results but prevents them from being used to train AI models.
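Content signals sit alongside ordinary directives, so one file can express both crawl permissions and usage policy. A sketch of how the two layers might combine:

```txt
# Usage policy for all crawlers that understand content signals
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /

# Belt-and-braces: also block a training crawler outright
User-agent: GPTBot
Disallow: /
```

Keep in mind that Content-Signal is a Cloudflare-introduced convention, not part of the robots.txt standard; crawlers that don't recognize it will simply ignore the line.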

In addition, Cloudflare provides several tools that go beyond basic robots.txt controls to help manage how AI and SEO crawlers interact with a website. Through AI Crawl Control, site owners can monitor which AI bots are accessing their content and choose to allow or block them.

Managed robots.txt helps automatically maintain crawler rules, while Bot Management identifies verified bots and allows actions like allowing, blocking, or monitoring them.

Cloudflare also offers robots.txt analytics to track crawler behaviour and detect violations. It provides options to block certain AI crawlers by default to prevent unauthorized scraping, and is experimenting with Pay-Per-Crawl, which could allow websites to monetize AI crawler access in the future.

So if you are using Cloudflare, take full advantage of their bot/crawler management features.

Conclusion

robots.txt itself hasn’t changed much.

But the crawler ecosystem around it has changed a lot.

What used to be a small technical file for search engines is slowly becoming a very important directive layer for how your content participates in search, AI answers, and training datasets.

Which is why it’s probably worth looking beyond just a couple of bots the next time you audit. Because from now on, it will become a very important part of your (AI) SEO strategy.
