AI Crawler User-Agents
AI platforms typically operate multiple crawlers for distinct purposes. The user-agent string identifies which system is accessing your site. Whether to allow or block a particular crawler depends on its purpose, not just the company that operates it.
What is the difference between a training crawler and a retrieval crawler?
Training crawlers fetch content to build or update model weights. Blocking them stops your content from being used in future training runs. It has no effect on whether that platform’s AI products currently cite or surface your site.
Retrieval crawlers index or fetch content for live search and answer generation. Blocking them removes your site from that platform’s AI search results.
The two are separate systems. Some operators split them cleanly across different user-agents: OpenAI uses GPTBot for training and OAI-SearchBot for search indexing. Anthropic uses ClaudeBot for training, Claude-SearchBot for search indexing, and Claude-User for live page fetches when a user’s query requires it. Others draw the line less cleanly: Perplexity’s PerplexityBot serves both indexing and live retrieval, with Perplexity-User as a second agent for live retrieval.
Because the decisions are independent, you can block training crawlers while leaving retrieval crawlers untouched, or vice versa.
Which AI crawlers are currently active?
| User-agent | Purpose |
|---|---|
| GPTBot (OpenAI) [1] | Training |
| OAI-SearchBot (OpenAI) [1] | ChatGPT Search indexing |
| ChatGPT-User (OpenAI) [1] | Live retrieval (user-triggered) |
| ClaudeBot (Anthropic) [2] | Training |
| Claude-SearchBot (Anthropic) [2] | Search indexing |
| Claude-User (Anthropic) [2] | Live retrieval (user-triggered) |
| claude-code (Anthropic) [2] | Claude Code CLI URL fetches |
| PerplexityBot (Perplexity) [3] | Search indexing and live retrieval |
| Perplexity-User (Perplexity) [3] | Live retrieval |
| Google-Extended (Google) [4] | Gemini and Vertex AI training |
| CCBot (Common Crawl) [5] | Open training dataset (used by many providers) |
| Amazonbot (Amazon) [6] | Training and product improvement |
| Applebot (Apple) [7] | Siri and Spotlight indexing |
| Applebot-Extended (Apple) [7] | Generative AI features training opt-out |
| meta-externalagent (Meta) [8] | AI training |
| YouBot (You.com) [9] | Search indexing |
| ByteSpider (ByteDance) [10] | Training and search |
Note on ChatGPT-User: OpenAI updated its documentation in December 2025 to remove its commitment to honouring robots.txt for ChatGPT-User [11]. As with ByteSpider and Perplexity-User, robots.txt alone cannot be relied on to control it.
Note on ByteSpider: ByteDance does not publish IP verification data and ByteSpider has a documented history of robots.txt non-compliance [10]. IP-level blocking is more reliable than robots.txt for this crawler.
Note on Microsoft Web IQ: Microsoft’s Web IQ grounding API, which powers Copilot and ChatGPT Search responses, draws from Bing’s existing index. No separate Web IQ crawler user-agent exists. BingBot governs what Web IQ can access, and Web IQ inherits Bing’s existing robots.txt compliance. Microsoft has stated it is engaging with the IETF on formalising publisher controls for the AI era.
How do you verify a crawler is legitimate?
A user-agent string is self-reported. Any client can claim to be GPTBot or Googlebot. Verification requires checking the source IP, not the string.
Reverse DNS lookup: perform a reverse DNS lookup on the IP that made the request. The resulting hostname should match the crawler’s documented domain (e.g. googlebot.com for Googlebot). Then perform a forward DNS lookup on that hostname and confirm it resolves back to the same IP. This forward-confirmed reverse DNS check is the standard verification method. See Log File Analysis for how to run this check against your server logs.
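The forward-confirmed reverse DNS check described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a hardened implementation: the function names and the injectable `reverse` and `forward` parameters are hypothetical, and the list of allowed domain suffixes has to come from each operator's own documentation.

```python
import ipaddress
import socket


def _reverse_lookup(ip):
    """Return the PTR hostname for an IP address (raises OSError on failure)."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    """Return every address the hostname resolves to (IPv4 and IPv6)."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def hostname_matches(hostname, allowed_suffixes):
    """True if hostname equals, or is a subdomain of, an allowed domain."""
    host = hostname.rstrip(".").lower()
    suffixes = [s.lower() for s in allowed_suffixes]
    return any(host == s or host.endswith("." + s) for s in suffixes)


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=_reverse_lookup, forward=_forward_lookup):
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""
    try:
        target = ipaddress.ip_address(ip)
        hostname = reverse(ip)                 # step 1: reverse lookup
    except (ValueError, OSError):
        return False
    if not hostname_matches(hostname, allowed_suffixes):
        return False                           # step 2: domain must match
    try:
        resolved = forward(hostname)           # step 3: forward lookup
    except OSError:
        return False
    for address in resolved:
        try:
            if ipaddress.ip_address(address) == target:
                return True                    # step 4: round trip confirmed
        except ValueError:
            continue
    return False
```

A call such as `verify_crawler_ip(ip_from_log, ["googlebot.com"])` performs live DNS queries. The `reverse` and `forward` parameters exist only so the logic can be exercised offline with stub resolvers.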
Published IP lists: several operators provide machine-readable IP ranges:
- OpenAI: published in the bots documentation [1]
- Perplexity: https://www.perplexity.com/perplexitybot.json and https://www.perplexity.com/perplexity-user.json [3]
- Common Crawl: https://index.commoncrawl.org/ccbot.json [5]
- Amazon: published in the Amazonbot documentation [6]
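Once a published list has been downloaded, checking a request IP against it is a CIDR membership test. The sketch below assumes the file is JSON with a top-level `prefixes` array whose entries carry an `ipv4Prefix` or `ipv6Prefix` key. That schema is an assumption, so confirm it against the actual file before relying on it. The function names are hypothetical and fetching the file is left out.

```python
import ipaddress
import json


def parse_prefixes(document):
    """Parse a JSON IP-range document into a list of network objects.

    Assumed schema (verify against the real file):
    {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
    """
    data = json.loads(document)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_ranges(ip, networks):
    """True if the address falls inside any of the published networks."""
    address = ipaddress.ip_address(ip)
    return any(
        address.version == network.version and address in network
        for network in networks
    )


# Illustrative document using reserved documentation ranges, not real crawler IPs.
SAMPLE = '{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}'
sample_networks = parse_prefixes(SAMPLE)
```

With the sample above, `ip_in_ranges("192.0.2.55", sample_networks)` is true and `ip_in_ranges("198.51.100.1", sample_networks)` is false. Published ranges change over time, so a cached copy should be refreshed periodically.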
Anthropic does not publish IP ranges for its crawlers. Its bots run on public cloud provider addresses, which makes IP-based blocking unreliable. Reverse DNS verification is the only reliable method for confirming Anthropic crawler identity.
For crawlers without a published IP list (Anthropic, ByteSpider, YouBot, meta-externalagent), log-based pattern analysis and reverse DNS are the available options.
Which crawlers should you allow or block?
The robots.txt syntax for targeting specific crawlers is covered in Crawlability and robots.txt. The decision of what to do depends on your situation:
No specific concern: the default, where no explicit rules apply to these agents, allows all well-behaved crawlers. Most sites are in this position.
Want AI citations, not training contribution: block training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) while leaving retrieval crawlers (OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot, Claude-User) untouched. Blocking training crawlers has no effect on search citations.
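As a rough illustration of this scenario, the Python sketch below generates robots.txt rules that block only the four training crawlers named above, then uses the standard library's `urllib.robotparser` to confirm how a compliant parser would read them. The helper names and the example.com URL are hypothetical. The check models a crawler that honours robots.txt and says nothing about agents that ignore it.

```python
from urllib.robotparser import RobotFileParser

# Training crawlers named in the scenario above.
TRAINING_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text with one full-site Disallow group per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(TRAINING_CRAWLERS)

# Parse the generated rules the way a compliant crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_allowed(agent, url="https://example.com/article"):
    """True if the parsed rules let this user-agent fetch the URL."""
    return parser.can_fetch(agent, url)
```

Here `is_allowed("GPTBot")` is false while `is_allowed("OAI-SearchBot")` and `is_allowed("Claude-User")` are true, because agents with no matching group fall back to the default of being allowed.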
Paywalled or licensed content: blocking all AI crawlers is a defensible position. Training crawlers extract value without payment; retrieval crawlers may surface excerpts without driving traffic. Be aware that ChatGPT-User may not reliably respect robots.txt directives.
ByteSpider: add IP-level blocks via your hosting platform or CDN in addition to robots.txt rules, given the documented non-compliance.
Perplexity-User: robots.txt is not effective. Use server-side access controls (IP blocking or authentication) to prevent Perplexity-User from fetching content. PerplexityBot (the indexing crawler) does respect robots.txt and can be controlled normally.
In August 2025, Cloudflare published research documenting a secondary Perplexity crawler that spoofs a standard Chrome on macOS user-agent string and rotates IP ranges when declared Perplexity crawlers are blocked, generating millions of additional daily requests to affected sites [12]. Cloudflare subsequently de-listed Perplexity as a verified bot and added managed-rule heuristics to detect and block the behaviour. Perplexity disputed the characterisation, stating the activity relates to user-triggered AI Assistant requests rather than automated crawling. Site owners who need to restrict Perplexity access should use CDN-level managed rules rather than relying on IP blocklists alone.
Frequently asked questions
Does blocking GPTBot affect ChatGPT search results?
No. OAI-SearchBot handles ChatGPT Search indexing. GPTBot is a training crawler only. Blocking it has no effect on whether your site appears in ChatGPT search answers.
Does blocking ClaudeBot affect Claude’s answers?
No. ClaudeBot is Anthropic’s training crawler. Claude-SearchBot handles search indexing, and Claude-User handles live retrieval. Blocking ClaudeBot only affects whether your content is used in future training data.
Can I block one platform’s training crawler but not another’s?
Yes. Each has a distinct user-agent string. You can write separate robots.txt rules for each.
Do all AI crawlers respect robots.txt?
No, and there is no legal requirement to do so. Most major operators comply by policy. ByteSpider is a documented exception [10]. OpenAI removed its robots.txt commitment for ChatGPT-User from its documentation in December 2025 [11]. Perplexity-User does not respect robots.txt by design; server-side access controls are required to block it [3].
How do I see which AI crawlers are actually hitting my site?
Server log analysis is the most accurate method. See Log File Analysis.
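For a quick first pass before a fuller analysis, a short Python sketch can tally requests per declared AI user-agent. It assumes each log line contains the raw user-agent string, as in the common combined log format. The function name is hypothetical and the agent list simply mirrors the table above. Because user-agent strings are self-reported, the counts show claimed identities only and still need IP verification.

```python
from collections import Counter

# User-agent tokens from the table above. Longer tokens are tried first so that
# "Applebot-Extended" is not counted as plain "Applebot".
AI_AGENTS = sorted(
    [
        "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
        "Claude-SearchBot", "Claude-User", "claude-code", "PerplexityBot",
        "Perplexity-User", "CCBot", "Amazonbot", "Applebot-Extended",
        "Applebot", "meta-externalagent", "YouBot", "ByteSpider",
    ],
    key=len,
    reverse=True,
)


def count_ai_crawler_hits(log_lines):
    """Count log lines per declared AI crawler (case-insensitive substring match)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for agent in AI_AGENTS:
            if agent.lower() in lowered:
                counts[agent] += 1
                break  # attribute each request to a single agent
    return counts
```

Passing an open file works directly, for example `count_ai_crawler_hits(open("access.log", encoding="utf-8", errors="replace"))`, where the path is a placeholder for your own log location.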
Footnotes
- [2] "Does Anthropic crawl data from the web, and how can site owners block the crawler?", Anthropic Help Center
- [10] "Scrapers selectively respect robots.txt directives", arXiv 2505.21733 (2025)
- [11] ChatGPT-User robots.txt change, December 2025, noted in OpenAI bots documentation
- [12] "Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives", Cloudflare Blog