GPTBot: OpenAI's four crawlers in 2026
GPTBot is one of four bots. Cloudflare's one-click 'block AI bots' toggle blocks the wrong ones for most sites. Here's the four-bot map, the training-vs-inference distinction nobody else makes correctly, and three robots.txt recipes you can copy.
GPTBot is one of four crawlers OpenAI runs. GPTBot trains the next model. OAI-SearchBot indexes the web for ChatGPT search. ChatGPT-User fetches pages on demand when a user clicks a citation. OAI-AdsBot crawls advertiser landing pages. Block GPTBot and your content stays out of the next training run; block OAI-SearchBot or ChatGPT-User and you become invisible inside ChatGPT entirely. They are not the same decision.
I built TurboAudit, a tool that audits AI search visibility (crawler accessibility is one of the things it checks). I have no relationship with OpenAI, no bot-blocking product to sell, and no preference for whether you allow or block any of these crawlers. Every user-agent string, IP range, and code sample below was verified against the vendor's own documentation in June 2026. When OpenAI rotates user agents — and they do — this page becomes wrong. Re-verify before shipping configuration based on it.
The four OpenAI bots
Most pages treat GPTBot as one thing. As of June 2026, OpenAI publicly documents four crawlers. They have different purposes and different blocking implications. The table below maps them.
These user agents were correct on June 20, 2026. OpenAI's canonical reference is developers.openai.com/api/docs/bots — verify there before shipping any configuration. OpenAI rotates user-agent versions occasionally; this page will be updated when they do, but the canonical source remains theirs.
Training versus inference: why the distinction matters
Of all the things this page tries to argue, this one is load-bearing. Read it twice if you read nothing else.
AI crawlers fall into two categories with very different consequences when blocked. Training bots — GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent — crawl the web to update the model's training data. The model is then frozen at a particular cutoff date and ships. Blocking these bots means: your content will not be part of the next model release. The effect is delayed and indirect — you affect what future models know about you, not what the current ChatGPT or Claude or Gemini knows.
Inference and citation bots — OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot, Claude-User — crawl the web to serve a specific user query. They build the live index that AI search results pull from, and they fetch pages on demand when a user clicks a citation. Blocking these bots means: you will not appear in answers users are receiving right now. The effect is immediate, direct, and visible in your AI search visibility tooling within days.
These are different decisions because the consequences are different. A privacy-conscious site might reasonably want to block training (don't put my content in the model) while keeping inference open (people who ask should still find me). A site that has decided AI search is not their distribution channel might block both. A site that wants maximum AI distribution should allow both.
Most published guidance conflates the two. The standard advice, a two-line robots.txt rule (`User-agent: GPTBot` followed by `Disallow: /`), blocks only one of OpenAI's four bots. Cloudflare's popular one-click "Block AI Bots" toggle blocks GPTBot and OAI-SearchBot together. Sites that flip it lose their ChatGPT search citations and frequently do not know why. The toggle does what its label says; the label hides the distinction.
Should you block GPTBot? Four scenarios
Stop asking "should I block GPTBot" as if it's one question. It's four questions: which combination of training and inference bots fits your situation. Below are the four real configurations sites are running in 2026 and when each one makes sense.
- 01: Block training, allow inference
Blocks: GPTBot, ClaudeBot, Google-Extended
Allows: OAI-SearchBot, ChatGPT-User, Perplexity*, Claude-SearchBot, Claude-User
The have-your-cake-and-eat-it config. Your content does not train the next model (useful if you care about IP or compete with whoever built the model), but you remain citable in ChatGPT search, Perplexity, and Claude search results. Picked by premium publishers and many SaaS marketing sites that want AI visibility without contributing to training. Probably the right default for most brands as of mid-2026.

- 02: Block everything
Blocks: All AI training and inference bots
Allows: Googlebot only (and other non-AI search engines)
The IP-protective config. You accept invisibility in AI search in exchange for total exclusion from training and live AI citations. Picked by some news publishers (NYT, FT, WSJ at various points), high-end legal and medical sites, and brands whose lawyers have advised maximum exclusion. Honest about the trade-off: you opt out of an entire distribution channel that is growing.

- 03: Allow everything
Blocks: Nothing
Allows: All AI crawlers
The maximum-visibility config. Your content trains future models and appears in current AI citations. Picked by SaaS marketing sites that want to be the default LLM reference for their category, documentation sites that want to become the canonical answer for their tool, and most content marketing operations. The right answer for sites whose business model is being found and cited, including most B2B SaaS.

- 04: Allow training, block inference
Blocks: OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-SearchBot
Allows: GPTBot, ClaudeBot
Almost always a mistake. You contribute your content to training future models but do not appear in live answers. Opt-in pollution. The only sites for which this makes sense are ones doing some specific research on LLM training corpora, and they tend to know they are an edge case.
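The four scenarios are two independent switches: block training or not, block inference or not. As a rough sketch, a robots.txt for any of them can be generated from two lists of bot names. The scenario keys and the `build_robots_txt` helper are hypothetical names made up for this illustration, and the bot lists are the ones named in this article. OAI-AdsBot is left out because the article does not place it in either category; add it by hand if you want the full block-everything file.

```python
# Bot names as listed in this article. The two-way grouping is the
# article's training-vs-inference split, not an official taxonomy.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended",
                 "Applebot-Extended", "Meta-ExternalAgent"]
INFERENCE_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
                  "Perplexity-User", "Claude-SearchBot", "Claude-User"]

# Hypothetical scenario keys -> (block training?, block inference?)
SCENARIOS = {
    "block-training-allow-inference": (True, False),
    "block-everything": (True, True),
    "allow-everything": (False, False),
    "allow-training-block-inference": (False, True),
}


def build_robots_txt(scenario: str) -> str:
    """Return one robots.txt group per bot for the chosen scenario."""
    block_training, block_inference = SCENARIOS[scenario]
    groups = []
    for bots, blocked in ((TRAINING_BOTS, block_training),
                          (INFERENCE_BOTS, block_inference)):
        rule = "Disallow: /" if blocked else "Allow: /"
        for bot in bots:
            groups.append(f"User-agent: {bot}\n{rule}")
    # Blank lines separate groups; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt("block-training-allow-inference"))
```

Keeping the bot lists in one place means a renamed or newly documented crawler is a one-line change instead of an edit in every hand-written file.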
The robots.txt cookbook
Three recipes, each named after the scenario above, copyable as-is. Verified against vendor docs on June 20, 2026. Each one is annotated with what it does and what it deliberately leaves out.
Recipe 1 — Block training, allow inference
Most common modern config. Stay out of training, remain citable in AI search.

    # Block OpenAI training (GPTBot)
    User-agent: GPTBot
    Disallow: /

    # Block Anthropic training (ClaudeBot)
    User-agent: ClaudeBot
    Disallow: /

    # Block Google AI training (Google-Extended)
    User-agent: Google-Extended
    Disallow: /

    # Block Apple Intelligence training (Applebot-Extended)
    User-agent: Applebot-Extended
    Disallow: /

    # Block Meta AI training
    User-agent: Meta-ExternalAgent
    Disallow: /

    # Explicitly allow inference / citation bots
    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Perplexity-User
    Allow: /

    User-agent: Claude-SearchBot
    Allow: /

    User-agent: Claude-User
    Allow: /

The explicit Allow lines are belt-and-suspenders: most crawlers default to allowed unless disallowed. Including them documents intent for anyone reading the file later. They also protect against an accidental wildcard disallow anywhere in the file, because a group that names a bot takes precedence over a `User-agent: *` group.
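The belt-and-suspenders point can be checked locally. The sketch below uses Python's standard-library `urllib.robotparser` to show that a group naming a bot wins over a wildcard disallow. The sample file and the `can_fetch` wrapper are made up for illustration, and Python's parser is only one implementation, so treat the result as a sanity check and not as proof of how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: a wildcard disallow plus two bot-specific groups.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def can_fetch(robots_txt: str, user_agent: str,
              url: str = "https://example.com/pricing") -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("OAI-SearchBot", "GPTBot", "SomeOtherBot"):
        print(bot, can_fetch(ROBOTS_TXT, bot))
```

Here OAI-SearchBot is allowed by its own group, GPTBot is blocked by its own group, and an unnamed bot falls through to the wildcard and is blocked.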
Recipe 2 — Block everything
Opt out of AI training and AI citations. Stay in Google Search.

    # OpenAI
    User-agent: GPTBot
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-AdsBot
    Disallow: /

    # Anthropic
    User-agent: ClaudeBot
    Disallow: /

    User-agent: Claude-User
    Disallow: /

    User-agent: Claude-SearchBot
    Disallow: /

    # Google AI (separate from Googlebot)
    User-agent: Google-Extended
    Disallow: /

    # Perplexity
    User-agent: PerplexityBot
    Disallow: /

    User-agent: Perplexity-User
    Disallow: /

    # Apple Intelligence
    User-agent: Applebot-Extended
    Disallow: /

    # Meta AI
    User-agent: Meta-ExternalAgent
    Disallow: /

Notice Googlebot is not listed: blocking Google-Extended does not block Googlebot. Your pages still appear in regular Google Search. This is close to the file that news publishers who opted out of AI crawling are running.
Recipe 3 — Allow everything
Maximum AI visibility. Train me, cite me, fetch me on demand.

    # No AI-specific Disallow lines needed.
    # Default robots.txt behavior allows all crawlers
    # unless explicitly disallowed. The cleanest version
    # of "allow all AI bots" is to not mention them at all.
    User-agent: *
    Allow: /

Counterintuitively, the most permissive AI config is the shortest. Allow lines for individual AI bots are redundant if your wildcard is already permissive. Listing the AI bots here would imply you considered the question and made a deliberate choice, which is sometimes valuable for documentation purposes; usually it's noise.
Three checks after editing robots.txt: (1) curl your robots.txt and confirm the file you edited is the one being served, (2) use a robots.txt validator to confirm syntax is correct, (3) wait 24–72 hours and check your server logs to confirm crawler behavior matches what you configured. Cloudflare, Vercel, and Netlify can each override robots.txt with their own bot management settings — if you are behind any of them, check both layers.
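For the third check, a small log tally is usually enough. The sketch below counts requests per AI bot token in access-log lines. It assumes the common "combined" log format, where the user agent is the last double-quoted field; the `tally_ai_bots` helper and the token list are illustrative, taken from the names in this article. Google-Extended and Applebot-Extended are omitted on the understanding that they are robots.txt control tokens and not user agents that show up in logs.

```python
import re
from collections import Counter

# Match on the token, not the full versioned string, so the tally keeps
# working when a crawler's version number changes.
AI_BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "OAI-AdsBot",
                 "ClaudeBot", "Claude-SearchBot", "Claude-User",
                 "PerplexityBot", "Perplexity-User", "Meta-ExternalAgent"]

# ASSUMPTION: combined log format, user agent is the last quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def tally_ai_bots(log_lines):
    """Count requests per AI bot token; lines without a match are skipped."""
    counts = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


if __name__ == "__main__":
    # Made-up sample line; real user-agent strings are longer.
    sample = ['198.51.100.7 - - [20/Jun/2026:10:00:00 +0000] '
              '"GET /robots.txt HTTP/1.1" 200 120 "-" '
              '"ExampleAgent (compatible; GPTBot/1.0)"']
    print(tally_ai_bots(sample))
```

A bot you disallowed that still shows up after a few days is either ignoring robots.txt or being spoofed, which is a separate problem from the file being wrong.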
Is the bot really GPTBot?
User-agent strings are trivially spoofable. Anyone scraping the web can set their User-Agent header to GPTBot's exact string and request your pages. Some scrapers do exactly this, on the assumption that site owners wave known good-bot user agents through without rate-limiting.
OpenAI publishes the list of IP ranges its bots crawl from at openai.com/gptbot.json. The JSON file is updated when ranges change. The verification pattern is: when a request claims to be GPTBot, check the source IP against the published ranges. If it matches, the request is legitimate. If it does not, you have a spoofer.
For sites behind Cloudflare, the Bot Analytics view already surfaces verified-versus-unverified bot traffic, and Cloudflare maintains its own GPTBot verification independent of the user-agent string. For sites running their own infrastructure, the verification logic belongs in your application layer or in your CDN's bot rules — not in robots.txt, which is a directive to the crawler, not a verification of the crawler.
Anthropic and Perplexity publish equivalent IP range references. The pattern is the same: never trust the user-agent header alone for identification.
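The verification pattern fits in a few lines of application code. This is a minimal sketch using Python's standard `ipaddress` module. The JSON shape it expects (a top-level `prefixes` list whose entries carry `ipv4Prefix` or `ipv6Prefix`) is an assumption about the published file, so inspect the real file before relying on it. The helper names are hypothetical, and the sample ranges in the usage block are reserved documentation addresses, not OpenAI's.

```python
import ipaddress
import json


def load_prefixes(gptbot_json: str):
    """Parse CIDR ranges out of the published JSON.

    ASSUMPTION: the file looks like
    {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}.
    """
    data = json.loads(gptbot_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def is_verified_gptbot(user_agent: str, source_ip: str, networks) -> bool:
    """True only if the UA claims GPTBot AND the IP is in a published range."""
    if "gptbot" not in user_agent.lower():
        return False
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in networks)


if __name__ == "__main__":
    # Placeholder ranges (RFC 5737 / RFC 3849 documentation addresses).
    sample = ('{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, '
              '{"ipv6Prefix": "2001:db8::/32"}]}')
    networks = load_prefixes(sample)
    print(is_verified_gptbot("compatible; GPTBot/1.0", "192.0.2.10", networks))
    print(is_verified_gptbot("compatible; GPTBot/1.0", "198.51.100.1", networks))
```

In production you would fetch the file on a schedule and cache the parsed networks, since the ranges change over time.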
The other AI crawlers — quick reference
Brief reference for the non-OpenAI bots. The same training-vs-inference distinction applies to most of them. Block in robots.txt the same way you block any user agent.
Anthropic
ClaudeBot · Claude-User · Claude-SearchBot
ClaudeBot is training. Claude-User is the per-user fetch (Claude with web access). Claude-SearchBot is the search index. As of 2026 Anthropic deprecated the older anthropic-ai and Claude-Web user agents; rules referencing those names no longer match Anthropic traffic and should be updated.
Perplexity
PerplexityBot · Perplexity-User
PerplexityBot is the primary indexing crawler. Perplexity-User is the per-user fetch when a Perplexity user asks a question that requires a fresh page load. Blocking PerplexityBot removes you from Perplexity search results.
Google
Google-Extended
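Separate from Googlebot. Google-Extended is a robots.txt control token rather than a distinct crawler: it governs whether content Google has already crawled may be used for Gemini training and grounding. It does not control AI Overviews, which are part of Google Search and follow Googlebot rules. Blocking it has no effect on traditional Google Search ranking.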
Apple
Applebot-Extended
Separate from Applebot. Applebot-Extended governs Apple Intelligence training data; Applebot powers Spotlight, Siri suggestions, and Safari. Blocking Applebot-Extended removes you from AI training without affecting those Apple search features.
Meta
Meta-ExternalAgent · meta-externalfetcher
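Meta-ExternalAgent is Meta's AI training crawler. meta-externalfetcher is a different kind of agent: Meta describes it as fetching individual links at a user's request, which puts it closer to the inference side. Both are newer than the OpenAI and Anthropic equivalents, and documentation and behavior are still evolving as of mid-2026.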
Where llms.txt fits
llms.txt is a separate proposal — a markdown file at the site root that tells AI agents what content exists on your site and how to find it. It was proposed by Jeremy Howard in September 2024 (llmstxt.org). The format is markdown, not the robots.txt grammar. The intent is roughly: where robots.txt tells crawlers what they may not read, llms.txt tells AI agents what they should prioritise reading.
Adoption is limited as of mid-2026. No major LLM provider has publicly confirmed they consume llms.txt as a primary signal. Ahrefs and Semrush both ran analyses on the question and reported no observed crawler traffic specifically attributable to llms.txt parsing. That does not mean it does nothing — it means there is no measured benefit yet. The standard is young.
Treat llms.txt as complementary to robots.txt, not a replacement. They control different things and have different syntaxes. If you ship one, ship both. If you only ship robots.txt today, that is the correct prioritisation; llms.txt can be added later when adoption signals turn measurable.
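For a sense of what the file looks like, here is a rough sketch that renders a minimal llms.txt in the layout the llmstxt.org proposal describes: an H1 title, a blockquote summary, then H2 sections containing markdown link lists. The `render_llms_txt` helper and the example site content are hypothetical.

```python
def render_llms_txt(title, summary, sections):
    """Render a minimal llms.txt.

    sections maps a heading to a list of (name, url, note) tuples;
    note may be an empty string.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            suffix = f": {note}" if note else ""
            lines.append(f"- [{name}]({url}){suffix}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_llms_txt(
        "Example Co",
        "Docs and pricing for Example Co.",
        {"Docs": [("Pricing", "https://example.com/pricing.md",
                   "Plans and limits")]},
    ))
```

Generating the file from the same source as your sitemap keeps it from drifting out of date, which matters more than the exact wording of each entry.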
Frequently asked questions
What's the difference between GPTBot and ChatGPT-User?
GPTBot crawls the web on OpenAI's own schedule to gather training data for future models. ChatGPT-User fetches a specific URL on demand when an individual ChatGPT user clicks a citation or asks a question that requires loading a page. GPTBot's crawls are scheduled and broad; ChatGPT-User's are triggered per-user and narrow. They have different user-agent strings, different traffic patterns, and different blocking implications.
Will blocking GPTBot stop ChatGPT from citing my site?
No — blocking GPTBot alone does not affect current ChatGPT citations. GPTBot trains future models. The bots that affect current ChatGPT visibility are OAI-SearchBot (which builds the search index ChatGPT uses) and ChatGPT-User (which fetches pages on demand). To stay citable in ChatGPT while opting out of training, block GPTBot and allow OAI-SearchBot and ChatGPT-User.
How do I verify a request really came from GPTBot?
OpenAI publishes IP ranges at openai.com/gptbot.json. When a request claims to be GPTBot, check the source IP against that file. If it matches, the request is legitimate; if it does not, the user-agent is being spoofed. Cloudflare Bot Analytics handles this verification automatically for sites behind Cloudflare. User-agent strings alone are trivially spoofable and should never be the only verification.
Does Cloudflare's one-click 'Block AI bots' toggle block all OpenAI bots?
It blocks GPTBot and OAI-SearchBot together. That combination removes you from both OpenAI training and current ChatGPT search citations — which is more than many sites intend when they flip it. If you want to opt out of training but stay citable in ChatGPT, do not use the one-click toggle; configure robots.txt manually with the granularity the OpenAI bots support.
Is llms.txt a replacement for robots.txt?
No. They control different things. robots.txt tells crawlers what they may not access. llms.txt is a markdown file telling AI agents what content exists on your site and how to find it. They are complementary, not substitutes. As of mid-2026 no major LLM provider publicly confirms consuming llms.txt as a primary signal, so robots.txt remains the load-bearing file for crawler control.
How often does OpenAI change the GPTBot user agent?
Rarely, but occasionally — typically when they rev the crawler version (e.g., GPTBot/1.0 → GPTBot/1.2). The vendor source of truth is developers.openai.com/api/docs/bots. Pin a check against that page in whatever process you use to audit your robots.txt; if you find a change, update your rules. The IP-range file (openai.com/gptbot.json) is updated more frequently than the user-agent string.
- OpenAI: Overview of OpenAI Crawlers (developers.openai.com)
- OpenAI: GPTBot IP ranges, JSON (openai.com)
- Anthropic: Web fetch and crawler documentation (anthropic.com)
- Google: Google-Extended controls for AI training (developers.google.com)
- Cloudflare Radar: GPTBot crawler stats (radar.cloudflare.com)
- Search Engine Journal: Complete crawler list for AI user agents, Petrosyan, Dec 2025 (searchenginejournal.com)
- Playwire: The complete list of AI crawlers and how to block each one (playwire.com)
- Jeremy Howard: The /llms.txt file specification (llmstxt.org)
- ai.robots.txt: community-maintained user-agent reference, GitHub (github.com)
- Aggarwal et al.: GEO: Generative Engine Optimization, KDD 2024 (arxiv.org)