Technical · 14 min read · Article 09

GPTBot: OpenAI's four crawlers in 2026

GPTBot is one of four bots. Cloudflare's one-click 'block AI bots' toggle blocks the wrong ones for most sites. Here's the four-bot map, the training-vs-inference distinction nobody else makes correctly, and three robots.txt recipes you can copy.

Ibrahim Furkan Ozcelik · works on GEO and AI search

Published June 20, 2026 · Updated June 20, 2026 · Source: ibrahimfurkanozcelik.com

GPTBot is one of four crawlers OpenAI runs. GPTBot trains the next model. OAI-SearchBot indexes the web for ChatGPT search. ChatGPT-User fetches pages on demand when a user clicks a citation. OAI-AdsBot crawls advertiser landing pages. Block GPTBot and your content stays out of the next training run; block OAI-SearchBot or ChatGPT-User and you become invisible inside ChatGPT entirely. They are not the same decision.

I built TurboAudit, a tool that audits AI search visibility (crawler accessibility is one of the things it checks). I have no relationship with OpenAI, no bot-blocking product to sell, and no preference for whether you allow or block any of these crawlers. Every user-agent string, IP range, and code sample below was verified against the vendor's own documentation in June 2026. When OpenAI rotates user agents — and they do — this page becomes wrong. Re-verify before shipping configuration based on it.

Map · 01

The four OpenAI bots

Most pages treat GPTBot as one thing. As of June 2026, OpenAI publicly documents four crawlers. They have different purposes and different blocking implications. The table below maps them.

| Bot | User agent | Purpose | Blocking means |
| --- | --- | --- | --- |
| GPTBot | `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot` | Training the next generation of OpenAI's models. | Your content will not be used to train future GPT models. It does not affect whether ChatGPT can cite your current pages in answers. |
| OAI-SearchBot | `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot` | Building and maintaining the index ChatGPT search uses when answering queries that need fresh web data. | Your pages will not appear in ChatGPT search results. This is the bot whose blocking has the biggest impact on your AI search visibility. |
| ChatGPT-User | `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot)` | Fetching specific URLs on demand when an individual ChatGPT user clicks a citation or asks a question that requires a live page load. Triggered per user, per session. | ChatGPT users who try to load your page through the chat interface will get a fetch error. The page can still be cited from the existing index if OAI-SearchBot has already seen it. |
| OAI-AdsBot | | Crawling advertiser landing pages so OpenAI can evaluate them for ad placement in ChatGPT's ad surfaces. | Your pages will not be eligible for OpenAI's ad inventory. For most sites this is irrelevant; for advertisers running OpenAI ad campaigns it matters. |

These user agents were correct on June 20, 2026. OpenAI's canonical reference is developers.openai.com/api/docs/bots — verify there before shipping any configuration. OpenAI rotates user-agent versions occasionally; this page will be updated when they do, but the canonical source remains theirs.
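If you want to see which of the four bots a request claims to be, a small header parser is enough. The sketch below is a minimal example, not OpenAI's method: the function name `identify_openai_bot` and the purpose labels are my own, and the token list is taken from the table above. It only reads what the header claims; it does not prove the request came from OpenAI.

```python
import re
from typing import Optional, Tuple

# Product tokens as named in this article. The purpose labels are
# illustrative shorthand, not vendor terminology.
OPENAI_BOTS = {
    "gptbot": "training",
    "oai-searchbot": "search index",
    "chatgpt-user": "on-demand fetch",
    "oai-adsbot": "ad landing pages",
}

# Matches "<token>/<version>", for example "GPTBot/1.2".
_TOKEN_RE = re.compile(
    r"(GPTBot|OAI-SearchBot|ChatGPT-User|OAI-AdsBot)/(\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def identify_openai_bot(user_agent: str) -> Optional[Tuple[str, str, str]]:
    """Return (token, version, purpose) if the header names an OpenAI bot."""
    match = _TOKEN_RE.search(user_agent)
    if match is None:
        return None
    token, version = match.group(1), match.group(2)
    return token, version, OPENAI_BOTS[token.lower()]
```

Because the version is captured separately, the same function can flag a version bump when you re-audit your rules against the vendor page.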

Distinction · 02

Training versus inference: why the distinction matters

Of all the things this page tries to argue, this one is load-bearing. Read it twice if you read nothing else.

AI crawlers fall into two categories with very different consequences when blocked. Training bots — GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent — crawl the web to update the model's training data. The model is then frozen at a particular cutoff date and ships. Blocking these bots means: your content will not be part of the next model release. The effect is delayed and indirect — you affect what future models know about you, not what the current ChatGPT or Claude or Gemini knows.

Inference and citation bots — OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot, Claude-User — crawl the web to serve a specific user query. They build the live index that AI search results pull from, and they fetch pages on demand when a user clicks a citation. Blocking these bots means: you will not appear in answers users are receiving right now. The effect is immediate, direct, and visible in your AI search visibility tooling within days.

These are different decisions because the consequences are different. A privacy-conscious site might reasonably want to block training (don't put my content in the model) while keeping inference open (people who ask should still find me). A site that has decided AI search is not their distribution channel might block both. A site that wants maximum AI distribution should allow both.

Most published guidance conflates the two. The standard advice — "add `User-agent: GPTBot Disallow: /` to your robots.txt" — blocks only one of OpenAI's four bots. Cloudflare's popular one-click "Block AI Bots" toggle blocks GPTBot and OAI-SearchBot together. Sites that flip it lose their ChatGPT search citations and frequently do not know why. The toggle does what its label says; the label hides the distinction.

Decision · 03

Should you block GPTBot? Four scenarios

Stop asking "should I block GPTBot" as if it's one question. It's four questions: which combination of training and inference bots fits your situation. Below are the four real configurations sites are running in 2026 and when each one makes sense.

01

Block training, allow inference

Blocks: GPTBot, ClaudeBot, Google-Extended
Allows: OAI-SearchBot, ChatGPT-User, Perplexity*, Claude-SearchBot, Claude-User

The have-your-cake-and-eat-it config. Your content does not train the next model (useful if you care about IP or compete with whoever built the model), but you remain citable in ChatGPT search, Perplexity, and Claude search results. Picked by premium publishers and many SaaS marketing sites that want AI visibility without contributing to training. Probably the right default for most brands as of mid-2026.

02

Block everything

Blocks: All AI training and inference bots
Allows: Googlebot only (and other non-AI search engines)

The IP-protective config. You accept invisibility in AI search in exchange for total exclusion from training and live AI citations. Picked by some news publishers (NYT, FT, WSJ at various points), high-end legal and medical sites, and brands whose lawyers have advised maximum exclusion. Honest about the trade-off: you opt out of an entire distribution channel that is growing.

03

Allow everything

Blocks: Nothing
Allows: All AI crawlers

The maximum-visibility config. Your content trains future models and appears in current AI citations. Picked by SaaS marketing sites that want to be the default LLM reference for their category, documentation sites that want to become the canonical answer for their tool, and most content marketing operations. The right answer for sites whose business model is being found and cited, including most B2B SaaS.

04

Allow training, block inference

Blocks: OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-SearchBot
Allows: GPTBot, ClaudeBot

Almost always a mistake. You contribute your content to training future models but do not appear in live answers. Opt-in pollution. The only sites for which this makes sense are ones doing some specific research on LLM training corpora, and they tend to know they are an edge case.
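The four scenarios are two yes/no choices: block training or not, block inference or not. The sketch below makes that explicit by generating robots.txt groups from two flags. The function name `build_robots_txt` is hypothetical, the bot lists are the ones this article names, and OAI-AdsBot is left out because it fits neither category. Treat it as a starting point and check the output before serving it.

```python
# Bot names as used in this article; verify them against vendor docs
# before relying on the generated file.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended",
    "Applebot-Extended", "Meta-ExternalAgent",
]
INFERENCE_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "Claude-SearchBot", "Claude-User",
]


def build_robots_txt(block_training: bool, block_inference: bool) -> str:
    """Render one robots.txt group per bot for the chosen scenario."""
    groups = []
    for blocked, bots in (
        (block_training, TRAINING_BOTS),
        (block_inference, INFERENCE_BOTS),
    ):
        rule = "Disallow: /" if blocked else "Allow: /"
        for bot in bots:
            groups.append(f"User-agent: {bot}\n{rule}")
    # Blank lines separate groups, as in hand-written robots.txt files.
    return "\n\n".join(groups) + "\n"


# Scenario 01: block training, allow inference.
scenario_one = build_robots_txt(block_training=True, block_inference=False)
```

Scenario 04 is `build_robots_txt(False, True)`. The function will generate it without complaint, which is a reminder that the generator encodes no judgment about which combination is sensible.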

Cookbook · 04

The robots.txt cookbook

Three recipes, each named after the scenario above, copyable as-is. Verified against vendor docs on June 20, 2026. Each one is annotated with what it does and what it deliberately leaves out.

Recipe 1 — Block training, allow inference

Most common modern config. Stay out of training, remain citable in AI search.

```
# Block OpenAI training (GPTBot)
User-agent: GPTBot
Disallow: /

# Block Anthropic training (ClaudeBot)
User-agent: ClaudeBot
Disallow: /

# Block Google AI training (Google-Extended)
User-agent: Google-Extended
Disallow: /

# Block Apple Intelligence training (Applebot-Extended)
User-agent: Applebot-Extended
Disallow: /

# Block Meta AI training
User-agent: Meta-ExternalAgent
Disallow: /

# Explicitly allow inference / citation bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /
```

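The explicit Allow lines are belt-and-suspenders: most crawlers treat a path as allowed unless a rule disallows it. Including them documents intent for anyone reading the file later, and it protects against a wildcard `User-agent: *` disallow elsewhere in the file, because a crawler follows the most specific group that matches its name rather than the wildcard group.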
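Before deploying a recipe, you can sanity-check it locally with Python's standard-library `urllib.robotparser`. The sketch below parses a shortened version of Recipe 1 and asks whether each bot may fetch a URL. The helper name `check_access` and the example URL are my own. One caveat: Python's parser uses substring matching and first-match ordering, so it approximates how a crawler reads the file and is not a guarantee of any specific crawler's behavior.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser

# Shortened Recipe 1: two training bots blocked, two inference bots allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def check_access(robots_txt: str, bots: List[str], url: str) -> Dict[str, bool]:
    """Map each bot name to whether the parsed rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


result = check_access(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "OAI-SearchBot", "ChatGPT-User"],
    "https://example.com/pricing",
)
# GPTBot and ClaudeBot map to False; OAI-SearchBot and ChatGPT-User map to True.
```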

Recipe 2 — Block everything

Opt out of AI training and AI citations. Stay in Google Search.

```
# OpenAI
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-AdsBot
Disallow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

# Google AI (separate from Googlebot)
User-agent: Google-Extended
Disallow: /

# Perplexity
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

# Apple Intelligence
User-agent: Applebot-Extended
Disallow: /

# Meta AI
User-agent: Meta-ExternalAgent
Disallow: /
```

Notice that Googlebot is not listed: blocking Google-Extended does not block Googlebot, so your pages still appear in regular Google Search. News publishers that have opted out of AI training typically run a file along these lines.

Recipe 3 — Allow everything
Maximum AI visibility. Train me, cite me, fetch me on demand.
```
# No AI-specific Disallow lines needed.
# Default robots.txt behavior allows all crawlers
# unless explicitly disallowed. The cleanest version
# of "allow all AI bots" is to not mention them at all.

User-agent: *
Allow: /
```
Counterintuitively, the most permissive AI config is the shortest. Allow lines for individual AI bots are redundant if your wildcard is already permissive. Listing the AI bots here would imply you considered the question and made a deliberate choice — which is sometimes valuable for documentation purposes; usually it's noise.

Three checks after editing robots.txt: (1) curl your robots.txt and confirm the file you edited is the one being served, (2) use a robots.txt validator to confirm syntax is correct, (3) wait 24–72 hours and check your server logs to confirm crawler behavior matches what you configured. Cloudflare, Vercel, and Netlify can each override robots.txt with their own bot management settings — if you are behind any of them, check both layers.
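For the third check, a short log tally shows which AI bots are still requesting pages after the change. The sketch below is a minimal example that assumes your access log includes the User-Agent header on each line, as the common combined log format does. The function name `count_bot_hits` and the token list are illustrative; extend the list to match the bots you configured.

```python
import re
from collections import Counter
from typing import Iterable

# Tokens to look for; names as used in this article.
BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "OAI-AdsBot", "ClaudeBot", "PerplexityBot",
]
_BOT_RE = re.compile("|".join(re.escape(t) for t in BOT_TOKENS), re.IGNORECASE)
_CANONICAL = {t.lower(): t for t in BOT_TOKENS}


def count_bot_hits(log_lines: Iterable[str]) -> Counter:
    """Count log lines per bot token, based on the claimed user agent."""
    hits: Counter = Counter()
    for line in log_lines:
        match = _BOT_RE.search(line)
        if match:
            # Normalise case so "gptbot" and "GPTBot" land in one bucket.
            hits[_CANONICAL[match.group(0).lower()]] += 1
    return hits
```

A blocked bot that keeps requesting pages other than `/robots.txt` days after the change is worth investigating. It may be a spoofed user agent, or a CDN layer serving a different file than the one you edited.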

Verification · 05

Is the bot really GPTBot?

User-agent strings are trivially spoofable. Anyone scraping the web can set their User-Agent header to GPTBot's exact string and request your pages. Some scrapers do exactly this, on the assumption that site owners let known good-bot user agents through without rate-limiting.

OpenAI publishes the list of IP ranges its bots crawl from at openai.com/gptbot.json. The JSON file is updated when ranges change. The verification pattern is: when a request claims to be GPTBot, check the source IP against the published ranges. If it matches, the request is legitimate. If it does not, you have a spoofer.

For sites behind Cloudflare, the Bot Analytics view already surfaces verified-versus-unverified bot traffic, and Cloudflare maintains its own GPTBot verification independent of the user-agent string. For sites running their own infrastructure, the verification logic belongs in your application layer or in your CDN's bot rules — not in robots.txt, which is a directive to the crawler, not a verification of the crawler.

Anthropic and Perplexity publish equivalent IP range references. The pattern is the same: never trust the user-agent header alone for identification.
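The IP check itself is a few lines with Python's standard-library `ipaddress` module. The sketch below is a minimal example under one loud assumption: that the published ranges document has the shape `{"prefixes": [{"ipv4Prefix": "..."}]}`. Confirm the real layout of the vendor's file before relying on it. The function names and the sample ranges (reserved documentation addresses) are hypothetical, and fetching and caching the file is left out.

```python
import ipaddress
from typing import Any, Dict, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def load_networks(ranges_doc: Dict[str, Any]) -> List[Network]:
    """Collect CIDR blocks from a parsed ranges document.

    ASSUMPTION: entries look like {"ipv4Prefix": "..."} or
    {"ipv6Prefix": "..."} under a top-level "prefixes" key.
    """
    networks: List[Network] = []
    for entry in ranges_doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def is_verified(source_ip: str, networks: List[Network]) -> bool:
    """True if source_ip falls inside any of the published ranges."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in networks)


# Hypothetical sample data using reserved documentation ranges.
SAMPLE_RANGES = {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}]}
NETWORKS = load_networks(SAMPLE_RANGES)
```

In practice you would call `is_verified` only for requests whose user agent claims to be a known bot, then rate-limit or block the ones that fail.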

Other crawlers · 06

The other AI crawlers — quick reference

Brief reference for the non-OpenAI bots. The same training-vs-inference distinction applies to most of them. Block in robots.txt the same way you block any user agent.

Anthropic
ClaudeBot · Claude-User · Claude-SearchBot
ClaudeBot is training. Claude-User is the per-user fetch (Claude with web access). Claude-SearchBot is the search index. As of 2026 Anthropic deprecated the older anthropic-ai and Claude-Web user agents; rules referencing those names no longer match Anthropic traffic and should be updated.
Perplexity
PerplexityBot · Perplexity-User
PerplexityBot is the primary indexing crawler. Perplexity-User is the per-user fetch when a Perplexity user asks a question that requires a fresh page load. Blocking PerplexityBot removes you from Perplexity search results.
Google
Google-Extended
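Separate from Googlebot. Google-Extended controls whether Google can use your content for Gemini training and grounding. It is a robots.txt control token rather than a distinct crawler, and blocking it has no effect on Google Search inclusion or ranking. AI Overviews are a Search feature and follow Googlebot's rules, not Google-Extended's.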
Apple
Applebot-Extended
Separate from Applebot. Applebot-Extended governs Apple Intelligence training data; Applebot powers Spotlight, Siri suggestions, and Safari. Blocking Applebot-Extended removes you from AI training without affecting Apple Search.
Meta
Meta-ExternalAgent · meta-externalfetcher
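Meta-ExternalAgent is Meta's AI training crawler; meta-externalfetcher performs user-initiated fetches of individual links. Both are newer than the OpenAI and Anthropic equivalents, and documentation and behavior are still evolving as of mid-2026.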

Adjacent · 07

Where llms.txt fits

llms.txt is a separate proposal — a markdown file at the site root that tells AI agents what content exists on your site and how to find it. It was proposed by Jeremy Howard in September 2024 (llmstxt.org). The format is markdown, not the robots.txt grammar. The intent is roughly: where robots.txt tells crawlers what they may not read, llms.txt tells AI agents what they should prioritise reading.

Adoption is limited as of mid-2026. No major LLM provider has publicly confirmed they consume llms.txt as a primary signal. Ahrefs and Semrush both ran analyses on the question and reported no observed crawler traffic specifically attributable to llms.txt parsing. That does not mean it does nothing — it means there is no measured benefit yet. The standard is young.

Treat llms.txt as complementary to robots.txt, not a replacement. They control different things and have different syntaxes. If you ship one, ship both. If you only ship robots.txt today, that is the correct prioritisation; llms.txt can be added later when adoption signals turn measurable.
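To make the format difference concrete, here is a minimal sketch that renders an llms.txt file following the shape described at llmstxt.org: an H1 title, a blockquote summary, then H2 sections of markdown links. The function name `render_llms_txt` and all site details are hypothetical.

```python
from typing import Dict, List, Tuple

# Each link is (name, url, note).
Link = Tuple[str, str, str]


def render_llms_txt(title: str, summary: str, sections: Dict[str, List[Link]]) -> str:
    """Render an llms.txt body: H1, blockquote summary, H2 link sections."""
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical site content, for illustration only.
llms_txt = render_llms_txt(
    "Example Co",
    "Example Co makes invoicing software for freelancers.",
    {"Docs": [("Pricing", "https://example.com/pricing.md", "Plans and limits")]},
)
```

The output is plain markdown with no allow or disallow semantics, which is why it cannot stand in for robots.txt.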

FAQ · 08

Frequently asked questions

Q · 01

What's the difference between GPTBot and ChatGPT-User?

GPTBot crawls the web on OpenAI's own schedule to gather training data for future models. ChatGPT-User fetches a specific URL on demand when an individual ChatGPT user clicks a citation or asks a question that requires loading a page. GPTBot's crawls are scheduled and broad; ChatGPT-User's are triggered per-user and narrow. They have different user-agent strings, different traffic patterns, and different blocking implications.

Q · 02

Will blocking GPTBot stop ChatGPT from citing my site?

No — blocking GPTBot alone does not affect current ChatGPT citations. GPTBot trains future models. The bots that affect current ChatGPT visibility are OAI-SearchBot (which builds the search index ChatGPT uses) and ChatGPT-User (which fetches pages on demand). To stay citable in ChatGPT while opting out of training, block GPTBot and allow OAI-SearchBot and ChatGPT-User.

Q · 03

How do I verify a request really came from GPTBot?

OpenAI publishes IP ranges at openai.com/gptbot.json. When a request claims to be GPTBot, check the source IP against that file. If it matches, the request is legitimate; if it does not, the user-agent is being spoofed. Cloudflare Bot Analytics handles this verification automatically for sites behind Cloudflare. User-agent strings alone are trivially spoofable and should never be the only verification.

Q · 04

Does Cloudflare's one-click 'Block AI bots' toggle block all OpenAI bots?

It blocks GPTBot and OAI-SearchBot together. That combination removes you from both OpenAI training and current ChatGPT search citations — which is more than many sites intend when they flip it. If you want to opt out of training but stay citable in ChatGPT, do not use the one-click toggle; configure robots.txt manually with the granularity the OpenAI bots support.

Q · 05

Is llms.txt a replacement for robots.txt?

No. They control different things. robots.txt tells crawlers what they may not access. llms.txt is a markdown file telling AI agents what content exists on your site and how to find it. They are complementary, not substitutes. As of mid-2026 no major LLM provider publicly confirms consuming llms.txt as a primary signal, so robots.txt remains the load-bearing file for crawler control.

Q · 06

How often does OpenAI change the GPTBot user agent?

Rarely, but occasionally — typically when they rev the crawler version (e.g., GPTBot/1.0 → GPTBot/1.2). The vendor source of truth is developers.openai.com/api/docs/bots. Pin a check against that page in whatever process you use to audit your robots.txt; if you find a change, update your rules. The IP-range file (openai.com/gptbot.json) is updated more frequently than the user-agent string.

Sources

OpenAI: Overview of OpenAI Crawlers (developers.openai.com)
OpenAI: GPTBot IP ranges, JSON (openai.com)
Anthropic: Web fetch and crawler documentation (anthropic.com)
Google: Google-Extended controls for AI training (developers.google.com)
Cloudflare Radar: GPTBot crawler stats (radar.cloudflare.com)
Search Engine Journal: Complete crawler list for AI user agents, Petrosyan, Dec 2025 (searchenginejournal.com)
Playwire: The complete list of AI crawlers and how to block each one (playwire.com)
Jeremy Howard: The /llms.txt file specification (llmstxt.org)
ai.robots.txt: community-maintained user-agent reference, GitHub (github.com)
Aggarwal et al.: GEO: Generative Engine Optimization, KDD 2024 (arxiv.org)
