The Quiet Death of robots.txt: How AI Crawlers Actually Decide What to Read

May 7, 2026 · 9 min read

The Convention That Built the Open Web

In 1994, a Dutch software engineer named Martijn Koster proposed a small text file at the root of every website. The file would tell automated visitors which paths they were welcome to crawl and which they should leave alone. There was no enforcement mechanism. There was a one-page specification, an honor system, and an industry that more or less agreed to play along. That convention became robots.txt, and for the better part of three decades it held the open web together.

The reason it worked was economic, not technical. The dominant crawlers were search engines, and a publisher who caught one ignoring robots.txt could complain loudly. Non-compliance carried reputational and ranking cost. The honor system held because everyone with a meaningful crawler had skin in the game.

That equilibrium is under strain. The crawlers publishers are arguing about in 2026 are not Googlebot or Bingbot. They are AI training crawlers, retrieval-augmented generation fetchers, and user-initiated agents acting on behalf of someone typing into a chat box. Each has a different incentive structure and a different track record on respecting the convention.

An Honest Taxonomy of AI Crawlers in 2026

The first source of confusion is that "AI crawlers" is not one category. There are at least three distinct activities under that loose umbrella, governed by different policies even within a single company.

Training crawlers fetch pages so the content can be incorporated into the next foundation model. The fetches are bulk, periodic, and not tied to any specific user query. Retrieval crawlers fetch pages so a model can read them in real time and synthesize an answer. These run when a user asks a question and the answer engine decides your page is relevant. User-initiated agents fetch pages because a specific person, sitting at a chat interface, asked the assistant to read or summarize a specific URL. From the publisher's perspective these look similar in the logs; the consent calculus is quite different.

Here is a working list of AI crawlers a publisher will see in 2026, with their stated purpose.

  • OpenAI: GPTBot (training), ChatGPT-User (user-initiated agent fetches), OAI-SearchBot (search index for the ChatGPT search surface). All three are documented and OpenAI publishes the IP ranges they originate from.
  • Anthropic: ClaudeBot (training), Claude-User (user-initiated fetches), Claude-SearchBot (search index). Anthropic publishes a verification page for confirming each user-agent.
  • Google: Google-Extended is not a separate crawler but a token a publisher places in robots.txt to opt out of having Googlebot's already-fetched content used to train Gemini. GoogleOther is the crawler used for various non-search Google research and product purposes.
  • Apple: Applebot-Extended is the analogue of Google-Extended: it lets a publisher opt out of having Applebot's already-fetched content used to train Apple Intelligence, without affecting Spotlight or Siri search.
  • Perplexity: PerplexityBot (search and summarization) and Perplexity-User (user-initiated agent fetches).
  • ByteDance: Bytespider, used for training the company's models.
  • Common Crawl: CCBot, the crawler behind the non-profit Common Crawl archive, whose scrapes appear in the training data of nearly every public LLM ever trained.
  • Diffbot: Diffbot, which sells a structured knowledge graph derived from web content.
  • Amazon: Amazonbot, used for Alexa-related and other Amazon AI products.
  • Meta: Meta-ExternalAgent, used for AI training and product purposes.

This list is not exhaustive and it is moving. The right defensive posture is not to memorize a static list but to monitor server logs for new user-agent strings and update robots.txt on an ongoing basis.

The Perplexity Episode and What It Showed

In June 2024, the developer Robb Knight published an investigation showing that Perplexity appeared to be fetching pages even when the publisher had explicitly disallowed PerplexityBot in robots.txt. Wired followed with similar findings: requests from cloud IP space, with user-agent strings that looked like a vanilla Chrome browser rather than an identifiable bot, retrieving content for Perplexity to summarize. Wired reported that the company appeared to be using a fetcher that bypassed both robots.txt directives and its own published bot identification.

Perplexity's response, in summary, was that some of the disputed traffic came from third-party services it used and that the situation was more nuanced than the early coverage suggested. The company has since updated its public documentation, separated PerplexityBot from Perplexity-User, and clarified its position on user-initiated retrieval. It would be unfair to treat the 2024 episode as an accurate description of how Perplexity operates today; the company's stated policies have evolved.

What the episode showed, regardless of blame, is the structural problem. robots.txt is a request, not a fence. A crawler that decides to fetch as a generic browser, from a residential proxy, with a rotating user-agent, is technically capable of doing so. Whether it should is a question of norms, contracts, and reputation, not HTTP semantics. Once a category of crawler has a strong product reason to read pages it has been told not to read, the convention starts to wobble.

The AI industry has a wider distribution of behavior than the search-engine industry did at its peak. Some operators are scrupulous, some sloppy, some aggressive. The strategic question shifts from "what does my robots.txt say?" to "what do I do for the operators who do not care what it says?"

The Mechanisms Publishers Actually Have

The tools available to a publisher in 2026 are a mix of the original convention, newer file-based conventions, network-level controls, and commercial agreements. None of them is a complete answer on its own.

robots.txt and X-Robots-Tag

The original mechanisms are still the foundation. A robots.txt file with explicit User-agent blocks for each AI crawler you have a position on remains the cleanest way to publish your stance. The X-Robots-Tag HTTP header and the equivalent <meta name="robots"> tag let you express the same intent at the page level. Compliance is voluntary, but reputable operators do honor these directives, and having an explicit position published makes any later complaint about misuse far more legitimate.
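
As a concrete sketch (the path is a placeholder, and per-operator support for the header and meta forms varies), the same position can be expressed at all three layers. In robots.txt, a path-scoped stanza for one crawler:

    User-agent: GPTBot
    Disallow: /members/

As an HTTP response header, optionally scoped to a named crawler (the user-agent prefix is documented by Google; whether AI operators parse it is less certain):

    X-Robots-Tag: GPTBot: noindex

And at the page level, applying to any compliant crawler that reads the page:

    <meta name="robots" content="noindex">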

llms.txt

It is worth being precise here, because the two files are often conflated. llms.txt is not a permission file. It is a content map: a Markdown document at the site root that tells an LLM what your site is, where its important sections live, and how a model should think about it when answering a user question. The convention is documented at llmstxt.org. robots.txt says "do not read these paths." llms.txt says "if you do read this site, here is how to read it well." Both are useful, and they sit at different layers.
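
A skeletal llms.txt, following the convention documented at llmstxt.org, is ordinary Markdown: an H1 with the site name, a blockquote summary, then H2 sections of annotated links. Everything below is placeholder content:

    # Example Docs

    > Developer documentation for the Example platform: API reference,
    > integration guides, and tutorials.

    ## Docs

    - [Quickstart](https://example.com/docs/quickstart.md): a first integration in ten minutes
    - [API reference](https://example.com/docs/api.md): endpoints, auth, and rate limits

    ## Optional

    - [Changelog](https://example.com/changelog.md): release history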

Network-level controls

The most decisive controls are at the network layer. Cloudflare's bot management and its specific AI-crawler controls let an origin block known AI bots even when the bot ignores robots.txt, because the block happens before the request reaches the application. Other CDNs and WAFs offer similar features. The trade-off is that aggressive bot blocking can also catch legitimate user traffic and benign tools, so the policy needs tuning.
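
The same logic, at its crudest, can run on a single origin server. The nginx sketch below (the variable name is hypothetical) returns 403 to self-identified AI training bots; it is not a substitute for managed bot detection, because a crawler that lies about its user-agent sails straight past it:

    # In the http {} block: classify requests by User-Agent string.
    map $http_user_agent $blocked_ai_bot {
        default 0;
        "~*(GPTBot|ClaudeBot|Bytespider|CCBot|Meta-ExternalAgent)" 1;
    }

    # In each server {} block: refuse classified requests before they
    # reach the application.
    if ($blocked_ai_bot) {
        return 403;
    }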

Emerging standards and licensing

Several efforts are underway to put AI-specific content controls on a more formal footing, with active discussion in the IETF and the W3C around AI content disclosure, opt-out signaling, and provenance metadata. None of these are stable standards yet. Track them, but do not bet a strategy on a draft RFC. The more consequential development in 2024 and 2025 was the emergence of paid licensing between AI companies and large publishers — Reddit signed agreements with OpenAI and Google, Stack Overflow signed deals with multiple model labs, and a long list of news organizations has now licensed content to OpenAI, Microsoft, Google, and others. These deals are evidence that, for the largest publishers, the answer to "should AI crawlers read my content?" is no longer a binary in robots.txt; it is a negotiation.

What a Publisher Should Actually Do

The strategic question in 2026 is not "block or allow." It is "what posture do I take, and how do I express it consistently across the mechanisms I have?" There are roughly four defensible postures, and the right one depends on your business model.

Allow everything. Appropriate if your traffic is overwhelmingly informational, your monetization does not depend on the page-view, and you want maximum citation share inside answer engines. The natural posture for educational sites, documentation, and most blogs whose author wants reach. Express it by allowing all known AI crawlers in robots.txt and publishing a clean llms.txt.

Allow citation, disallow training. Appropriate if you want to be cited by ChatGPT, Perplexity, and Claude when a user asks a question whose answer is on your site, but you do not want your content folded into the next foundation model. Express it by allowing user-initiated and search bots (ChatGPT-User, OAI-SearchBot, Claude-User, Claude-SearchBot, Perplexity-User, PerplexityBot) and disallowing the training-only bots (GPTBot, ClaudeBot, Bytespider, CCBot, Google-Extended, Applebot-Extended).
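
As a robots.txt, using grouped user-agent lines (RFC 9309 allows several User-agent lines to share one rule group) and the agent names from the taxonomy above, current as of this writing:

    # Retrieval, search, and user-initiated fetches: allowed
    User-agent: ChatGPT-User
    User-agent: OAI-SearchBot
    User-agent: Claude-User
    User-agent: Claude-SearchBot
    User-agent: PerplexityBot
    User-agent: Perplexity-User
    Allow: /

    # Training: disallowed
    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: Bytespider
    User-agent: CCBot
    User-agent: Google-Extended
    User-agent: Applebot-Extended
    Disallow: /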

Disallow all AI crawlers. Appropriate if your content is the product, your monetization is paywalled, or you have made a strategic decision that AI summarization is bad for your business. Express it in robots.txt, then enforce it at the network layer because the directive alone will not stop the operators who do not care.

Paywall everything. The strongest posture. Content behind authentication is not visible to bots that are not authenticated, and your legal position against unauthorized access is far stronger than a complaint about an ignored advisory file. The trade-off is the obvious one: a paywalled site is invisible to search and answer engines alike.

A Decision Tree You Can Apply Today

If you have not thought about your AI-crawler policy explicitly, here is a short sequence of questions that will get you to a defensible posture in under an hour.

  1. Is page-view advertising the primary monetization for the content? If yes, you have a real business reason to think hard about answer-engine summarization, because users who get the answer from a chat interface do not view your ad. If no, the calculus shifts toward allowing more.
  2. Do you want your content cited by name when an answer engine answers a user's question? If yes, you need to allow at least the user-initiated and search bots from the major operators. If no, disallow them and accept the loss of attribution.
  3. Do you want your content used to train future models? A separate question from citation. The training-specific user-agents (GPTBot, ClaudeBot, Bytespider, Google-Extended, Applebot-Extended) let you take a different position on training than on retrieval.
  4. What is your enforcement layer? If your answer to the previous questions is "disallow," your robots.txt alone is not enough. Decide whether you will add network-level blocking, accept that some operators will ignore the directive, or paywall the content.
  5. Are you monitoring server logs? A static robots.txt written six months ago will not address the new bots that have shown up since. A monthly review of unusual user-agent strings in your access logs is the difference between a real policy and a paper one.
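
That review does not need tooling. A short script over the access log gets most of the way there; the sketch below assumes the common combined log format, where the user-agent is the last quoted field of each line, and uses a deliberately crude keyword heuristic you should tune to your own traffic:

    # Tally bot-looking user-agent strings from an access log on stdin,
    # so new crawlers stand out during a monthly review.
    import re
    import sys
    from collections import Counter

    # Combined log format ends with the quoted user-agent field.
    UA_FIELD = re.compile(r'"([^"]*)"\s*$')
    # Crude heuristic for "probably a bot"; tune to your traffic.
    BOT_HINT = re.compile(r'bot|crawl|spider|fetch|agent|gpt|claude|perplexity',
                          re.IGNORECASE)

    counts = Counter()
    for line in sys.stdin:
        match = UA_FIELD.search(line)
        if match and BOT_HINT.search(match.group(1)):
            counts[match.group(1)] += 1

    for agent, hits in counts.most_common(30):
        print(f"{hits:8d}  {agent}")

Feed it the month's access log on standard input and diff the output against last month's; anything new near the top of the list is a crawler you have not yet taken a position on.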

The convention that built the open web is not dead, but it no longer does the work it used to do alone. Treat robots.txt as your published position, llms.txt as your content map, your CDN as your enforcement, and the licensing landscape as where the long-term defaults are being set. Together they are the closest thing to a coherent policy in 2026. Whatever you choose, write it down — our own content policy is a worked example of the kind of disclosure AI-era publishers should consider standard.
