
How AI Answer Engines Actually Pick Sources

May 6, 2026 · 7 min read

Search Traffic Is Not Going to Zero, But It Is Changing Shape

For roughly twenty-five years, the dominant question for any publisher who wanted free traffic was: how do I rank in Google's organic results? In 2026, that question is no longer sufficient. A meaningful and growing share of information lookups now happens inside answer engines: ChatGPT's web tool, Perplexity, Claude's web search, Google's Gemini-powered AI Overviews, You.com, Brave's Leo, and a dozen smaller surfaces. These tools do not return ten blue links; they return a synthesized answer, with a small number of cited sources tucked into footnotes, sidebars, or hover cards.

The economic impact is asymmetric. If your page used to be the third Google result, you would still get clicks. If your page is one of three sources cited at the bottom of a ChatGPT answer, you might get a tiny attribution link, but the user has already gotten what they came for. Click-through rates from answer engines are widely reported to be much lower than from traditional SERPs, though the data is still messy and varies enormously by query type.

This post is an attempt to describe, as honestly as possible, how these engines appear to pick the sources they cite, what genuinely works to improve your odds, and where the field is still uncertain enough that you should hold your strategy loosely.

How Each Major Engine Appears to Pick Sources

Let me say this carefully: nobody outside the engineering teams at OpenAI, Perplexity, Anthropic, and Google knows exactly how citation selection works inside these systems. Public documentation is sparse. What follows is informed by published material from the companies, observed behavior, and reasonable inferences from how the underlying retrieval architectures work.

ChatGPT (Web Search)

ChatGPT's browsing tool uses a search backend (Bing for the consumer product, with some reranking and filtering) and pulls a small number of pages into context for the model to read. Citations appear when the model decides a piece of information needs attribution. From observed behavior, the engine seems to favor sources that are crawlable, have reasonable authority signals, and have content that directly answers the user's question. Pages with structured data and clear headings appear to be cited more readily, presumably because they are easier for the retrieval layer to extract clean snippets from.

Perplexity

Perplexity is the most transparent of the major engines about its retrieval approach. Its public material describes a multi-source retrieval pipeline that pulls from a custom index, search APIs, and direct fetches. Perplexity has been notably willing to cite smaller and newer sites, including Reddit threads, technical blog posts, and forum discussions. Recency appears to matter more than it does in traditional Google ranking: if your page is fresh and well-structured, Perplexity can surface it even when your domain has limited backlink authority.

Claude (Web Search)

Claude's web search tool is available in both the consumer Claude.ai product and through the API. Citations are returned as structured objects rather than inline footnotes, which suggests Anthropic is treating provenance as a first-class concern. Observed behavior is similar to other retrieval-augmented systems: clear, well-structured pages with directly relevant content tend to get cited.
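
For the API side, here is a minimal sketch of what those structured citations look like in practice. The tool type and citation fields follow what Anthropic's public docs describe at the time of writing, and the model name is illustrative; treat the exact shapes as an assumption to verify, not a guarantee:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative; use whichever model you target
    max_tokens=1024,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{"role": "user", "content": "What is the llms.txt convention?"}],
)

# Citations arrive as structured fields attached to text blocks, each
# carrying the source URL, page title, and the exact cited span --
# provenance as data, not as inline footnote markup.
for block in response.content:
    if block.type == "text" and getattr(block, "citations", None):
        for citation in block.citations:
            print(citation.url, "-", citation.title)
```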

Gemini and Google AI Overviews

Gemini-powered AI Overviews in Google Search are the highest-stakes surface for most publishers because they sit directly above the ten blue links. Google has confirmed publicly that AI Overviews use the same crawling and indexing infrastructure as traditional search, which means traditional SEO fundamentals still apply: crawlability, page experience, helpful content, E-E-A-T signals. The difference is that AI Overviews appear to favor sources that can be quoted in short, factual sentences without additional editorial gloss.

How AEO Actually Differs From SEO

It would be convenient to say that "answer engine optimization" is just SEO with new acronyms, and a lot of the existing playbook does carry over. But there are meaningful differences in emphasis that any working publisher will notice within a few months of paying attention.

  • Citation density beats ranking density. Traditional SEO rewards pages that capture broad keyword footprints. AEO rewards pages that contain a small number of specific, citable facts. A 4000-word listicle that "covers everything" is often less likely to be cited than a focused 800-word piece with three concrete claims and clear sourcing.
  • Primary-source linking matters more. Answer engines preferentially cite content that itself links to and quotes primary sources. If your post about a new IRS rule links directly to the relevant section of the Internal Revenue Code rather than to another blog post that summarizes it, retrieval systems treat your page as more authoritative.
  • Headings and short paragraphs dominate. Answer engines extract spans of text. Spans that are well-bounded by an H2 or H3 above and a paragraph break below are easier to extract cleanly. Long, meandering paragraphs full of compound sentences may rank fine in Google but are harder for retrieval systems to lift cleanly.
  • Schema.org coverage is no longer optional. JSON-LD markup for Article, FAQPage, BreadcrumbList, Organization, and HowTo gives retrieval systems explicit hooks for what your page is and how to attribute it. The schemas are documented at schema.org and adding them is mostly mechanical work.
  • Recency is treated as a stronger signal. Traditional SEO treats freshness as one factor among many. Answer engines, especially Perplexity and the newer ChatGPT variants, appear to weight recency heavily for any query that has temporal context. A piece dated two months ago will often beat a piece dated three years ago even if the older piece is technically more authoritative.
  • Crawl access for AI bots is its own decision. The decision of whether to allow GPTBot, ClaudeBot, PerplexityBot, and Google-Extended in your robots.txt is now a strategic choice. Blocking them protects your content from being trained on or summarized; allowing them is the price of admission for being citable. Most publishers serious about AEO allow them.

What a Publisher Should Do This Week

Here is a practical, week-of-effort plan that is defensible regardless of how the field evolves over the next year. None of this is a silver bullet. It is the boring infrastructure work that has to be in place before any more speculative tactic has a chance of mattering.

Day 1 — Schema.org Coverage

Audit every important page type on your site. For each, identify the appropriate schema and add JSON-LD. The minimum set for most content sites is Organization (homepage), Article or BlogPosting (every post), FAQPage (anywhere you have a Q-and-A section), BreadcrumbList (every non-homepage), and WebSite with a SearchAction (homepage). Validate everything with the Schema.org validator and Google's Rich Results Test.
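
As a concrete reference point, here is a minimal Article block of the kind this audit should produce. The structure is standard schema.org JSON-LD; every name and URL below is a placeholder:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How AI Answer Engines Actually Pick Sources",
  "datePublished": "2026-05-06",
  "dateModified": "2026-05-06",
  "author": { "@type": "Person", "name": "Jane Author", "url": "https://example.com/about" },
  "publisher": {
    "@type": "Organization",
    "name": "Example Publisher",
    "logo": { "@type": "ImageObject", "url": "https://example.com/logo.png" }
  },
  "mainEntityOfPage": "https://example.com/posts/answer-engines"
}
</script>
```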

Day 2 — llms.txt

Publish an llms.txt file at your site root. The convention, documented at llmstxt.org, is a simple Markdown file that gives LLMs a structured map of your site: top-level sections, key pages, and links. It is a low-effort addition and removes any excuse a retrieval system might have for mis-summarizing your site. Pair it with a more detailed llms-full.txt if your site has a substantial number of resources worth indexing.
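
For orientation, a minimal llms.txt following the llmstxt.org convention looks something like this: an H1 with the site name, a one-line blockquote summary, then sections of annotated links. Everything here is placeholder content:

```markdown
# Example Publisher

> Independent coverage of search, retrieval, and answer-engine optimization.

## Guides

- [AEO starter guide](https://example.com/guides/aeo): How to structure pages for answer engines
- [Schema.org walkthrough](https://example.com/guides/schema): JSON-LD for content sites

## About

- [Who we are](https://example.com/about): Masthead and editorial policy
```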

Day 3 — FAQ Markup on Your High-Intent Pages

Identify the five to ten pages on your site that target high-intent informational queries. Add an FAQ section to each with three to five carefully written question-and-answer pairs, marked up with FAQPage schema. The questions should mirror how a real user would phrase a search. The answers should be short, factual, and self-contained. This is one of the most direct ways to surface in answer engines because the structure exactly matches what they want to extract.
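
For reference, a single question-and-answer pair marked up as FAQPage looks like this; add one Question object per pair, with the answer kept short enough to be lifted whole:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is llms.txt?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "llms.txt is a plain Markdown file served at a site's root that gives large language models a structured map of the site's key pages. The convention is documented at llmstxt.org."
      }
    }
  ]
}
</script>
```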

Day 4 — Audit and Strengthen Primary-Source Linking

Walk through your top-traffic posts and audit the links. For every claim that depends on an external authority, link to the primary source rather than to a secondary one. If you cite a study, link to the published paper, not to a news article about the paper. If you reference a regulation, link to the regulator's official text. This both improves the quality of your content and signals to retrieval systems that you are operating closer to ground truth.

Day 5 — Fix Your robots.txt and Wire In IndexNow

Decide your stance on AI crawlers and reflect it explicitly in robots.txt. Allow the bots you want to be cited by, disallow the rest. Then implement IndexNow, the open submission protocol jointly maintained by Microsoft Bing and Yandex (documentation at indexnow.org). When you publish or update content, ping the IndexNow endpoint so participating engines learn about the change immediately rather than on their normal crawl schedule. Many publishers have observed that AI-powered surfaces consume IndexNow signals as well, though the public confirmation is uneven.
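
To make the robots.txt side concrete, here is one defensible stance: allow the crawlers behind the engines you want citations from, and block the rest. The user-agent tokens below are the ones the vendors publish today; verify them against each vendor's current documentation before shipping:

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Disallow: /
```

And here is a minimal IndexNow ping, assuming the documented api.indexnow.org endpoint and a key file you have already placed at your site root per the protocol. Hook it into your publish pipeline so it fires on every new or updated URL:

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"
KEY = "your-indexnow-key"  # the key you generated and hosted at https://<host>/<KEY>.txt


def ping_indexnow(host: str, urls: list[str]) -> int:
    """Submit freshly published or updated URLs to IndexNow."""
    payload = {
        "host": host,
        "key": KEY,
        "keyLocation": f"https://{host}/{KEY}.txt",
        "urlList": urls,
    }
    req = urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 200 or 202 means the submission was accepted


if __name__ == "__main__":
    print(ping_indexnow("example.com", ["https://example.com/posts/new-post"]))
```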

How Reliable Is Current AEO Advice?

This is the part of the post where I want to be especially careful. A lot of what is written about AEO right now is recycled SEO advice with the acronym swapped, written by people who have not actually tested whether their recommendations move citation share inside ChatGPT or Perplexity. The honest answer is that the field is genuinely uncertain on several axes.

We do not yet have public, audited data on what fraction of citations a given engine assigns by domain authority versus by content match versus by recency. We do not know how stable these weights are over time, or how they vary by query intent. Engines change their retrieval pipelines without notice, and those changes can move citation patterns substantially in a quarter. Any publisher running AEO experiments today should expect their conclusions to have a half-life of about six to twelve months.

What is unlikely to change in the next twelve to twenty-four months: schema.org coverage, primary-source linking, FAQ markup, and clean structured headings will continue to be net positive. These are aligned with how retrieval-augmented systems work in general, not just with the current implementations. The opposite end of the spectrum: clever prompt-engineering of your own content to "manipulate" how an LLM summarizes it is a fragile, short-half-life game and should not be the foundation of your strategy.

Hold your strategy loosely. Build the infrastructure. Watch which of your pages actually get cited and study them. Do not over-fit to any single engine.

The Five-Bullet Checklist

  • Add JSON-LD schema (Organization, Article, FAQPage, BreadcrumbList) to every relevant page and validate it.
  • Publish llms.txt and llms-full.txt at your site root following the llmstxt.org convention.
  • Add FAQ sections with FAQPage markup to your highest-intent pages.
  • Audit your links so that every authoritative claim points to a primary source, not a summary.
  • Decide your AI-bot policy in robots.txt explicitly, then implement IndexNow so updates are surfaced immediately.

None of this is exciting. All of it is durable. In a field where the specific tactics are likely to keep shifting, durable beats exciting every time.
