GEO: SEO for AI Answer Engines

For two decades, the deal was simple: rank on page one, win the click. Search was a list of ten blue links, and SEO was the craft of climbing that list. That deal is being rewritten in real time.

Open Google today and a query like “how do I rotate a JWT signing key” often returns an AI Overview — a synthesized paragraph at the top of the page, stitched together from several sources, with small citation chips beside it. Ask the same thing in ChatGPT (with browsing), Perplexity, or Gemini, and you get a direct prose answer with footnotes. In every case, the user gets their answer without visiting your site. The blue link still exists, but it has been demoted below the machine’s summary.

This is the shift from retrieval (here are pages, you go read them) to synthesis (here is the answer, sources in the footnotes). The discipline of optimizing for that new surface has a few names — GEO (Generative Engine Optimization) and AEO (Answer Engine Optimization) being the most common. They describe the same goal: not just ranking, but being the source the model quotes and links.

The good news for anyone who already does real SEO: GEO is not a new religion. It is an extra layer on top of the foundations you already know. The bad news: a lot of pages that rank fine on classic SERPs are nearly invisible to answer engines, because they were written for human skimmers and crawlers — not for extraction by a language model. This guide is about closing that gap.

How AI answer engines pick sources

Before optimizing, understand the pipeline. An answer engine does not “read the web” at query time. It works in roughly four stages, and each one is a filter you must pass.

Stage	What happens	What it means for you
1. Corpus	The model draws on indexed/crawled content — Google’s index for AI Overviews, Bing’s index + live fetch for ChatGPT, Perplexity’s own crawl.	If you are not indexed (or not crawlable), you cannot be cited. Period.
2. Retrieval	For a given query the engine retrieves a candidate set of passages, usually via the underlying search index plus vector similarity.	Classic relevance signals still gate entry. Ranking is the price of admission.
3. Synthesis	The LLM reads the candidate passages and composes an answer, preferring statements that are clear, self-contained, and corroborated.	Extractable, unambiguous sentences get pulled in. Vague, throat-clearing prose gets skipped.
4. Attribution	The engine attaches citations to the sources whose passages most directly support the generated claims.	The page that stated the fact cleanly gets the link — not necessarily the one that ranked #1.

A few patterns fall out of this that are worth internalizing:

They cite what they can already reach. Answer engines overwhelmingly source from content that is already indexed and crawlable. There is no separate “GEO index” — your classic technical SEO is your GEO foundation.
They prefer clearly stated, extractable facts. A model assembling an answer favors a sentence it can lift with minimal rewriting: “GPTBot respects robots.txt directives.” It will pass over “When it comes to crawler etiquette, there are a number of considerations one might weigh…”
They reward consensus and corroboration. If five reputable sources state the same fact and yours is one of them, you reinforce the model’s confidence and raise your odds of being the cited representative. Contrarian-but-correct content is harder to surface because it lacks corroboration.
They lean on recognizable, well-defined entities. Models reason over entities (people, products, libraries, concepts). Content that names entities precisely and links them to authoritative references is easier to ground and attribute.

🧑‍💻 Developer view: think of the synthesis step like a RAG pipeline you’d build yourself. You chunk documents, embed them, retrieve top-k passages, and stuff them into a prompt. Now imagine your page is one of those chunks. Would the model be able to answer the user’s question from your chunk alone, without the surrounding 2,000 words of context? If not, you won’t get extracted. Write so that any single section survives being ripped out of context.
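To make the RAG analogy concrete, here is a toy retrieval sketch. It assumes a page split on `## ` headings and uses plain word-count cosine similarity as a stand-in for real embeddings; the helper names (`chunk_by_heading`, `top_k`) and the sample page are made up for illustration, and no real answer engine is claimed to work exactly this way.

```python
import math
import re
from collections import Counter


def chunk_by_heading(markdown: str) -> list[str]:
    """Split a Markdown document into chunks, one per '## ' section."""
    parts = re.split(r"(?m)^(?=## )", markdown)
    return [p.strip() for p in parts if p.strip()]


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9.]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


# Hypothetical two-section page.
PAGE = """\
## Does GPTBot respect robots.txt?
Yes. GPTBot, OpenAI's web crawler, obeys robots.txt rules.

## What is llms.txt?
llms.txt is a Markdown file at the domain root that maps key content for LLMs.
"""

chunks = chunk_by_heading(PAGE)
best = top_k("does gptbot respect robots.txt", chunks, k=1)[0]
print(best)
```

The retrieved chunk carries both the question and its answer, so it still makes sense once the rest of the page is gone. A section that opened with "As mentioned above, it does" would be retrieved just as easily and be useless to the model.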

Writing for extraction

Classic SEO copywriting optimizes for a human who scans, and a crawler that weighs keywords. Extraction-optimized writing adds a third reader: a model that needs to lift a clean, correct, self-contained answer. Here is how to write for all three at once.

Use question-shaped headings, then answer immediately. Real queries are questions. Mirror them in your H2/H3, then put the answer in the first sentence below the heading — not three paragraphs down.

## Does GPTBot respect robots.txt?

Yes. GPTBot, OpenAI's web crawler, obeys robots.txt rules. To block it,
add a `User-agent: GPTBot` group with `Disallow: /` to your robots.txt.
After that one-line answer, expand with nuance, edge cases, and examples.

This “answer-first, elaborate-after” shape (sometimes called the inverted pyramid) gives the model a self-contained passage at the top and gives skimming humans the payoff without scrolling.

Write atomic fact sentences. A fact sentence states one thing, names its subject explicitly, and avoids pronouns that point outside the sentence. Compare:

Weak (un-extractable)	Strong (extractable)
“It was released a couple years back and is pretty widely used now.”	“Astro 1.0 was released in August 2022 and is used by over 100,000 websites.”
“There are several of them you should probably block.”	“Three AI user-agent tokens commonly set in robots.txt are GPTBot, PerplexityBot, and Google-Extended.”
“This is generally considered best practice.”	“Server-side rendering is recommended because most AI crawlers do not execute JavaScript.”

The strong versions can be quoted verbatim and still make sense in isolation. That is the whole game.
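If you want a rough way to spot weak sentences in a draft, a small lint pass can help. The sketch below flags sentences that open with a pronoun-like word; the opener list and the naive sentence splitter are assumptions chosen for illustration, so treat the output as a prompt for human review, not a verdict.

```python
import re

# Openers that usually point outside the sentence. This list is an assumption
# for illustration, not an exhaustive or standard set.
DANGLING_OPENERS = {"it", "this", "that", "these", "those", "they", "there"}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_dangling(text: str) -> list[str]:
    """Return sentences whose first word is a pronoun-like opener."""
    flagged = []
    for sentence in split_sentences(text):
        first_word = re.match(r"[A-Za-z']+", sentence)
        if first_word and first_word.group(0).lower() in DANGLING_OPENERS:
            flagged.append(sentence)
    return flagged


draft = (
    "It was released a couple years back and is pretty widely used now. "
    "Astro 1.0 was released in August 2022. "
    "This is generally considered best practice."
)
for sentence in flag_dangling(draft):
    print("review:", sentence)
```

A check like this only catches the opening word. A sentence can still lean on outside context in other ways, so it complements reading each sentence in isolation rather than replacing it.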

Lead with structure: lists, tables, and steps. Models latch onto structured data because the boundaries between items are explicit. A comparison table, a numbered procedure, or a definition list is far more “liftable” than the same information buried in prose. If you can express something as a table, do it.

Name entities precisely and consistently. Say “the robots.txt Disallow directive”, not “that setting”. Use canonical product names, version numbers, and dates. Where it helps grounding, reinforce entities with structured data — Article, FAQPage, HowTo, and Organization schema give the engine machine-readable confirmation of what your page is about and who stands behind it. Our schema generator can scaffold these.

Add a short, literal summary near the top. A 2–3 sentence TL;DR that directly answers the page’s core question gives the model a pre-chunked passage to grab — and gives humans a reason to stay.

💡 Tip: write your FAQPage schema and your on-page Q&A from the same source of truth. Pull the actual questions from Search Console’s query report, People Also Ask, and AI Overview follow-ups. You are reverse-engineering the exact phrasing real users type into answer engines.
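One way to keep the schema and the visible Q&A in sync is to generate both from a single list of question and answer pairs. The sketch below does that with the standard library; the `FAQ` list and the function names are hypothetical, while the JSON-LD shape follows the schema.org FAQPage, Question, and Answer types.

```python
import json

# Hypothetical Q&A pairs; in practice these would come from your own research.
FAQ = [
    ("Does GPTBot respect robots.txt?",
     "Yes. GPTBot, OpenAI's web crawler, obeys robots.txt rules."),
    ("What is llms.txt?",
     "llms.txt is a Markdown file at the domain root that maps key content for LLMs."),
]


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


def faq_markdown(pairs: list[tuple[str, str]]) -> str:
    """Render the same pairs as answer-first on-page sections."""
    return "\n\n".join(f"## {question}\n\n{answer}" for question, answer in pairs)


print(faq_markdown(FAQ))
print(faq_jsonld(FAQ))
```

Because both outputs come from `FAQ`, the markup can never claim an answer the page does not actually show.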

llms.txt and crawlability

The single biggest GEO mistake is invisible: your content renders fine in a browser but is empty to a crawler that doesn’t run JavaScript.

Most AI crawlers do not execute JS. GPTBot, PerplexityBot, and similar AI crawlers fetch your HTML and read what’s in the response body. Googlebot is the notable exception, since it does render JavaScript. If your content is hydrated client-side by a SPA, a non-rendering bot sees an empty <div id="root"> and moves on. This is why SSR/SSG wins, and why a static framework like the one powering this site (Astro, which ships fully-rendered HTML by default) is structurally well-suited to GEO. Verify what crawlers actually see:

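# What a JS-less crawler sees. The output should contain your real content, not an empty shell.
curl -sA "GPTBot" https://example.com/your-page/ | grep -i "your headline text"

# Raw HTML size in bytes. Compare it with the rendered DOM, for example the value of
# document.documentElement.outerHTML.length in your browser's DevTools console
# (characters, so only roughly comparable). A huge gap means content is JS-injected.
curl -s https://example.com/your-page/ | wc -c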

If the curl output doesn’t contain your actual copy, no amount of clever writing matters — fix rendering first. See build for the SSR/SSG decision.
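The same check can be scripted if you want to run it across many pages. This is a minimal sketch that works on HTML you have already fetched: it strips tags, ignores `<script>` and `<style>` bodies, and looks for a phrase in what remains. The class and function names and the two sample documents are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html: str) -> str:
    """Return the page text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def content_in_raw_html(html: str, expected_phrase: str) -> bool:
    """True if the phrase appears in the text a non-JS crawler would read."""
    return expected_phrase.lower() in visible_text(html).lower()


# Hypothetical responses: a client-rendered shell and a server-rendered page.
spa_shell = '<html><body><div id="root"></div><script>render("GEO guide")</script></body></html>'
ssr_page = "<html><body><h1>GEO guide</h1><p>Real content.</p></body></html>"

print(content_in_raw_html(spa_shell, "GEO guide"))
print(content_in_raw_html(ssr_page, "GEO guide"))
```

Skipping script bodies matters here: in the shell example the headline exists only as a JavaScript string, so a plain substring search over the raw response would report a false positive.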

llms.txt is an emerging convention, not a ranking signal. Proposed in 2024, llms.txt is a Markdown file at your domain root that offers LLMs a curated, clean map of your most important content — think of it as a sitemap.xml written for language models instead of crawlers. A minimal one:

# Your Site Name

> One-line description of what this site is and who it's for.

## Docs
- [Getting started](https://example.com/docs/start): Install and first build.
- [API reference](https://example.com/docs/api): Full endpoint list.

## Guides
- [GEO for AI search](https://example.com/en/guides/geo-ai-search/): This guide.

Be clear-eyed about it: as of mid-2026, llms.txt is not confirmed to be consumed by Google, OpenAI, or Perplexity for ranking or citation. It costs almost nothing to publish and may help future tooling, but treat it as a low-priority nice-to-have, not a lever. The real work is clean HTML and extractable content.
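If you do publish one, generating it from the same data that drives your navigation keeps it from going stale. Here is a small sketch that renders the layout shown earlier (H1 title, blockquote summary, H2 sections of links); `build_llms_txt` and its argument shape are assumptions for this example, not part of the llms.txt proposal.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an llms.txt body: H1 title, blockquote summary, H2 link lists.

    Each section maps a heading to (link text, URL, short note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}"]
        lines += [f"- [{name}]({url}): {note}" for name, url, note in links]
    return "\n".join(lines) + "\n"


text = build_llms_txt(
    "Your Site Name",
    "One-line description of what this site is and who it's for.",
    {
        "Docs": [
            ("Getting started", "https://example.com/docs/start", "Install and first build."),
            ("API reference", "https://example.com/docs/api", "Full endpoint list."),
        ],
        "Guides": [
            ("GEO for AI search", "https://example.com/en/guides/geo-ai-search/", "This guide."),
        ],
    },
)
print(text)
```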

Decide deliberately whether to allow AI crawlers. You control AI access in robots.txt per user-agent. The major ones to know:

User-agent	Operator	What allowing it does
`GPTBot`	OpenAI	Lets OpenAI crawl your content for possible use in training its models.
`OAI-SearchBot`	OpenAI	Powers ChatGPT search results and citations.
`PerplexityBot`	Perplexity	Lets Perplexity index and cite your pages.
`Google-Extended`	Google	Gemini/Vertex training opt-in. Does not affect AI Overviews or Search.
`ClaudeBot`	Anthropic	Crawling for Claude.

# robots.txt (example): allow the OpenAI and Perplexity crawlers, opt out of Google-Extended
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Disallow: /

⚠️ Note: blocking Google-Extended does not remove you from Google Search or AI Overviews — those run on the standard Googlebot. There is currently no separate opt-out for AI Overviews short of blocking Googlebot entirely, which would delist you. Choose your trade-offs knowingly. Our robots & sitemap tool helps you assemble and validate these rules.
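Before deploying a rule set like this, you can sanity-check it locally with Python's standard `urllib.robotparser`. The sketch below parses the example rules and reports which user-agent tokens may fetch a given URL; `access_report` and the test URL are hypothetical, and Python's matching logic is only an approximation of how each real crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Disallow: /
"""


def access_report(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent token to whether robots.txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = access_report(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "PerplexityBot", "Google-Extended"],
    "https://example.com/guides/geo/",
)
for agent, allowed in report.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Running this in CI against your real robots.txt is a cheap guard against a stray `Disallow: /` silently removing you from an engine you meant to allow.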

Measuring AI traffic

You cannot manage what you cannot see, and AI surfaces are deliberately murky. Here is what you can actually measure today.

Google Search Console — watch for the AI Overviews footprint. Google does not give you a clean “AI Overviews” filter, but the signal shows up indirectly:

Impressions up, CTR down. When your page is cited inside an AI Overview, you accrue an impression but the user often gets their answer without clicking. A rising-impressions / falling-CTR pattern on informational queries is the classic GEO fingerprint.
Position volatility on question queries. AI Overviews reshuffle what “position 1” even means. Track informational vs. transactional queries separately so the noise doesn’t hide your transactional wins.
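The impressions-up, CTR-down pattern is easy to screen for once you have two periods of query data exported from Search Console. This sketch compares the periods and lists queries matching the fingerprint; the `QueryStats` shape and all numbers are hypothetical, and a match is a hint worth investigating, not proof that an AI Overview is involved.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    query: str
    impressions: int
    clicks: int

    @property
    def ctr(self) -> float:
        """Click-through rate, or 0.0 when there were no impressions."""
        return self.clicks / self.impressions if self.impressions else 0.0


def geo_fingerprint(before: list[QueryStats], after: list[QueryStats]) -> list[str]:
    """Queries whose impressions rose while CTR fell between two periods."""
    previous = {row.query: row for row in before}
    flagged = []
    for row in after:
        old = previous.get(row.query)
        if old and row.impressions > old.impressions and row.ctr < old.ctr:
            flagged.append(row.query)
    return flagged


# Hypothetical numbers, for illustration only.
before = [
    QueryStats("how to rotate a jwt signing key", 1000, 80),
    QueryStats("buy jwt library license", 200, 30),
]
after = [
    QueryStats("how to rotate a jwt signing key", 1600, 64),
    QueryStats("buy jwt library license", 210, 33),
]

print(geo_fingerprint(before, after))
```

In this made-up data the informational query is flagged (impressions up, CTR down from 8% to 4%), while the transactional query is not, which is the separation the second point argues for.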

Referral traffic — find the answer engines in your analytics. When a user does click through from an AI answer, it shows up as a referral. Watch these hostnames:

chatgpt.com        # clicks from ChatGPT citations
perplexity.ai      # clicks from Perplexity answers
gemini.google.com  # clicks from Gemini
copilot.microsoft.com  # clicks from Microsoft Copilot

In GA4, build an exploration filtered to Session source matching those hosts, or query the raw events:

-- GA4 BigQuery export: users whose first-touch source was an AI answer engine.
-- Note: traffic_source is user-scoped (first acquisition), not session-scoped.
SELECT
  traffic_source.source AS source,
  COUNT(DISTINCT user_pseudo_id) AS users,
  COUNT(*) AS events
FROM `your_project.analytics_XXXXXX.events_*`
WHERE REGEXP_CONTAINS(
        traffic_source.source,
        r'chatgpt\.com|perplexity\.ai|gemini\.google\.com|copilot\.microsoft\.com')
GROUP BY source
ORDER BY users DESC;
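If you work from server logs or your own event pipeline rather than GA4, the same classification can be done on the raw referrer. The sketch below maps a referrer URL to an answer engine using the hostnames listed earlier; the `AI_REFERRERS` labels and `classify_referrer` are names made up for this example.

```python
from typing import Optional
from urllib.parse import urlparse

# Hostnames from the referral list in this section; labels are informal.
AI_REFERRERS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}


def classify_referrer(referrer: str) -> Optional[str]:
    """Return the answer-engine label for a referrer URL, or None."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, label in AI_REFERRERS.items():
        # Match the domain itself or any subdomain such as www.perplexity.ai.
        if host == domain or host.endswith("." + domain):
            return label
    return None


print(classify_referrer("https://www.perplexity.ai/search?q=geo"))
print(classify_referrer("https://chatgpt.com/"))
print(classify_referrer("https://www.google.com/search?q=geo"))
```

Matching on the parsed hostname is a little stricter than a substring regex: a lookalike host such as `notchatgpt.com` is not counted.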

Spot-check the engines directly. Low-tech but effective: periodically ask your target questions in ChatGPT, Perplexity, and Google, and record whether you’re cited, which competitors are, and what phrasing the model used. Keep a tracking sheet of query → cited? → competitor cited → notes. This qualitative audit catches things no dashboard will. See analytics for setting up the measurement plumbing.
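A spreadsheet is enough for this, but if you would rather keep the log in version control, a few lines of Python give you a consistent CSV and a running citation rate. The `SpotCheck` fields below are one possible layout of the query, cited, competitor, and notes columns, and the sample rows are invented.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SpotCheck:
    checked_on: str        # ISO date of the manual check
    engine: str            # e.g. "ChatGPT", "Perplexity", "Google"
    query: str
    cited: bool            # was your page cited?
    competitor_cited: str  # who else was cited, if anyone
    notes: str


def to_csv(rows: list[SpotCheck]) -> str:
    """Serialize spot checks as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(SpotCheck)],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(asdict(row) for row in rows)
    return buffer.getvalue()


def citation_rate(rows: list[SpotCheck]) -> float:
    """Share of checks in which your page was cited."""
    return sum(row.cited for row in rows) / len(rows) if rows else 0.0


# Hypothetical entries, for illustration only.
log = [
    SpotCheck("2025-01-15", "Perplexity", "does gptbot respect robots.txt",
              True, "", "quoted our first sentence"),
    SpotCheck("2025-01-15", "ChatGPT", "does gptbot respect robots.txt",
              False, "example.org", "no citation for us"),
]
print(to_csv(log))
print(citation_rate(log))
```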

💡 Tip: instrument referrals before you start optimizing, so you have a baseline. AI referral volume is still small for most sites — single-digit percentages — but it is high-intent and growing. Measuring it now means you’ll recognize the inflection when it comes.

GEO and classic SEO

It is tempting to treat GEO as a successor that makes SEO obsolete. It isn’t. GEO is an additive layer, not a replacement — and the layer underneath it is the same SEO you already practice.

Consider what every stage of the answer-engine pipeline depends on:

Being in the corpus requires indexability — robots, sitemaps, canonical tags, crawl budget. Pure technical SEO.
Surviving retrieval requires topical relevance and authority — the same signals that earn classic rankings.
Getting synthesized and cited requires trust and corroboration — which is E-E-A-T, links, and a credible entity, the things good SEO has always built.

In other words, an answer engine cannot cite a page it cannot crawl, will not retrieve a page that lacks authority, and will not trust a page with no track record. Weak SEO produces weak GEO by construction. The order of operations is unchanged:

Technical health first. Crawlable, fast, server-rendered HTML. Without this, nothing downstream works.
Genuinely useful content second. Original, accurate, well-structured — earning the right to be a source.
Extraction polish third. Question headings, atomic facts, tables, summaries — the GEO-specific layer that turns a rankable page into a citable one.

What GEO changes is the definition of success. The win is no longer only “rank #1 and capture the click.” It is also “be the sentence the model quotes, with your name on the footnote.” Sometimes that means fewer clicks but more influence — your brand becomes the answer even when no one visits. Adjust your goals and your measurement accordingly, but don’t tear up the foundation. The foundation is what makes the new layer possible.

Key takeaways

✅ GEO is a layer on top of SEO, not a replacement — an answer engine can only cite content it can already crawl, retrieve, and trust, so technical health and quality content remain the foundation.
✅ Write for extraction: question-shaped headings with answer-first sentences, atomic self-contained facts, and structured tables/lists the model can lift verbatim.
✅ Render server-side — most AI crawlers don’t execute JavaScript, so verify with curl -A "GPTBot" that your real content is in the raw HTML; SSG/SSR frameworks like Astro have the edge.
✅ Set robots.txt deliberately for GPTBot, OAI-SearchBot, PerplexityBot, and Google-Extended — and remember blocking Google-Extended does not remove you from AI Overviews.
✅ Treat llms.txt as optional — cheap to publish, but not yet a confirmed ranking or citation signal; spend your effort on clean HTML and clear facts.
✅ Measure the new surface: watch GSC for the rising-impressions / falling-CTR fingerprint, track chatgpt.com and perplexity.ai referrals in GA4, and spot-check citations by hand.