Layer 0: Technical Access

Robots.txt for AI Crawlers:
The Complete Configuration Guide

The number one AEO failure point is invisible, takes 5 minutes to fix, and most businesses don't know they have it.

The Problem No One Checks

Every website has a robots.txt file. It tells crawlers — search engines, AI systems, scrapers — what they're allowed to access. Most businesses never look at theirs. And most default configurations quietly block the exact AI crawlers that power ChatGPT, Claude, Perplexity, and Google's AI Overviews.

This means your content could be exceptional, your schema perfect, your authority signals strong — and AI systems still can't see any of it. They're being told not to look.

At Collins Tech, we call this Layer 0: Technical Access. It's the first thing we check on every engagement because it's the most common failure mode and the easiest to fix. If Layer 0 fails, Layers 1 through 4 are irrelevant.

The AI Crawlers You Need to Allow

As of 2026, there are nine major AI crawlers that determine whether your business appears in AI-generated answers. Each has its own User-agent identifier. If any of them are blocked in your robots.txt, that AI system cannot index your content.

The Nine Crawlers

GPTBot — OpenAI's primary crawler. Powers ChatGPT and the OpenAI API. If this is blocked, your business doesn't exist to ChatGPT.

ChatGPT-User — ChatGPT's browsing mode crawler. Used when users ask ChatGPT to search the web in real time.

Google-Extended — Google's AI training crawler. Separate from Googlebot (which handles traditional search). Blocking this doesn't affect your Google search ranking, but it prevents your content from appearing in Google's AI Overviews and Gemini.

ClaudeBot — Anthropic's crawler for Claude. Used for training data and real-time retrieval.

anthropic-ai — Anthropic's secondary crawler identifier.

PerplexityBot — Perplexity's crawler. Perplexity is the fastest-growing AI search engine and a primary citation source for business queries.

Amazonbot — Amazon's AI crawler. Powers Alexa responses and Amazon's product recommendation systems.

CCBot — Common Crawl's bot. Its dataset is used by multiple AI systems for training. Blocking this reduces your presence across the entire AI ecosystem.

Bytespider — ByteDance's crawler. Powers TikTok's search features and AI recommendations.

The Correct Configuration

Here's what a properly configured robots.txt looks like for full AI visibility. Every AI crawler gets its own explicit Allow directive — no ambiguity, no inheritance assumptions.

User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: CCBot
Allow: /

User-agent: Bytespider
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Replace yourdomain.com with your actual domain. The Sitemap directive tells all crawlers where to find your complete URL list.

Common Misconfigurations

The Default WordPress Block

Many WordPress sites end up with robots.txt rules that block AI crawlers — often added by security or SEO plugins rather than by the site owner. If you've never edited your robots.txt, you may not know what's in it. Check yours right now by going to yourdomain.com/robots.txt in a browser.

The Blanket Disallow

Some sites use Disallow: / for all user agents as a security measure. This blocks everything — including AI systems. If you need to block specific bots (like aggressive scrapers), do it with targeted User-agent directives, not a blanket rule.
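If you genuinely need to keep a specific bot out, name it explicitly and leave the default open. A sketch of the pattern — the scraper name here is purely illustrative:

User-agent: ExampleScraperBot
Disallow: /

User-agent: *
Allow: /

The named bot gets its own restrictive group; everything else, including the AI crawlers, inherits the open wildcard rule.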

The Missing Sitemap

A robots.txt without a Sitemap directive forces AI crawlers to discover your pages through links alone. Adding the Sitemap directive tells them exactly where every page is — faster indexing, more complete coverage.

The "Googlebot is Enough" Assumption

Allowing Googlebot does not allow Google-Extended. They're separate crawlers with separate permissions. Your site can rank #1 on Google search and still be invisible to Google's AI Overviews if Google-Extended is blocked.
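To illustrate the trap, a robots.txt like this keeps traditional Google rankings intact while shutting AI Overviews and Gemini out entirely, because Google-Extended isn't named and so falls under the wildcard block:

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /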

How to Check Yours

Open a browser and navigate to your domain with /robots.txt appended. For example: yourdomain.com/robots.txt. You'll see a plain text file. Look for any of the nine crawler names listed above. If they're not mentioned, they inherit from the wildcard (*) rule. If the wildcard says Disallow: /, they're blocked.

If you don't see a robots.txt file at all, crawlers treat the resulting 404 as permission to crawl everything — so AI crawlers can access your site, but you're missing the opportunity to include a Sitemap directive.
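You can also script the check with Python's standard-library urllib.robotparser, which applies the same user-agent matching rules crawlers do. A minimal sketch — the sample robots.txt and example.com URL are placeholders; in practice you'd fetch your own file:

```python
from urllib import robotparser

# The nine AI crawler user agents from the list above.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "Google-Extended", "ClaudeBot",
    "anthropic-ai", "PerplexityBot", "Amazonbot", "CCBot", "Bytespider",
]

# Sample robots.txt for illustration: everything allowed except Bytespider.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: Bytespider
Disallow: /
"""

def check_crawlers(robots_text, url="https://example.com/"):
    """Return {crawler: allowed} for each AI crawler against robots_text."""
    rp = robotparser.RobotFileParser()
    rp.modified()  # mark the rules as "read" so can_fetch() evaluates them
    rp.parse(robots_text.splitlines())
    return {bot: rp.can_fetch(bot, url) for bot in AI_CRAWLERS}

if __name__ == "__main__":
    for bot, allowed in check_crawlers(SAMPLE_ROBOTS_TXT).items():
        print(f"{bot:16} {'allowed' if allowed else 'BLOCKED'}")
```

Any crawler not explicitly named inherits the wildcard group, which is why the sample reports every bot as allowed except the one singled out by its own Disallow rule.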

For a more comprehensive check, use our free AEO Visibility Analyzer to scan your site's complete AEO infrastructure — including robots.txt, schema markup, and content structure.

Beyond Robots.txt: The llms.txt File

Robots.txt controls access. But there's a newer standard that goes further: llms.txt. This is a structured plain-text file at the root of your site that gives AI systems a complete briefing on your business — who you are, what you do, what your expertise is, and what content is available.

Think of robots.txt as the door and llms.txt as the welcome packet. The door lets them in; the packet tells them what they're looking at.

Collins Tech maintains both. Our llms.txt includes organizational overview, methodology description, proprietary concepts, publication list, and service area — all formatted for machine parsing. It's 3,700+ bytes of structured intelligence that AI systems can consume in milliseconds.
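The llms.txt format is still an emerging proposal, but the draft convention at llmstxt.org uses simple Markdown: an H1 with the organization name, a blockquote summary, then H2 sections linking to key content. A placeholder sketch (not Collins Tech's actual file — all names and URLs are illustrative):

# Example Company
> One-sentence summary of who you are, what you do, and who you serve.

## Services
- [AEO Audits](https://example.com/services/aeo-audits): Full-site answer engine optimization assessments

## Articles
- [Robots.txt for AI Crawlers](https://example.com/articles/robots-txt): Configuration guide for AI crawler access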

The Bottom Line

Robots.txt is the single most important file on your website for AI visibility. It takes 5 minutes to configure correctly. It costs nothing. And if it's wrong, nothing else you do for AEO matters.

Check yours today. If you need help interpreting what you see, reach out.

Check Your AEO Visibility →