robots.txt and LLM crawler access

The 16 LLM bots you need to audit access for — including the ones that most teams miss. The right access pattern, the common mistakes, and when allowing access is not enough.

robots.txt is the primary mechanism for telling automated agents which parts of your website they may access. For LLM citation optimization, it determines which AI training and retrieval crawlers can read your content. A misconfigured robots.txt file can block all AI crawler access, even if every other citation signal is optimized.

The problem: most robots.txt files were configured years ago, before LLM crawlers existed. Teams added broad Disallow rules for /api, /admin, and /blog paths that now inadvertently block AI crawlers from accessing high-value citation content.

The 16 LLM crawlers to audit

These are the bots you need to explicitly allow or carefully evaluate in your robots.txt:

GPTBot (OpenAI) — used for ChatGPT training data. Blocking it keeps your content out of future OpenAI training crawls.

OAI-SearchBot (OpenAI) — used for ChatGPT real-time search, separate from GPTBot.

Claude-Web / ClaudeBot (Anthropic) — used for Claude training and real-time retrieval.

anthropic-ai (Anthropic) — secondary Anthropic crawler for retrieval and index verification.

PerplexityBot (Perplexity AI) — used for Perplexity real-time answer generation.

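Google-Extended (Google) — a robots.txt control token rather than a separate crawler; it governs whether content fetched by Google's existing crawlers may be used for Gemini training, independently of Googlebot search indexing.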

Gemini-Crawler (Google) — used for Gemini real-time retrieval.

CCBot (Common Crawl) — pretraining substrate for most open-source LLMs.

Diffbot — used by several enterprise LLM systems.

YouBot (You.com) — used for You.com AI search.

Amazonbot — used for Amazon's Alexa and AI services.

Bytespider (ByteDance) — used for TikTok and related AI products.

meta-externalagent (Meta) — used for Meta AI and Llama training data collection.

Applebot-Extended (Apple) — used for Apple Intelligence and Siri knowledge enrichment.

cohere-ai (Cohere) — used for Cohere's enterprise LLM systems.

Baiduspider (Baidu) — used for ERNIE and Baidu AI products (relevant for global brands).
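One way to audit the tokens above is to parse your robots.txt and test each token against a few representative paths. The sketch below uses Python's standard `urllib.robotparser`; the `audit_robots` helper, the `LEGACY` sample file, and the sample paths are illustrative assumptions, not part of any tool named in this playbook. `urllib.robotparser` applies the first matching rule in a group, which may differ from a given crawler's own matching logic, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens as listed in this playbook. Claude-Web and ClaudeBot
# are checked separately, so 16 entries become 17 tokens.
LLM_AGENTS = [
    "GPTBot", "OAI-SearchBot", "Claude-Web", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "Gemini-Crawler", "CCBot", "Diffbot",
    "YouBot", "Amazonbot", "Bytespider", "meta-externalagent",
    "Applebot-Extended", "cohere-ai", "Baiduspider",
]


def audit_robots(robots_txt: str, paths: list[str]) -> dict[str, list[str]]:
    """Return {agent: [blocked paths]} for every agent blocked on any path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = {}
    for agent in LLM_AGENTS:
        denied = [p for p in paths if not parser.can_fetch(agent, p)]
        if denied:
            blocked[agent] = denied
    return blocked


# Hypothetical legacy file: broad rules written before LLM crawlers existed.
LEGACY = """\
User-agent: *
Disallow: /admin
Disallow: /blog

User-agent: GPTBot
Allow: /
"""

report = audit_robots(LEGACY, ["/", "/blog/launch-post"])
for agent, denied in report.items():
    print(f"{agent}: blocked on {denied}")
```

In this sample, every token except GPTBot falls back to the wildcard group and is reported as blocked on the /blog path.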

The recommended access pattern

For most brands, the right pattern is explicit allow for all LLM training and retrieval bots on your public marketing and content pages, with explicit disallow for authenticated paths (/account, /dashboard, /api/private) and user-generated content that may contain PII.

A minimal robots.txt that explicitly allows six of the major LLM crawlers (add a group for each additional bot you want to admit):

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: CCBot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
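The fuller pattern described in this section (public content allowed, authenticated paths disallowed) can be generated and checked programmatically. This is a minimal sketch under stated assumptions: `build_robots`, the three-agent list, and the `/pricing` test path are hypothetical, and the private paths are the ones named in this section. Disallow lines are written before `Allow: /` so that first-match parsers such as Python's `urllib.robotparser` and longest-match parsers reach the same answer.

```python
from urllib.robotparser import RobotFileParser

PRIVATE_PATHS = ["/account", "/dashboard", "/api/private"]


def build_robots(agents, private_paths, sitemap_url):
    """Build one group per agent: private paths blocked, the rest allowed."""
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        # Disallow first, so first-match and longest-match parsers agree.
        lines.extend(f"Disallow: {path}" for path in private_paths)
        lines.append("Allow: /")
        lines.append("")  # a blank line closes the group
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots(
    ["GPTBot", "ClaudeBot", "CCBot"],
    PRIVATE_PATHS,
    "https://yourdomain.com/sitemap.xml",
)

# Verify the generated file behaves as intended.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
public_ok = parser.can_fetch("GPTBot", "/pricing")
private_ok = parser.can_fetch("GPTBot", "/account/settings")
print(robots_txt)
```

Generating the file from a list keeps the bot groups consistent when a new crawler token needs to be added.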

Common mistakes

The most common mistake is a blanket Disallow: / under a wildcard User-agent: * with no separate group that allows specific bots. A crawler with no group of its own falls back to the wildcard rules, so this blocks every compliant crawler, including LLM training bots.
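The effect of the blanket rule, and of adding bot-specific groups, can be checked with Python's standard `urllib.robotparser`. In this sketch the `allowed` helper and both sample files are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Report whether `agent` may fetch `path` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The mistake: a wildcard group that disallows everything.
BLANKET = """\
User-agent: *
Disallow: /
"""

# One possible fix: a dedicated group for the bots you want to admit.
FIXED = BLANKET + """
User-agent: GPTBot
User-agent: CCBot
Allow: /
"""

print(allowed(BLANKET, "GPTBot"))       # blocked by the wildcard group
print(allowed(FIXED, "GPTBot"))         # admitted by its own group
print(allowed(FIXED, "PerplexityBot"))  # still falls back to the wildcard
```

Bots without their own group remain blocked after the fix, which is why each crawler you want to admit needs to be named.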

The second most common mistake is allowing GPTBot but blocking CCBot. Since Common Crawl is the training substrate for many open-source and non-OpenAI models, blocking CCBot creates a structural gap in training data coverage.

When allowing access is not enough

Allowing bot access only makes crawling possible; crawlers still prioritize pages with strong signals. An empty sitemap, slow page loads, or thin content will cause pages to be deprioritized even when access is technically allowed. Pair your robots.txt access grants with a complete, accurate sitemap and ensure your key content pages load quickly.
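Sitemap completeness is the easiest of these signals to check mechanically. The sketch below parses a sitemap with Python's standard `xml.etree.ElementTree` and reports which key pages it omits; the `sitemap_urls` and `missing_pages` helpers, the sample sitemap, and the page URLs are hypothetical. It does not measure page speed or content depth.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text
    ]


def missing_pages(sitemap_xml: str, key_pages: list[str]) -> list[str]:
    """Return the key pages that the sitemap does not list."""
    listed = set(sitemap_urls(sitemap_xml))
    return [page for page in key_pages if page not in listed]


# Hypothetical sitemap that omits a key content page.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/</loc></url>
  <url><loc>https://yourdomain.com/pricing</loc></url>
</urlset>
"""

gaps = missing_pages(
    SAMPLE,
    ["https://yourdomain.com/pricing", "https://yourdomain.com/blog/guide"],
)
print(gaps)
```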

Legal considerations

Some legal teams restrict LLM crawler access over concerns about copyright or training data rights. The tradeoff is measurable: brands that block LLM crawlers are systematically less cited. If your legal team has restricted access, the correct conversation is about partial access — allow crawlers on marketing pages and public content, while blocking authenticated and proprietary content. This preserves citation eligibility without exposing proprietary data.

What this means for citation strategy

Run a full robots.txt audit quarterly. Verify that all 16 LLM crawlers above are explicitly allowed on your public content paths. Pair correct bot access with Common Crawl coverage — allowing CCBot is necessary but not sufficient if your sitemap is missing or your pages load slowly. Use Veezow's free scan to check bot access status across the major LLM crawlers — it identifies blocks and missing permissions in the access score.
