robots.txt

A plain-text file at the root of your domain that tells crawlers which paths they may or may not fetch.

Definition

robots.txt is a standard file served at `yourdomain.com/robots.txt` that web crawlers check before requesting other pages. It uses `User-agent` and `Disallow`/`Allow` directives to specify which paths each crawler can access. The file is advisory rather than enforced: compliance is voluntary, though most major AI crawlers state that they honor it.
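To make the directive model concrete, here is a minimal sketch that evaluates a small robots.txt with Python's standard-library `urllib.robotparser`. The file contents, the `/private/` path, and the example URLs are hypothetical, and real crawlers may resolve edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the rules and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text from memory instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# GPTBot has its own group, so its Disallow rule applies to /private/.
gptbot_private = parser.can_fetch("GPTBot", "https://yourdomain.com/private/report.html")
# No rule in the GPTBot group matches /blog/, so the path is allowed.
gptbot_blog = parser.can_fetch("GPTBot", "https://yourdomain.com/blog/post.html")
# ClaudeBot has no group of its own and falls back to the `*` group.
claudebot_private = parser.can_fetch("ClaudeBot", "https://yourdomain.com/private/report.html")

print(gptbot_private, gptbot_blog, claudebot_private)
```

Each crawler is matched to one group by its `User-agent` token, and only the rules in that group decide whether a path may be fetched.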

Why it matters for AI visibility

A misconfigured or overly broad robots.txt is a common reason AI crawlers cannot access important pages. A single `Disallow: /` directive under `User-agent: *` blocks every compliant crawler that does not have its own, more specific `User-agent` group, and it blocks them from every page.
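One way to audit for this kind of blanket block is to test a page against each AI crawler token. The sketch below again uses Python's `urllib.robotparser`. The helper name `blocked_crawlers`, the sample file, and the `/pricing` URL are assumptions made for illustration. Note that a crawler with its own `User-agent` group follows that group instead of the `*` group.

```python
from urllib.robotparser import RobotFileParser

# AI crawler tokens named in this glossary entry.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]


def blocked_crawlers(robots_txt: str, url: str, agents=AI_CRAWLERS) -> list:
    """Return the agents that this robots.txt text forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical file: a blanket block plus one explicit exception for GPTBot.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

blocked = blocked_crawlers(BLANKET_BLOCK, "https://yourdomain.com/pricing")
print(blocked)
```

Running a check like this against the pages you most want cited shows quickly whether a broad rule is shutting out crawlers you meant to allow.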

GPTBot: OpenAI's web crawler that fetches content to train and update its models.

ClaudeBot: Anthropic's crawler used to collect content for training and grounding Claude models.

PerplexityBot: Perplexity AI's crawler that indexes content for real-time answer generation.

Google-Extended: Google's opt-out user-agent for AI product training, separate from regular search crawling.

CCBot: The Common Crawl bot that builds the publicly accessible web archive used to train many AI models.

XML Sitemap: A structured file listing the canonical URLs on your site so crawlers can discover them efficiently.

Checklist (robots.txt): A valid robots.txt file gives crawlers a predictable policy for what they may request.
