CCBot

The Common Crawl bot that builds the publicly accessible web archive used to train many AI models.

Definition

CCBot is the web crawler operated by Common Crawl, a non-profit that crawls the web and releases a free, open archive of web content. This archive is one of the most widely used training data sources for large language models: GPT-3 and the original LLaMA both drew on it, and many later models are reported to as well. Blocking CCBot in robots.txt excludes your content from future Common Crawl snapshots, but it does not remove pages already captured in earlier ones.
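As a rough illustration of what blocking looks like in practice, the sketch below uses Python's standard-library robots.txt parser to check whether a given set of rules would let CCBot fetch a page. The robots.txt content, the `is_allowed` helper, and the example.com URL are all hypothetical and exist only for this example; `CCBot` is the user-agent token the rules are matched against.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks CCBot site-wide, leaves other crawlers alone.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL only; substitute a page from your own site.
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/pricing")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/pricing")

print("CCBot allowed:", ccbot_allowed)
print("Other crawler allowed:", other_allowed)
```

To allow CCBot instead, remove its `Disallow: /` rule (or delete the CCBot group entirely) and rerun the check. This only tests how the rules parse locally; it says nothing about whether a page has already been captured in an earlier snapshot.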

Why it matters for AI visibility

Because so many foundation models train on Common Crawl data, blocking CCBot can reduce your brand's presence across a wide range of AI systems — including those without their own named crawler. Allowing it is a broad-coverage decision.

GPTBot: OpenAI's web crawler that fetches content to train and update its models.

ClaudeBot: Anthropic's crawler used to collect content for training and grounding Claude models.

Common Crawl: An open archive of billions of web pages used as training data for many of the world's largest AI models.

robots.txt: A plain-text file at the root of your domain that tells crawlers which paths they may or may not fetch.

Checklist: AI crawler access. GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and other AI crawlers need clear permission to fetch important pages.
