AI Crawlers
CCBot
The Common Crawl bot that builds the publicly accessible web archive used to train many AI models.
Definition
CCBot is the web crawler operated by Common Crawl, a non-profit that crawls the web and publishes a free, open archive of the content it collects. This archive is one of the most widely used training data sources for large language models: GPT-3 and the original LLaMA, among many others, were trained in part on Common Crawl data. Blocking CCBot in robots.txt keeps your content out of future Common Crawl snapshots, but it does not remove pages already captured in earlier ones.
Why it matters for AI visibility
Because so many foundation models train on Common Crawl data, blocking CCBot can reduce your brand's presence across a wide range of AI systems, including those that have no named crawler of their own. Allowing it is a broad-coverage decision: a single robots.txt rule affects how your content reaches many downstream models at once.
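To see how a given robots.txt treats CCBot, one option is to test it locally before publishing. The sketch below uses Python's standard-library robots.txt parser; the robots.txt content, the example.com URL, and the helper name is_allowed are all hypothetical, and the standard-library parser may not match CCBot's own rule handling in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks CCBot site-wide, allows every other crawler.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL only; substitute a page from your own site.
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/pricing")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/pricing")

print("CCBot allowed:", ccbot_allowed)
print("Other crawlers allowed:", other_allowed)
```

Replacing "Disallow: /" with "Allow: /" in the CCBot group (or deleting that group) reverses the decision, so the same check can confirm either policy.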