Common Crawl
An open archive of billions of web pages used as training data for many of the world's largest AI models.
Definition
Common Crawl is a non-profit organization that has been crawling and archiving the web since 2008. Its open dataset, refreshed with a new crawl roughly every month or two and totaling petabytes of web content, is one of the primary training corpora for large language models; filtered Common Crawl text is publicly documented as a source for models such as GPT-3, LLaMA, and Falcon. Because the archive is open to anyone, content captured in its snapshots can end up in the base knowledge of many AI systems, not only those that run their own named crawlers.
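One way to see whether a site appears in a given snapshot is to query Common Crawl's public URL index. The sketch below is a minimal illustration, not an official client. It assumes the index server at index.commoncrawl.org, uses a placeholder crawl ID (the current list is published at https://index.commoncrawl.org/collinfo.json), and assumes each result line is a JSON object with url, timestamp, and status fields returned as strings. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Public Common Crawl index server; each crawl exposes its own endpoint.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query_url(crawl_id: str, domain: str) -> str:
    """Build an index query URL for every captured page under `domain`."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_captures(body: str) -> list[dict]:
    """Parse newline-delimited JSON, one capture per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def summarize(captures: list[dict]) -> dict:
    """Count distinct pages captured with status 200 and report the
    earliest and latest capture timestamps (assumed to be strings)."""
    ok = [c for c in captures if c.get("status") == "200"]
    stamps = sorted(c["timestamp"] for c in ok)
    return {
        "pages": len({c["url"] for c in ok}),
        "first": stamps[0] if stamps else None,
        "last": stamps[-1] if stamps else None,
    }
```

To use it, fetch the URL returned by build_index_query_url("CC-MAIN-2024-10", "example.com") with any HTTP client and pass the response text through parse_captures and summarize. A page count of zero for a crawl suggests the site was not captured in that snapshot.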
Why it matters for AI visibility
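Foundation models trained on Common Crawl data carry impressions of brands that were formed during pretraining, before fine-tuning and RLHF (reinforcement learning from human feedback). If your brand was absent, inaccurately described, or represented by outdated pages in historical snapshots, those impressions can persist and be difficult to correct, because past snapshots remain in the archive and continue to be reused as training data.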