Common Crawl is the pretraining substrate for most LLMs. Whether your domain appears — and how recently — shapes your base citation probability before any other optimization.
Common Crawl is a nonprofit that maintains an open repository of web crawl data covering billions of pages. It is the primary pretraining data source for most large language models, including those underlying GPT, Claude, and Gemini. Whether your domain appears in Common Crawl — and how frequently, and how recently — directly shapes your base citation probability.
A domain that does not appear in Common Crawl starts every LLM interaction with a structural disadvantage. There is no training data to draw from. All other citation optimizations (structured data, Wikipedia presence, Wikidata entities) assume a base level of Common Crawl presence.
How to check your current coverage
Common Crawl provides an index API that lets you query coverage for any URL. The simplest check is the Common Crawl Index Server at index.commoncrawl.org: pick a crawl, then query your domain with a trailing wildcard (for example, example.com/*) to see how many pages appear. Repeat across crawls to see which ones include your domain and when it was last captured.
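The check described above can be sketched in Python. This is a minimal sketch, not an official client: the function names are hypothetical, and it assumes the index server accepts `url`, `output`, and `limit` query parameters and answers with one JSON object per line, each carrying `url` and `timestamp` fields. Networking is left out so the parsing logic can be read and tested on its own.

```python
import json
from urllib.parse import urlencode


def build_index_query(cdx_api: str, domain: str, limit: int = 50) -> str:
    """Build a query URL covering every captured URL under `domain`.

    `cdx_api` is the per-crawl endpoint (assumed to be listed in the
    "cdx-api" field of index.commoncrawl.org/collinfo.json).
    """
    params = {"url": f"{domain}/*", "output": "json", "limit": limit}
    return f"{cdx_api}?{urlencode(params)}"


def parse_index_response(body: str) -> list:
    """Parse a newline-delimited JSON response into a list of records."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def summarize(records: list) -> dict:
    """Count distinct captured URLs and find the latest capture timestamp."""
    # Timestamps are assumed to be 14-digit strings (YYYYMMDDhhmmss),
    # so the lexicographic maximum is also the most recent capture.
    timestamps = [r["timestamp"] for r in records if "timestamp" in r]
    return {
        "pages": len({r["url"] for r in records if "url" in r}),
        "latest_capture": max(timestamps) if timestamps else None,
    }
```

To use it against the live service, fetch the built URL with any HTTP client and pass the response body to `parse_index_response`. Running the same query against several crawl endpoints shows which crawls include the domain.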
Veezow's scan includes a Common Crawl presence check as part of the discoverability score. If your score shows low off-site presence, Common Crawl coverage is the first variable to check.
Factors that affect Common Crawl inclusion
Common Crawl's seed list is not random. It tends to be .com-weighted, which means .io and .co domains are systematically underrepresented, by roughly 19-22% compared to equivalent .com domains. If you are on a non-.com TLD, your coverage gap is structural and requires active remediation.
Beyond TLD, several other factors influence whether and how often your pages are crawled:
- Whether CCBot is allowed in robots.txt (a hard requirement — blocked bots mean no crawl)
- Sitemap.xml presence and quality (provides URL list for crawlers to follow)
- Inbound links from .com domains already in Common Crawl
- Domain age and historic crawl frequency
- Page load speed and accessibility during crawl windows
How to improve Common Crawl coverage
First, verify that CCBot is allowed in robots.txt. If no rule covers it, add a "User-agent: CCBot" line followed by an "Allow: /" line. Then ensure your sitemap.xml is complete, accurate, and referenced from robots.txt with a "Sitemap:" line.
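The robots.txt check can be automated with Python's standard `urllib.robotparser`. The sketch below is illustrative: `check_robots` is a hypothetical helper, and `example.io` is a placeholder domain. It parses robots.txt text you have already downloaded, reports whether CCBot may fetch the site root, and lists any declared sitemaps.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt: str, site_root: str, agent: str = "CCBot") -> dict:
    """Report whether `agent` may crawl `site_root` and which sitemaps are declared."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "agent_allowed": parser.can_fetch(agent, site_root),
        # site_maps() returns None when no Sitemap: line is present.
        "sitemaps": parser.site_maps() or [],
    }


# Placeholder robots.txt: blocks everything by default, then allows CCBot.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: CCBot
Allow: /

Sitemap: https://example.io/sitemap.xml
"""

if __name__ == "__main__":
    print(check_robots(SAMPLE, "https://example.io/"))
```

A result with `agent_allowed` set to false, or an empty `sitemaps` list, points at the two fixes described above.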
Build links from .com domains that are already in Common Crawl. Guest posts, directory listings, partner pages, and press coverage on .com news sites all create crawl pathways to your domain.
If your domain is on .io or .co, consider whether a .com redirect or primary domain migration is warranted from a citation strategy perspective. This is a significant decision — but the 22% systematic CC coverage gap for .io domains is measurable and persistent.
Monitor your crawl frequency
Common Crawl releases new crawls roughly monthly. Check whether your domain appears in recent crawls (last 3-6 months). Stale coverage means your most recent content and updates are not reflected in LLM training data. Domains that are crawled frequently give models a more current picture of their content, which makes citations more likely to be accurate.
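A freshness check along these lines might look like the following sketch. It assumes you have already collected capture timestamps for your domain as 14-digit `YYYYMMDDhhmmss` strings (the format the index records are assumed to use). The helper names and the six-month default are illustrative choices taken from the 3-6 month window above.

```python
from datetime import datetime, timezone
from typing import List, Optional


def months_since_latest_capture(timestamps: List[str], now: datetime) -> Optional[int]:
    """Whole calendar months between the newest capture and `now`."""
    if not timestamps:
        return None  # the domain was never captured
    latest = datetime.strptime(max(timestamps), "%Y%m%d%H%M%S")
    return (now.year - latest.year) * 12 + (now.month - latest.month)


def is_stale(timestamps: List[str], now: datetime, max_age_months: int = 6) -> bool:
    """True when there are no captures or the newest one is older than the threshold."""
    age = months_since_latest_capture(timestamps, now)
    return age is None or age > max_age_months


if __name__ == "__main__":
    captures = ["20240110120000", "20240305080000"]  # placeholder data
    print(is_stale(captures, datetime.now(timezone.utc)))
```

Counting calendar months rather than days keeps the check aligned with the crawl cadence. Tighten `max_age_months` to 3 if you publish frequently.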
What this means for citation strategy
Common Crawl coverage is foundational. Before investing in Wikipedia presence, structured data, or Wikidata entities, ensure your domain is fully crawlable and frequently included. Without Common Crawl presence, no other citation optimization can reach its full potential. Run a free scan to check your Common Crawl presence and discoverability score.
Measure your current position
Veezow scans your domain for the signals covered in this playbook — robots.txt access, structured data, Common Crawl presence, bot permissions, and off-site mentions — and scores them in one report.