An analysis of 10,000 domains finds systematic Common Crawl (CC) coverage gaps for .io and .co domains.
Analysis of Common Crawl coverage across 10,000 active domains reveals a systematic gap by top-level domain: .io domains are included in Common Crawl at a rate 22% lower than comparable .com domains, and .co domains lag by 19%. Domain age and traffic volume alone do not explain the gap.
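To make "included in Common Crawl" concrete, here is a minimal sketch of how one might check whether a single domain has any captures in one crawl, using Common Crawl's public URL index server. This is not the method behind the figures above. The crawl ID below is a placeholder, the helper names are hypothetical, and the handling of a 404 as "no captures" reflects how the index server is generally observed to respond, so treat it as an assumption.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder crawl ID. Pick a current one from the list
# published at index.commoncrawl.org before running this.
CRAWL_ID = "CC-MAIN-2024-10"


def build_index_query(domain: str, crawl_id: str = CRAWL_ID, limit: int = 5) -> str:
    """Build an index query URL covering the domain and its subdomains."""
    params = urlencode({
        "url": domain,
        "matchType": "domain",  # match the domain plus any subdomains
        "output": "json",       # one JSON object per line
        "limit": limit,
    })
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def parse_captures(body: str) -> list:
    """Parse the newline-delimited JSON returned by the index server."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def is_in_crawl(domain: str, crawl_id: str = CRAWL_ID) -> bool:
    """Return True if the index lists at least one capture for the domain."""
    try:
        with urlopen(build_index_query(domain, crawl_id, limit=1), timeout=30) as resp:
            return len(parse_captures(resp.read().decode("utf-8"))) > 0
    except HTTPError as err:
        # Assumption: the server answers 404 when it has no captures.
        if err.code == 404:
            return False
        raise
```

Calling `is_in_crawl("example.io")` makes a live network request, so results depend on the crawl you choose. A single crawl is a snapshot; checking several recent crawl IDs gives a better picture of whether a domain is revisited.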
The most likely driver: Common Crawl's seed list and crawl priority have historically been .com-weighted, and .io/.co domains, despite their prevalence in the developer and startup ecosystem, have not been proportionally represented in the crawl corpus. This directly affects LLM training data, since most foundation models draw heavily from Common Crawl.
Mitigation strategies: submit your sitemap to Common Crawl's bulk access queue, ensure your robots.txt explicitly allows CCBot, and build authoritative off-site links from .com domains that are likely already in Common Crawl. The goal is to get CCBot to reach your domain and revisit it regularly.
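The robots.txt step is the easiest one to verify yourself. Below is a small sketch using Python's standard-library `urllib.robotparser` to test whether a given robots.txt lets the `CCBot` user agent fetch a URL. The `allows_ccbot` helper, the `example.io` URL, and both sample files are illustrative assumptions, and the standard-library parser may not match CCBot's own rule handling in every edge case.

```python
from urllib.robotparser import RobotFileParser


def allows_ccbot(robots_txt: str, url: str = "https://example.io/") -> bool:
    """Return True if the given robots.txt text lets CCBot fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


# Hypothetical robots.txt with an explicit allow rule for CCBot.
EXPLICIT_ALLOW = """\
User-agent: CCBot
Allow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical robots.txt that blocks CCBot entirely.
BLOCKED = """\
User-agent: CCBot
Disallow: /
"""

print(allows_ccbot(EXPLICIT_ALLOW))  # True
print(allows_ccbot(BLOCKED))         # False
```

To test a live site, fetch its `/robots.txt` and pass the text to `allows_ccbot`. A blanket `User-agent: *` block with `Disallow: /` also shuts out CCBot unless a more specific CCBot group overrides it.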
*What this means:* If you're on a .io or .co domain, your training data footprint is structurally smaller than that of your .com peers, and the citation gap is measurable. The fix is partial but real. Run a free scan to check your Common Crawl presence and off-site coverage.
Put this into practice
See how your domain scores on the signals covered in this edition. Veezow runs a free AI visibility scan — robots, sitemap, structured data, bot access, and off-site presence.