VEEZOW

03 / EDITIONS · 2026.03.24

Common Crawl inclusion rates diverge by TLD: .io domains lag .com by 22%

Analysis of 10,000 domains finds systematic CC coverage gaps for .io and .co domains.

Analysis of Common Crawl coverage across 10,000 active domains reveals a systematic gap by top-level domain: .io domains are included in Common Crawl at 22% lower rates than comparable .com domains, and .co domains lag by 19%. The gap is not explained by domain age or traffic volume alone.
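To make the arithmetic behind a figure like "22% lower" concrete, here is a minimal sketch of how a per-TLD inclusion rate and a relative gap could be computed. It assumes the gap is relative to the .com rate (the article does not say whether the figures are relative or in percentage points), and the domain list, the `in_crawl` set, and both function names are made up for illustration; none of them come from the analysis itself.

```python
from collections import Counter

def inclusion_rates(domains, in_crawl):
    """Return {tld: share of domains with that TLD that appear in the crawl}."""
    totals, hits = Counter(), Counter()
    for domain in domains:
        tld = domain.rsplit(".", 1)[-1].lower()
        totals[tld] += 1
        if domain in in_crawl:
            hits[tld] += 1
    return {tld: hits[tld] / totals[tld] for tld in totals}

def relative_gap(rates, tld, baseline="com"):
    """Relative shortfall of `tld` versus `baseline` (0.25 means 25% lower)."""
    return 1 - rates[tld] / rates[baseline]

# Hypothetical toy sample, NOT the article's 10,000-domain data set.
domains = ["a.com", "b.com", "c.com", "d.com", "e.io", "f.io", "g.io", "h.io"]
in_crawl = {"a.com", "b.com", "c.com", "d.com", "e.io", "f.io", "g.io"}

rates = inclusion_rates(domains, in_crawl)
print(rates["com"], rates["io"], relative_gap(rates, "io"))
```

In this toy sample every .com domain is included and three of four .io domains are, so the relative gap is 25%. A real comparison would also need to control for domain age and traffic, as the analysis describes.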

The most likely driver: Common Crawl's seed list and crawl priorities have historically been .com-weighted, and .io/.co domains, despite their prevalence in the developer and startup ecosystem, have not been proportionally represented in the crawl corpus. This directly affects LLM training data, since many foundation models draw heavily from Common Crawl.

Mitigation strategies: submit your sitemap to Common Crawl's bulk access queue, ensure your robots.txt explicitly allows CCBot, and build authoritative off-site links from .com domains that are likely in CC. The goal is to get CCBot to reach your domain and to keep it coming back regularly.
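The robots.txt step is the easiest of these to verify locally. The sketch below uses Python's standard `urllib.robotparser` to test whether a given robots.txt body lets the `CCBot` user agent fetch a URL. The helper name `allows_ccbot`, the `example.io` URL, and both sample files are assumptions for illustration, and the standard-library parser may not match CCBot's own rule matching in every edge case.

```python
from urllib.robotparser import RobotFileParser

def allows_ccbot(robots_txt: str, url: str = "https://example.io/") -> bool:
    """Return True if the given robots.txt text permits CCBot to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)

# Hypothetical robots.txt with an explicit CCBot group.
explicit_allow = """\
User-agent: CCBot
Allow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical robots.txt that shuts CCBot out entirely.
blocked = """\
User-agent: CCBot
Disallow: /
"""

print(allows_ccbot(explicit_allow))
print(allows_ccbot(blocked))
```

To run the same check against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file over the network instead of parsing a string.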

*What this means:* If you're on a .io or .co domain, your training data footprint is structurally smaller than .com peers — and the citation gap is measurable. The fix is partial but real. Run a free scan to check your Common Crawl presence and off-site coverage.
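For a manual check of Common Crawl presence, the public CDX index at index.commoncrawl.org can be queried per crawl. The following is a rough sketch rather than a complete client: it only builds the query URL and summarizes a newline-delimited JSON response. The crawl ID `CC-MAIN-2024-10` is used as a placeholder (current IDs are listed in the index server's `collinfo.json`), the function names are hypothetical, and `sample_body` is invented data shaped like an index response.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"

def build_index_query(crawl_id: str, domain: str) -> str:
    """Build a CDX index URL asking for every capture under `domain`."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"

def summarize_captures(body: str) -> dict:
    """Summarize a newline-delimited JSON index response."""
    records = [json.loads(line) for line in body.splitlines() if line.strip()]
    ok = [r for r in records if r.get("status") == "200"]
    return {
        "captures": len(records),
        "ok_captures": len(ok),
        "unique_urls": len({r["url"] for r in records}),
    }

# Invented sample shaped like an index response, not real crawl data.
sample_body = "\n".join([
    json.dumps({"url": "https://example.io/", "status": "200", "timestamp": "20240301000000"}),
    json.dumps({"url": "https://example.io/docs", "status": "200", "timestamp": "20240301000100"}),
    json.dumps({"url": "https://example.io/old", "status": "404", "timestamp": "20240301000200"}),
])

print(build_index_query("CC-MAIN-2024-10", "example.io"))
print(summarize_captures(sample_body))
```

The response body for a real domain can be fetched with `urllib.request.urlopen` on the generated URL. Comparing the capture count across several recent crawls gives a rough sense of how regularly the domain is revisited.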
