GPTBot

OpenAI's web crawler that fetches content to train and update its models.

Definition

GPTBot is OpenAI's web crawler. It identifies itself with the user-agent token `GPTBot` when it fetches publicly accessible web pages. OpenAI uses the collected content as training data for models like GPT-4, so that newer models reflect more recent content. Site owners can allow or block the crawler in robots.txt with rules placed under `User-agent: GPTBot`.
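As a rough illustration of how a `User-agent: GPTBot` group is interpreted, the sketch below parses a robots.txt with Python's standard-library `urllib.robotparser` and asks whether a given URL may be fetched. The robots.txt content, the `example.com` URLs, and the helper name `is_allowed` are assumptions made for this example. OpenAI's own crawler may resolve edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is kept out of /private/ and allowed elsewhere;
# every other crawler may fetch everything.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/blog/", "https://example.com/private/page"):
        print(url, is_allowed(ROBOTS_TXT, "GPTBot", url))
```

To block GPTBot from the whole site, the group would instead contain `Disallow: /`.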

Why it matters for AI visibility

If GPTBot is blocked, OpenAI's models are less likely to have been trained on your brand's most accurate, up-to-date pages, which reduces the chance of your brand appearing in ChatGPT-generated recommendations.

ClaudeBot: Anthropic's crawler used to collect content for training and grounding Claude models.

PerplexityBot: Perplexity AI's crawler that indexes content for real-time answer generation.

Google-Extended: Google's robots.txt user-agent token for opting content out of AI product training, separate from regular search crawling.

CCBot: The Common Crawl bot that builds the publicly accessible web archive used to train many AI models.

robots.txt: A plain-text file at the root of your domain that tells crawlers which paths they may or may not fetch.

Checklist: AI crawler access. GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and other AI crawlers need clear permission to fetch important pages.
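One way to work through such a check is to test each AI crawler token against the pages that matter most. The sketch below does this with the standard-library `urllib.robotparser`. The robots.txt content, the page paths, and the helper name `audit_crawler_access` are hypothetical, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this glossary entry.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]

# Hypothetical robots.txt used only for this illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /drafts/
"""


def audit_crawler_access(robots_txt: str, paths: list[str]) -> dict[str, list[str]]:
    """Map each AI crawler token to the subset of `paths` it is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        agent: [p for p in paths if not parser.can_fetch(agent, p)]
        for agent in AI_CRAWLERS
    }


if __name__ == "__main__":
    report = audit_crawler_access(SAMPLE_ROBOTS_TXT, ["/", "/pricing", "/drafts/new"])
    for agent, blocked in report.items():
        print(f"{agent}: {'blocked on ' + ', '.join(blocked) if blocked else 'all paths allowed'}")
```

An empty list for a crawler means none of the listed pages are blocked for it under this file.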
