robots.txt is a plain-text file at the root of your site (example.com/robots.txt) that tells crawlers (Google, Bing, GPTBot, ClaudeBot, PerplexityBot, and every other compliant bot) which URL paths they may or may not request. It controls crawling, not indexing. The protocol was formalized as RFC 9309 in 2022 and is followed by all major search and AI crawlers, but it's a request, not a wall: bad actors can ignore it.
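To see how a compliant crawler reads these rules, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules and URLs are hypothetical and inlined so nothing is fetched over the network. This parser does plain prefix matching and does not understand the `*` and `$` wildcards, so treat it as an illustration rather than a faithful model of any particular crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# Googlebot has no group of its own, so it falls back to the * group.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/users"))  # False

# GPTBot has its own group, which blocks the whole site for that bot only.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))       # False
```

The check only tells you what a well-behaved bot would do; nothing here prevents a non-compliant client from requesting the URL anyway.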
- The file must live at https://yourdomain.com/robots.txt, not in subdirectories.
- Disallow blocks crawling. It does NOT remove a page from Google's index: pages already indexed can stay in results with no snippet. To remove a page from the index, use the noindex meta tag (and don't block the URL in robots.txt, or Google can't see the noindex).
- noindex in robots.txt was deprecated. Google stopped supporting Noindex: directives in robots.txt on September 1, 2019. Use the meta tag or X-Robots-Tag HTTP header instead.
- Google ignores Crawl-delay. Bing and Yandex respect it. Google retired the Search Console crawl-rate limiter in January 2024 and now adjusts its crawl rate automatically.
- Blocking Googlebot does NOT block GPTBot. You need a separate block for each AI bot you want to exclude.

Since 2023, every major AI company has launched a crawler. Each has its own user-agent string and its own purpose (training data vs. real-time retrieval). Decide whether to allow each one based on whether you want your content used for that purpose.

| User-agent | Operator | Purpose | Recommended for visibility |
|---|---|---|---|
| Googlebot | Google | Search index | Allow |
| Bingbot | Microsoft | Search index (also feeds ChatGPT search) | Allow |
| Google-Extended | Google | Gemini training (does NOT affect Search) | Allow for AI visibility |
| GPTBot | OpenAI | Model training | Allow for AI training inclusion |
| OAI-SearchBot | OpenAI | ChatGPT search retrieval | Allow for ChatGPT citation |
| ChatGPT-User | OpenAI | User-triggered fetches in ChatGPT | Allow for in-chat link previews |
| ClaudeBot | Anthropic | Model training | Allow for AI training inclusion |
| Claude-SearchBot | Anthropic | Claude search retrieval | Allow for Claude citation |
| Claude-User | Anthropic | User-triggered fetches in Claude | Allow |
| PerplexityBot | Perplexity | Search retrieval and citation | Allow for Perplexity citation |
| Perplexity-User | Perplexity | User-triggered fetches | Allow |
| Applebot | Apple | Spotlight, Siri, Safari suggestions | Allow |
| Applebot-Extended | Apple | Apple Intelligence training | Allow for Apple AI inclusion |
| CCBot | Common Crawl | Open dataset (used by many AI models) | Allow for broad AI inclusion |
| Bytespider | ByteDance | TikTok / Doubao training | Optional |

Strategic note: Many publishers blocked AI crawlers in 2023–2024, then quietly re-allowed them as AI search drove an increasing share of referral traffic. Industry data from late 2025 shows ChatGPT, Perplexity, and Google AI Overviews collectively drive 5–15% of organic discovery for content sites — and that share is rising. Blocking AI crawlers blocks your future visibility.
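If you want to check which of the crawlers in the table actually visit your site, one option is to classify the User-Agent header of each logged request. The sketch below is a rough illustration: the function name and the sample header strings are hypothetical, and the purposes are copied from the table. Google-Extended and Applebot-Extended are left out on the assumption that they act only as robots.txt control tokens and do not appear as request user-agents.

```python
from typing import Optional, Tuple

# Tokens and purposes taken from the table above.
# Assumption: Google-Extended and Applebot-Extended are robots.txt-only
# tokens, so they are not expected in request headers and are omitted.
CRAWLER_PURPOSES = {
    "Googlebot": "search index",
    "Bingbot": "search index",
    "GPTBot": "model training",
    "OAI-SearchBot": "search retrieval",
    "ChatGPT-User": "user-triggered fetch",
    "ClaudeBot": "model training",
    "Claude-SearchBot": "search retrieval",
    "Claude-User": "user-triggered fetch",
    "PerplexityBot": "search retrieval",
    "Perplexity-User": "user-triggered fetch",
    "Applebot": "search index",
    "CCBot": "open dataset",
    "Bytespider": "model training",
}


def classify_user_agent(header: str) -> Optional[Tuple[str, str]]:
    """Return (token, purpose) for a known crawler, or None otherwise."""
    lowered = header.lower()
    # Check longer tokens first so a short name never shadows a longer one.
    for token in sorted(CRAWLER_PURPOSES, key=len, reverse=True):
        if token.lower() in lowered:
            return token, CRAWLER_PURPOSES[token]
    return None


# Illustrative header strings, not copied from any vendor documentation.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0; +https://example.com/bot)"))
print(classify_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/128.0"))
```

User-Agent headers can be spoofed, so a match here is a hint about who is crawling, not proof.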
Each block starts with one or more User-agent lines, followed by Disallow and Allow rules. Rules apply to URL paths, not URLs — they are case-sensitive on the path component.

    # Apply to all crawlers
    User-agent: *
    Disallow: /admin/      # block this directory
    Disallow: /search      # block any URL starting with /search
    Allow: /admin/help     # but allow this subpath

    # Wildcards (still part of the * group)
    Disallow: /*?print=    # block any URL with ?print=
    Disallow: /*.pdf$      # block all PDFs ($ = end of URL)

    # Specific rule for one bot: overrides the * block
    User-agent: Googlebot
    Allow: /

    # Sitemap (absolute URL, can appear anywhere in the file)
    Sitemap: https://example.com/sitemap.xml

- The most specific user-agent group wins: a User-agent: GPTBot block overrides User-agent: * for GPTBot only.
- Allow beats Disallow when both match a URL and the Allow path is more specific (longer).
- * matches any sequence; $ matches end of URL.
- Anything after # on a line is ignored.
- An empty Disallow: means "allow everything", the same as Allow: /.

A general-purpose baseline:

    User-agent: *
    Disallow: /admin/
    Disallow: /wp-admin/
    Disallow: /cgi-bin/
    Disallow: /search
    Disallow: /*?utm_

    Sitemap: https://example.com/sitemap.xml

An online store with faceted navigation:

    User-agent: *
    Disallow: /cart
    Disallow: /checkout
    Disallow: /account
    Disallow: /*?sort=
    Disallow: /*?filter=
    Disallow: /*?color=
    Disallow: /*?size=
    # /products/ and /collections/ are crawlable by default. An explicit
    # Allow: /products/ would outrank the shorter parameter rules above
    # (longest match wins) and re-open /products/...?sort= URLs.

    Sitemap: https://store.example.com/sitemap.xml
    Sitemap: https://store.example.com/sitemap-products.xml

Faceted navigation (sort, filter, size) creates millions of near-duplicate URLs and is a top crawl-budget waster. Block at the parameter level.
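The matching rules (wildcards, end-of-URL anchor, longest match wins, Allow wins a tie) can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not any crawler's actual implementation: the function names are hypothetical, and percent-encoding normalization is ignored. It is useful for checking whether a parameter rule such as `/*?sort=` really catches the URLs you expect.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then turn the escaped '*' back into '.*'.
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """rules: ("allow" | "disallow", pattern) pairs from ONE user-agent group."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty Disallow: matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            # The longest pattern wins; on equal length, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


RULES = [
    ("disallow", "/cart"),
    ("disallow", "/*?sort="),
    ("disallow", "/*.pdf$"),
]

print(is_allowed(RULES, "/shoes?sort=price"))            # False
print(is_allowed(RULES, "/shoes?page=2&sort=price"))     # True: sort is not the first parameter
print(is_allowed(RULES, "/files/guide.pdf"))             # False
print(is_allowed(RULES, "/files/guide.pdf?download=1"))  # True: $ requires .pdf at the very end

# A longer Allow path outranks the shorter parameter rule.
with_allow = RULES + [("allow", "/products/")]
print(is_allowed(with_allow, "/products/shoes?sort=price"))  # True
```

Two things stand out in this model: `/*?sort=` only matches when `sort` is the first query parameter, and adding `Allow: /products/` re-opens sorted product URLs because its path is longer than the parameter rule.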
Explicitly allowing AI crawlers:

    User-agent: *
    Disallow: /admin/
    Disallow: /search

    # Explicitly allow AI crawlers to document your policy. A bot that has
    # its own group ignores the * group, so the shared Disallow rules are
    # repeated here.
    User-agent: GPTBot
    User-agent: OAI-SearchBot
    User-agent: ClaudeBot
    User-agent: PerplexityBot
    User-agent: Google-Extended
    Allow: /
    Disallow: /admin/
    Disallow: /search

    Sitemap: https://example.com/sitemap.xml

Block AI training but allow AI search retrieval: some publishers want their content cited in AI answers (driving referral traffic) without their text being absorbed into training data. The pattern:

    # Block training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    # Allow retrieval/search crawlers (cite, don't train)
    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Claude-SearchBot
    Allow: /

    # Default rules for everyone else
    User-agent: *
    Disallow: /admin/

    Sitemap: https://example.com/sitemap.xml

Disallow: / on the production site is usually copied from a staging environment by accident. It blocks the entire site from every compliant crawler, and organic traffic can drop toward zero within days. Always grep your robots.txt for a Disallow: / line under User-agent: * before deploying.
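A plain grep for `Disallow: /` also flags files that deliberately block individual bots, so a pre-deploy check can be made group-aware. The following is a minimal sketch, assuming a hypothetical helper name and a simplified view of group parsing; it only reports a blanket block inside a group that includes `User-agent: *`, and it does not account for Allow rules that might override it.

```python
def blocks_everyone(robots_txt: str) -> bool:
    """True if a group that includes 'User-agent: *' contains 'Disallow: /'."""
    agents = []       # user-agents named by the current group
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = (
    "User-agent: GPTBot\nDisallow: /\n\n"
    "User-agent: *\nDisallow: /admin/\n\n"
    "Sitemap: https://example.com/sitemap.xml\n"
)

print(blocks_everyone(STAGING))     # True: do not deploy this
print(blocks_everyone(PRODUCTION))  # False: only GPTBot is blocked
```

Wired into a deploy script, a `True` result would be a reason to fail the build before the file reaches production.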
Blocking /css/, /js/, or /wp-content/ prevents Google from rendering your pages. Since 2014, Google has explicitly warned that this hurts rankings — its algorithms need to see the rendered page to evaluate mobile-friendliness, layout shift, and content visibility.
robots.txt is publicly readable at yourdomain.com/robots.txt. Listing Disallow: /admin-backup-2024/ tells the world that path exists. Use authentication or remove the content; never use robots.txt as security.
Using Disallow to remove pages from Google does not work: Disallow blocks crawling, not indexing. A page already in the index will remain in search results, sometimes with the message "No information is available for this page", until you remove it. To de-index, add <meta name="robots" content="noindex"> AND make sure the URL is NOT blocked by robots.txt (otherwise Google can't fetch it to see the noindex).
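The interaction between the two mechanisms can be checked offline. The sketch below is an illustration under assumptions: the function and class names are hypothetical, the response headers and HTML are passed in rather than fetched, and `urllib.robotparser` stands in for the crawler's own rule matching (it does not support wildcards). It answers one question: would a crawler ever get to see the noindex?

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives.append((attr.get("content") or "").lower())


def noindex_will_be_seen(robots_txt, url, headers, html, agent="Googlebot") -> bool:
    """True only if the page is crawlable AND carries a noindex directive."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, url):
        return False  # the page is never fetched, so the noindex stays invisible
    header = headers.get("X-Robots-Tag", "").lower()
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in header or any("noindex" in d for d in meta.directives)


PAGE = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
URL = "https://example.com/old/page"

# Blocked in robots.txt: the noindex tag is never read.
print(noindex_will_be_seen("User-agent: *\nDisallow: /old/", URL, {}, PAGE))    # False
# Crawlable: the noindex tag is read.
print(noindex_will_be_seen("User-agent: *\nDisallow: /admin/", URL, {}, PAGE))  # True
```

The header lookup here is case-sensitive on the key, which is fine for a plain dict in a sketch but would need a case-insensitive lookup against real HTTP responses.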
Disallow: /admin and Disallow: /admin/ are different rules. The first blocks /admin, /admin.html, /admin/anything, and /admin-page. The second blocks only URLs under /admin/, such as /admin/anything. Paths are case-sensitive: Disallow: /Admin/ does not block /admin/.
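For rules without wildcards, the behaviour described above comes down to a case-sensitive prefix test. A tiny sketch, with a hypothetical helper name, makes the difference between the two rules visible:

```python
def blocked_by(rule_path: str, url_path: str) -> bool:
    """Wildcard-free robots.txt matching: a case-sensitive prefix test."""
    return url_path.startswith(rule_path)


paths = ["/admin", "/admin.html", "/admin/anything", "/admin-page", "/Admin/"]

print(f"{'path':<18}{'/admin':<10}{'/admin/':<10}")
for path in paths:
    without_slash = blocked_by("/admin", path)
    with_slash = blocked_by("/admin/", path)
    print(f"{path:<18}{str(without_slash):<10}{str(with_slash):<10}")
```

Only `/admin/anything` is blocked by both rules, and `/Admin/` is blocked by neither because of the capital letter.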
If robots.txt returns a 5xx server error, Google treats the entire site as disallowed for up to 30 days. Make sure your robots.txt URL is on a reliable codepath — many CDN misconfigurations route it through a slow origin and cause intermittent 503s.
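How a crawler reacts to the robots.txt response status can be summarized as a small decision function. This sketch follows the general categories in RFC 9309 (a missing file means no restrictions, a server error means assume a full disallow); the function name and return labels are hypothetical, and individual crawlers may deviate in details such as retry timing.

```python
def crawl_policy_for_status(status: int) -> str:
    """Map the HTTP status of /robots.txt to a crawler's likely policy."""
    if 200 <= status < 300:
        return "parse"            # use the rules in the response body
    if 300 <= status < 400:
        return "follow_redirect"  # fetch the redirect target instead
    if 400 <= status < 500:
        return "allow_all"        # file unavailable: no restrictions assumed
    if 500 <= status < 600:
        return "disallow_all"     # server error: treat the site as off limits
    return "undefined"


for code in (200, 301, 404, 503):
    print(code, crawl_policy_for_status(code))
```

The asymmetry is the point: a 404 costs you nothing, while an intermittent 503 on the same URL can pause crawling of the whole site.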
Check your server logs for GPTBot, ClaudeBot, PerplexityBot, etc., to confirm they're actually crawling (or not) what you intended.

A robots.txt file is technically optional: without one, crawlers assume everything is allowed. But you should have one because (a) it's where you declare your sitemap, (b) it's how you control faceted navigation crawl waste, and (c) without it Google logs a "robots.txt not found" message every time it crawls. A minimal valid file is one line: Sitemap: https://example.com/sitemap.xml.
Disallow does not remove a page from Google. It blocks crawling, but pages already indexed stay in results, sometimes with no snippet. To remove a page from the index, use the noindex meta tag and ensure the URL is NOT blocked in robots.txt; otherwise Google can't crawl the page to see the noindex.
To block AI training crawlers, add an explicit User-agent block with Disallow: / for each one. The main training bots are GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google Gemini), CCBot (Common Crawl, used by many models), Applebot-Extended (Apple Intelligence), and Bytespider (ByteDance). Note: blocking these has no effect on Google Search rankings, because Google-Extended is a separate product from Googlebot.
You can block AI training while still allowing AI search retrieval, because the retrieval crawlers are separate user-agents: OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot and Claude-User (Anthropic), PerplexityBot and Perplexity-User (Perplexity). Allow these while disallowing the training bots. See the "Block AI training but allow AI search retrieval" template above.
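If you maintain the two lists of bots in code, the file can be generated rather than edited by hand. The following is a minimal sketch: the function name and parameters are hypothetical, and the bot lists are the ones named in this section.

```python
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended",
    "CCBot", "Applebot-Extended", "Bytespider",
]
RETRIEVAL_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "Claude-User", "PerplexityBot", "Perplexity-User",
]


def build_robots_txt(blocked, allowed, default_disallow, sitemap) -> str:
    """Render one group per bot, then the * group, then the sitemap."""
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in default_disallow]
    lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(
    blocked=TRAINING_BOTS,
    allowed=RETRIEVAL_BOTS,
    default_disallow=["/admin/"],
    sitemap="https://example.com/sitemap.xml",
)
print(robots_txt)
```

Keeping the lists in one place makes it easier to add a new crawler to the right side of the policy when an operator announces one.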
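When crawlers seem to ignore your robots.txt, common causes are:

- The file is not at https://yourdomain.com/robots.txt.
- The file is on a different host: a robots.txt on www.example.com doesn't apply to example.com.

Use Search Console's robots.txt report to see what Google actually fetched.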
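Because a robots.txt file applies only to the exact scheme and host it is served from, it helps to compute which file governs a given page. A small sketch with a hypothetical helper name, using only `urllib.parse`:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); replace path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("https://www.example.com/blog/post?x=1"))
print(robots_url_for("https://example.com/shop/"))
print(robots_url_for("https://store.example.com/products/shoes"))
```

The three pages above are governed by three different files, so each host needs its own robots.txt (or a redirect setup you have verified).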
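Google does not honor Crawl-delay: it has stated explicitly that Crawl-delay directives are ignored. Google retired the Search Console crawl-rate limiter in January 2024 and now adjusts Googlebot's crawl rate automatically based on how your server responds. Bing, Yandex, and most other crawlers do honor Crawl-delay.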
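A crawler that honors Crawl-delay reads it from its own group. As an illustration, Python's standard-library parser exposes the value through `crawl_delay()`; the rules below are hypothetical and the 10-second value is just an example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a Crawl-delay only for Bing's crawler.
ROBOTS_WITH_DELAY = """\
User-agent: bingbot
Crawl-delay: 10
Disallow: /search

User-agent: *
Disallow: /search
"""

delay_parser = RobotFileParser()
delay_parser.parse(ROBOTS_WITH_DELAY.splitlines())

print(delay_parser.crawl_delay("bingbot"))    # 10
print(delay_parser.crawl_delay("Googlebot"))  # None: the * group sets no delay
```

Because bingbot now has its own group, the `Disallow: /search` rule is repeated there; a bot with a dedicated group does not inherit rules from the `*` group.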
You can declare multiple sitemaps: list one Sitemap: line per file. There's no limit, and it's the recommended way to surface separate product, blog, news, and image sitemaps to crawlers.
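To confirm that every Sitemap: line is picked up, you can parse the file and list what a client would discover. This sketch uses `RobotFileParser.site_maps()` from the standard library (available in Python 3.8 and later); the file contents are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_WITH_SITEMAPS = """\
User-agent: *
Disallow: /cart

Sitemap: https://store.example.com/sitemap.xml
Sitemap: https://store.example.com/sitemap-products.xml
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_WITH_SITEMAPS.splitlines())

# site_maps() returns the declared URLs in file order, or None if there are none.
for sitemap_url in sitemap_parser.site_maps() or []:
    print(sitemap_url)
```

Sitemap lines are independent of user-agent groups, so they are found regardless of where in the file they appear.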
robots.txt controls crawling at the URL-pattern level for the whole site. The <meta name="robots"> tag controls indexing for one specific page after it's been crawled. Use robots.txt to save crawl budget on URLs you don't care about; use meta robots noindex to remove a specific page from the search index.
robots.txt is one short text file with disproportionate consequences. Get the syntax right, declare your sitemap, decide deliberately about each AI crawler, and never use it as a security mechanism or a substitute for noindex. Test in Search Console's robots.txt report whenever you change it, and grep for Disallow: / before every deploy.