Robots.txt: The Complete Guide for SEO and AI Crawlers (2026)

Saar Twito, Founder & SEO Engineer (9 min read)

Hi, I'm Saar - a software engineer, SEO specialist, and lecturer who loves building tools and teaching tech.

What Is robots.txt?

robots.txt is a plain-text file at the root of your site (example.com/robots.txt) that tells crawlers (Google, Bing, GPTBot, ClaudeBot, PerplexityBot, and every other compliant bot) which URL paths they may or may not request. It controls crawling, not indexing. The protocol was formalized as RFC 9309 in 2022 and is followed by all major search and AI crawlers, but it's a request, not a wall: bad actors can ignore it.

Key Facts (TL;DR)

  • Location: Must be at https://yourdomain.com/robots.txt — not in subdirectories.
  • Crawling vs indexing: Disallow blocks crawling. It does NOT remove a page from Google's index — pages already indexed can stay in results with no snippet. To remove from the index, use the noindex meta tag (and don't block the URL in robots.txt, or Google can't see the noindex).
  • noindex in robots.txt was deprecated. Google stopped supporting Noindex: directives in robots.txt on September 1, 2019. Use the meta tag or X-Robots-Tag HTTP header instead.
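  • Google ignores Crawl-delay. Bing respects it; support among other crawlers varies. Google retired the Search Console crawl-rate limiter in January 2024, so to slow Googlebot down temporarily, return 500, 503, or 429 responses.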
  • AI crawlers obey separate user-agents. Blocking Googlebot does NOT block GPTBot. You need a separate block for each AI bot you want to exclude.
  • File size limit: 500 KiB for Google. Content beyond that is ignored.
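
To illustrate the user-agent point above, here is a minimal sketch using Python's standard-library urllib.robotparser. The file contents and URL are made up for the example. The standard-library parser is simpler than Google's (it does not implement wildcards), so treat it as a rough check rather than a faithful simulation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only Googlebot.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

url = "https://example.com/blog/post"
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", url)  # its own group blocks it
gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", url)        # no group applies to it
print(googlebot_ok, gptbot_ok)
```

Googlebot is refused while GPTBot is still allowed, which is why each AI bot needs its own group if you want to exclude it.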

The AI Crawlers You Need to Know About

Since 2023, every major AI company has launched a crawler. Each has its own user-agent string and its own purpose — training data vs. real-time retrieval. Decide whether to allow each one based on whether you want your content used for that purpose.

| User-agent | Operator | Purpose | Recommended for visibility |
| --- | --- | --- | --- |
| Googlebot | Google | Search index | Allow |
| Bingbot | Microsoft | Search index (also feeds ChatGPT search) | Allow |
| Google-Extended | Google | Gemini training (does NOT affect Search) | Allow for AI visibility |
| GPTBot | OpenAI | Model training | Allow for AI training inclusion |
| OAI-SearchBot | OpenAI | ChatGPT search retrieval | Allow for ChatGPT citation |
| ChatGPT-User | OpenAI | User-triggered fetches in ChatGPT | Allow for in-chat link previews |
| ClaudeBot | Anthropic | Model training | Allow for AI training inclusion |
| Claude-SearchBot | Anthropic | Claude search retrieval | Allow for Claude citation |
| Claude-User | Anthropic | User-triggered fetches in Claude | Allow |
| PerplexityBot | Perplexity | Search retrieval and citation | Allow for Perplexity citation |
| Perplexity-User | Perplexity | User-triggered fetches | Allow |
| Applebot | Apple | Spotlight, Siri, Safari suggestions | Allow |
| Applebot-Extended | Apple | Apple Intelligence training | Allow for Apple AI inclusion |
| CCBot | Common Crawl | Open dataset (used by many AI models) | Allow for broad AI inclusion |
| Bytespider | ByteDance | TikTok / Doubao training | Optional |

Strategic note: Many publishers blocked AI crawlers in 2023–2024, then quietly re-allowed them as AI search drove an increasing share of referral traffic. Industry data from late 2025 shows ChatGPT, Perplexity, and Google AI Overviews collectively drive 5–15% of organic discovery for content sites — and that share is rising. Blocking AI crawlers blocks your future visibility.

Robots.txt Syntax

Each block starts with one or more User-agent lines, followed by Disallow and Allow rules. Rules apply to URL paths, not URLs — they are case-sensitive on the path component.

# Apply to all crawlers
User-agent: *
Disallow: /admin/        # block this directory
Disallow: /search        # block any URL starting with /search
Allow: /admin/help       # but allow this subpath

# Specific rule for one bot — overrides the * block
User-agent: Googlebot
Allow: /

# Wildcards (rules must sit under a User-agent line)
User-agent: *
Disallow: /*?print=        # block any URL with ?print=
Disallow: /*.pdf$          # block all PDFs ($ = end of URL)

# Sitemap (absolute URL, can appear anywhere in the file)
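Sitemap: https://example.com/sitemap.xml

  • Most-specific match wins. A User-agent: GPTBot block overrides User-agent: * for GPTBot only, and GPTBot then ignores the * rules entirely; the two groups are not merged.
  • Longest match wins between Allow and Disallow. When both match a URL, the rule with the longer path applies; if they are the same length, Allow wins.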
  • Wildcards: * matches any sequence; $ matches end of URL.
  • Comments: Anything after # on a line is ignored.
  • Empty Disallow: means "allow everything" — the same as Allow: /.
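
The precedence rules above can be sketched in a few lines of Python. This is a simplified model, not Google's parser: it handles one user-agent group, the * and $ wildcards, longest-match precedence, and the Allow-wins-ties rule. The function names and sample rules are illustrative.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything, then turn the escaped '*' back into "any sequence".
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """rules holds ("allow" | "disallow", pattern) pairs for ONE group.
    path is the URL path plus query string, e.g. "/page?print=1"."""
    best_len, allowed = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if not pattern:  # an empty Disallow: matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            # Longer pattern wins; on a tie, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

RULES = [
    ("disallow", "/admin/"),
    ("allow", "/admin/help"),
    ("disallow", "/*?print="),
    ("disallow", "/*.pdf$"),
]
for path in ["/admin/settings", "/admin/help/faq", "/page?print=1", "/files/report.pdf"]:
    print(path, is_allowed(RULES, path))
```

Only /admin/help/faq is allowed here, because Allow: /admin/help is longer than Disallow: /admin/.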

Production Templates

Standard site (allow all, block admin, declare sitemap)

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cgi-bin/
Disallow: /search
Disallow: /*?utm_

Sitemap: https://example.com/sitemap.xml

E-commerce (block faceted navigation, keep product pages crawlable)

User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?color=
Disallow: /*?size=
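# No Allow lines for /products/ or /collections/: they are crawlable by default,
# and a longer Allow path would override the shorter parameter rules above.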

Sitemap: https://store.example.com/sitemap.xml
Sitemap: https://store.example.com/sitemap-products.xml

Faceted navigation (sort, filter, size) creates millions of near-duplicate URLs and is a top crawl-budget waster. Block at the parameter level.
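
One detail worth checking with parameter rules: a pattern such as /*?sort= only matches when sort directly follows the ?, so a URL like /shoes?color=red&sort=price is not covered. A small helper, sketched below with hypothetical names, generates both the ? and & variants for each parameter.

```python
def facet_rules(params: list[str]) -> list[str]:
    """Build Disallow lines that match each parameter in any query position."""
    lines = []
    for name in params:
        lines.append(f"Disallow: /*?{name}=")  # parameter is first in the query string
        lines.append(f"Disallow: /*&{name}=")  # parameter follows another one
    return lines

rules = facet_rules(["sort", "filter"])
print("\n".join(rules))
```

Paste the generated lines under the relevant User-agent group and test a few real faceted URLs before deploying.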

Allow AI crawlers explicitly (recommended for content sites)

User-agent: *
Disallow: /admin/
Disallow: /search

# Explicitly name AI crawlers to document your policy.
# A bot that matches its own group ignores the * group,
# so the shared Disallow rules are repeated here.
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Disallow: /admin/
Disallow: /search
Allow: /

Sitemap: https://example.com/sitemap.xml

Block AI training but allow AI search retrieval

Some publishers want their content cited in AI answers (driving referral traffic) without their text being absorbed into training data. The pattern:

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval/search crawlers (cite, don't train)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Default rules for everyone else
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
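
If you maintain this policy across several sites, generating the file from two lists keeps the training and retrieval groups from drifting apart. The sketch below is one possible approach; build_robots_txt and its parameters are hypothetical, and the bot lists simply mirror the template above.

```python
def build_robots_txt(blocked: list[str], allowed: list[str],
                     default_disallow: list[str], sitemap: str) -> str:
    """Render a robots.txt with one group per policy."""
    parts = []
    if blocked:
        # Several User-agent lines may share one group of rules.
        parts.append("\n".join(f"User-agent: {ua}" for ua in blocked) + "\nDisallow: /")
    if allowed:
        parts.append("\n".join(f"User-agent: {ua}" for ua in allowed) + "\nAllow: /")
    default = "User-agent: *\n" + "\n".join(f"Disallow: {p}" for p in default_disallow)
    parts.append(default.rstrip())
    parts.append(f"Sitemap: {sitemap}")
    return "\n\n".join(parts) + "\n"

TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Bytespider"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Claude-SearchBot"]

robots_txt = build_robots_txt(TRAINING_BOTS, RETRIEVAL_BOTS, ["/admin/"],
                              "https://example.com/sitemap.xml")
print(robots_txt)
```

The output groups consecutive User-agent lines over a shared rule, which is equivalent to the one-group-per-bot layout in the template.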

The Six Mistakes That De-Index Sites

1. Disallow: / on the production site

Usually copied from a staging environment by accident. Under User-agent: *, it blocks the entire site from every compliant crawler, and organic traffic can collapse within days. Before deploying, always grep your robots.txt for Disallow: / alone on a line and confirm each hit is intentional.
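
A plain grep also flags files that block a single bot on purpose. A slightly more targeted pre-deploy check, sketched here in Python, reports which user-agents have a bare Disallow: / in their group so a pipeline can fail only when * is among them. The function name is hypothetical, and the check deliberately ignores Allow overrides.

```python
def fully_blocked_agents(robots_txt: str) -> set[str]:
    """Return the (lowercased) user-agents whose group contains a bare 'Disallow: /'."""
    blocked: set[str] = set()
    agents: list[str] = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.update(agents)
    return blocked

staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin/\n"
print(fully_blocked_agents(staging), fully_blocked_agents(production))
```

In a deploy script, you might abort when "*" or "googlebot" appears in the returned set.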

2. Blocking CSS, JavaScript, or images

Blocking /css/, /js/, or /wp-content/ prevents Google from rendering your pages. Since 2014, Google has explicitly warned that this hurts rankings — its algorithms need to see the rendered page to evaluate mobile-friendliness, layout shift, and content visibility.

3. Using robots.txt to "hide" sensitive URLs

robots.txt is publicly readable at yourdomain.com/robots.txt. Listing Disallow: /admin-backup-2024/ tells the world that path exists. Use authentication or remove the content; never use robots.txt as security.

4. Using Disallow to remove pages from Google

Disallow blocks crawling, not indexing. A page already in the index will remain in search results, sometimes with the message "No information is available for this page", until you remove it. To de-index: add <meta name="robots" content="noindex"> AND make sure the URL is NOT blocked by robots.txt (otherwise Google can't fetch it to see the noindex).
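
To audit pages you want removed, you can check both conditions in code. The sketch below uses only Python's standard library; has_noindex and deindex_will_work are hypothetical helper names, and the caller is assumed to supply the page HTML and response headers.

```python
from html.parser import HTMLParser

class _RobotsMetaParser(HTMLParser):
    """Collect the content of every <meta name="robots"> tag."""
    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = {k.lower(): (v or "") for k, v in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())

def has_noindex(html: str, headers: dict[str, str]) -> bool:
    """True if the page asks not to be indexed via meta tag or X-Robots-Tag."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    header = next((v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    values = parser.directives + [header.lower()]
    return any("noindex" in token for value in values for token in value.split(","))

def deindex_will_work(noindex: bool, blocked_by_robots: bool) -> bool:
    """A noindex is only seen if the crawler is allowed to fetch the page."""
    return noindex and not blocked_by_robots

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(deindex_will_work(has_noindex(page, {}), blocked_by_robots=True))
```

The example prints False: the page carries noindex, but a robots.txt block would stop the crawler from ever reading it.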

5. Trailing-slash and case-sensitivity bugs

Disallow: /admin and Disallow: /admin/ are different rules. The first blocks /admin, /admin.html, /admin/anything, and /admin-page. The second blocks only /admin/.... Paths are case-sensitive: Disallow: /Admin/ does not block /admin/.
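
For rules without wildcards, matching is a case-sensitive prefix test, which a short Python sketch makes concrete. The helper name and sample paths are illustrative.

```python
def blocked_by(rule_path: str, url_path: str) -> bool:
    """A wildcard-free rule matches any path that starts with it, case-sensitively."""
    return url_path.startswith(rule_path)

paths = ["/admin", "/admin.html", "/admin/anything", "/admin-page", "/Admin/"]
no_slash = [p for p in paths if blocked_by("/admin", p)]
with_slash = [p for p in paths if blocked_by("/admin/", p)]
print(no_slash)
print(with_slash)
```

The rule without the trailing slash catches four of the five paths; the rule with it catches only /admin/anything, and neither matches /Admin/.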

6. 5xx errors on robots.txt

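If robots.txt returns a 5xx server error, Google treats the entire site as disallowed and pauses crawling. According to Google's documentation, if the error persists it falls back on the last good cached copy of the file for up to 30 days, and after that it assumes no restrictions as long as the rest of the site is reachable. Make sure your robots.txt URL is on a reliable codepath: many CDN misconfigurations route it through a slow origin and cause intermittent 503s.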
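
Google's documentation describes how it interprets the HTTP status of the robots.txt response. The mapping below is a simplified summary of that behavior for use in a monitoring check; the function name and return strings are made up, and other crawlers may behave differently.

```python
def google_robots_behavior(status: int) -> str:
    """Summarize how Google is documented to react to a robots.txt status code."""
    if 200 <= status < 300:
        return "parse rules"
    if status == 429 or 500 <= status < 600:
        # Server errors and rate limiting: crawling is paused.
        return "treat site as disallowed (temporary)"
    if 400 <= status < 500:
        # Other client errors, including 404: as if no robots.txt existed.
        return "no restrictions"
    if 300 <= status < 400:
        return "follow redirect"
    return "undefined"

for code in (200, 301, 404, 429, 503):
    print(code, google_robots_behavior(code))
```

An uptime probe on /robots.txt that alerts on any 5xx or 429 catches the dangerous cases before they affect crawling.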

How to Test and Monitor

  1. Google Search Console → Settings → robots.txt report. Replaced the old standalone tester in late 2023. Shows the current robots.txt Google has fetched, the last fetch date, errors, and blocked URLs.
  2. URL Inspection tool (Search Console). For any URL, see "Allowed by robots.txt" or "Blocked by robots.txt" with the matching directive.
  3. Server logs. Filter for user-agents GPTBot, ClaudeBot, PerplexityBot, etc., to confirm they're actually crawling (or not) what you intended.
  4. Crawl tools (Screaming Frog, Sitebulb, Ahrefs) can simulate any user-agent and tell you which pages are blocked.
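
For the server-log step, a short script is often enough. The sketch below assumes the common "combined" access log format and counts hits per AI bot and path; the sample lines and their user-agent strings are invented for illustration, and real strings vary by bot version.

```python
import re
from collections import Counter

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Claude-SearchBot", "PerplexityBot", "CCBot", "Bytespider"]

# Combined format: ... "METHOD path HTTP/x" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def count_bot_hits(log_lines) -> Counter:
    """Count requests per (bot, path) for known AI user-agents."""
    hits: Counter = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        ua = match.group("ua").lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                hits[(bot, match.group("path"))] += 1
                break
    return hits

SAMPLE = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [10/Jan/2026:10:00:05 +0000] "GET /admin/ HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2026:10:00:09 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]
hits = count_bot_hits(SAMPLE)
print(hits)
```

A hit on a path you disallowed for that bot (such as /admin/ here) is the signal to investigate, keeping in mind that user-agent strings can be spoofed.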

FAQ

Do I need a robots.txt file?

Technically no — without one, crawlers assume everything is allowed. But you should have one because (a) it's where you declare your sitemap, (b) it's how you control faceted navigation crawl waste, and (c) without it Google logs a "robots.txt not found" message every time it crawls. A minimal valid file is one line: Sitemap: https://example.com/sitemap.xml.

Does robots.txt remove a page from Google's index?

No. Disallow blocks crawling, but pages already indexed stay in results, sometimes with no snippet. To remove a page from the index, use the noindex meta tag and ensure the URL is NOT blocked in robots.txt; otherwise Google can't crawl the page to see the noindex.

How do I block AI crawlers from training on my content?

Add explicit User-agent blocks for each AI training crawler with Disallow: /. The main training bots are GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google Gemini), CCBot (Common Crawl, used by many models), Applebot-Extended (Apple Intelligence), and Bytespider (ByteDance). Note: blocking these has no effect on Google Search rankings — Google-Extended is a separate product from Googlebot.

Can I block training while allowing AI search citations?

Yes. The retrieval crawlers are separate user-agents: OAI-SearchBot and ChatGPT-User (OpenAI), Claude-SearchBot and Claude-User (Anthropic), PerplexityBot and Perplexity-User (Perplexity). Allow these while disallowing the training bots. See the "Block AI training but allow AI search retrieval" template above.

Why is Google ignoring my robots.txt rules?

Common causes:

  1. The file isn't at the exact root https://yourdomain.com/robots.txt.
  2. It returns a 5xx or 404 instead of 200.
  3. Wrong protocol or subdomain — robots.txt at www.example.com doesn't apply to example.com.
  4. The rule has a syntax error and was silently dropped.
  5. A more-specific user-agent block overrides your wildcard rule.

Use Search Console's robots.txt report to see what Google actually fetched.
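
Causes 1 and 3 come down to scope: a robots.txt file applies only to the exact scheme and host it is served from. A tiny Python helper (hypothetical name) shows which file governs a given page URL.

```python
from urllib.parse import urlsplit

def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme, host, and port)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

www = robots_url_for("https://www.example.com/blog/post?x=1")
apex = robots_url_for("https://example.com/blog/post")
print(www)
print(apex)
```

The two results differ, so rules published on www.example.com never apply to example.com.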

Does Google honor Crawl-delay?

No. Google has stated explicitly that Crawl-delay directives are ignored. The Search Console crawl-rate setting was retired in January 2024, so to slow Googlebot down temporarily, return 500, 503, or 429 responses. Bing honors Crawl-delay; support among other crawlers varies.

Can robots.txt have multiple sitemaps?

Yes. List one Sitemap: line per file. There's no limit, and it's the recommended way to surface separate product, blog, news, and image sitemaps to crawlers.
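
To confirm that every sitemap you declared is actually picked up, you can read the list back with Python's standard-library parser (site_maps() is available from Python 3.8). The file contents below are an example only.

```python
from urllib.robotparser import RobotFileParser

def list_sitemaps(robots_txt: str) -> list[str]:
    """Return every Sitemap: URL declared in the file, in order."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when there are none

EXAMPLE = """\
User-agent: *
Disallow: /cart

Sitemap: https://store.example.com/sitemap.xml
Sitemap: https://store.example.com/sitemap-products.xml
"""
sitemaps = list_sitemaps(EXAMPLE)
print(sitemaps)
```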

What's the difference between robots.txt and the meta robots tag?

robots.txt controls crawling at the URL-pattern level for the whole site. The <meta name="robots"> tag controls indexing for one specific page after it's been crawled. Use robots.txt to save crawl budget on URLs you don't care about; use meta robots noindex to remove a specific page from the search index.

Conclusion

robots.txt is one short text file with disproportionate consequences. Get the syntax right, declare your sitemap, decide deliberately about each AI crawler, and never use it as a security mechanism or a substitute for noindex. Test in Search Console's robots.txt report whenever you change it, and grep for Disallow: / before every deploy.