Site Crawlability: The Complete SEO Guide (2026)

Saar Twito, Founder & SEO Engineer · 9 min read

Hi, I'm Saar - a software engineer, SEO specialist, and lecturer who loves building tools and teaching tech.


What Is Crawlability?

Crawlability is whether automated bots — Googlebot, Bingbot, GPTBot (ChatGPT), ClaudeBot (Anthropic), PerplexityBot — can discover and fetch the URLs on your site. It is the prerequisite for indexing, which is the prerequisite for ranking. If a bot cannot reach a URL, the page does not exist as far as that engine is concerned. Crawlability is decided by four things: internal links, your robots.txt, the HTTP status the server returns, and whether your content is rendered in HTML or only after JavaScript executes.

Key Facts (TL;DR)

  • Crawlability ≠ indexability. A page can be crawlable but blocked from indexing via noindex, or indexable in theory but never crawled because nothing links to it.
  • Orphan pages are effectively invisible. A page with zero internal links and no sitemap entry will not be found by Googlebot unless an external link points to it. Internal linking is the primary discovery mechanism.
  • AI crawlers usually don't execute JavaScript. GPTBot and ClaudeBot fetch raw HTML and stop. If your content only appears after client-side React renders, it is invisible to them.
  • Googlebot does render JS via a two-pass process (Web Rendering Service), but with delay and a render budget. Server-side or static rendering is faster and more reliable.
  • robots.txt is the #1 self-inflicted wound. A single Disallow: / left over from staging stops Googlebot from crawling the entire site, and pages then start dropping out of search results. Always check yours at /robots.txt.
  • Crawl budget only matters at scale. Google's crawl budget guidance is aimed at large sites; sites under ~10,000 URLs rarely have a crawl budget problem. Below that, focus on discoverability and rendering.

The Crawlability Checklist

Six requirements must be met for a URL to be reliably crawled by Google and modern AI search bots.

Requirement | What it means | How to verify
Discoverable | Linked from at least one other crawlable page or listed in sitemap.xml | Screaming Frog "Orphan URLs" report; GSC Pages report
Allowed by robots.txt | Not matched by a Disallow: rule for the relevant user-agent | GSC robots.txt report (Settings); curl https://site/robots.txt
Returns 200 OK | HTTP 200, not 4xx, 5xx, or a soft 404 | curl -I https://site/page; GSC URL Inspection
HTML contains content | Main content present in initial HTML, not only after JS hydration | View source (Ctrl+U, or Cmd+Option+U in Chrome on macOS); fetch with curl -s https://site/page and search the output for a key phrase
No noindex meta | If you want it indexed, no <meta name="robots" content="noindex"> | View source; GSC URL Inspection
Reasonable response time | Server responds in < 1s; doesn't time out under bot load | GSC Crawl Stats report; server logs
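Three of these checks (status code, content in raw HTML, noindex meta) can be scripted once a page has been fetched. The sketch below is a minimal illustration, not a full audit tool: check_page and its parameters are hypothetical names, and it works on a status code and HTML string you have already retrieved so it makes no network calls itself.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the content attribute of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())


def check_page(status, html, phrase):
    """Return checklist results for one already-fetched page.

    `phrase` is a piece of main content you expect in the raw HTML
    (hypothetical parameter, chosen by you per page).
    """
    parser = _RobotsMetaParser()
    parser.feed(html)
    noindex = any("noindex" in d for d in parser.directives)
    return {
        "returns_200": status == 200,
        "content_in_html": phrase.lower() in html.lower(),
        "no_noindex": not noindex,
    }


sample = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body><h1>Blue widgets</h1></body></html>"
)
print(check_page(200, sample, "blue widgets"))
# {'returns_200': True, 'content_in_html': True, 'no_noindex': False}
```

The sketch does not detect soft 404s or a noindex sent via an HTTP header, so treat a clean result as a first pass rather than a verdict.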

How Crawlers Actually Find Your Pages

There are three discovery channels. A crawlable site uses all three.

  1. Internal links. A crawler lands on one URL, parses every <a href>, and queues those URLs. Pages that are not linked are not found. Navigation, footers, contextual links inside articles, and breadcrumbs all count.
  2. XML sitemap. A list of URLs at /sitemap.xml, submitted in Google Search Console and referenced from robots.txt. This is your "here's everything" list — especially important for large sites or pages with weak internal linking.
  3. External links. Backlinks from other sites bring crawlers in to URLs they had not seen before.
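The internal-link channel can be sketched as a breadth-first walk: start from one URL, extract every <a href>, and queue what you find. The example below assumes a hypothetical in-memory pages dict (URL to HTML) in place of real HTTP fetches, so the names discover and LinkExtractor are illustrative only. Any URL in pages that the walk never reaches is an orphan from the crawler's point of view.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def discover(start_url, pages):
    """Breadth-first discovery over `pages`, a dict of URL -> raw HTML."""
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        extractor = LinkExtractor()
        extractor.feed(pages.get(url, ""))
        for href in extractor.hrefs:
            # Resolve relative links and drop #fragments before queueing.
            target = urldefrag(urljoin(url, href)).url
            if target in pages and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


pages = {
    "https://example.com/": '<a href="/blog/">Blog</a>',
    "https://example.com/blog/": '<a href="post-1#intro">Post</a>',
    "https://example.com/blog/post-1": "",
    "https://example.com/landing": "",  # nothing links here
}
orphans = set(pages) - discover("https://example.com/", pages)
print(orphans)  # {'https://example.com/landing'}
```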

robots.txt: The 5 Patterns You Need

robots.txt lives at the root (https://example.com/robots.txt) and tells bots what they may fetch. It is a request, not a security mechanism — pages disallowed in robots.txt can still be indexed if linked externally. For full guidance see the complete robots.txt guide.

# 1. Standard: allow everything, point at sitemap
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

# 2. Block admin and cart areas only
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Allow: /
Sitemap: https://example.com/sitemap.xml

# 3. Block AI crawlers but allow Google
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /

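# 4. Block search-result and faceted URLs
# (the & variants catch the parameter when it is not first in the query string)
User-agent: *
Disallow: /search?
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?filter=
Disallow: /*&filter=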

# 5. THE KILLER: never deploy this to production
User-agent: *
Disallow: /
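Rules like these can be tested before deploy with Python's standard-library urllib.robotparser. The sketch below is one possible pre-release check; is_allowed and the two rule strings are illustrative names, not part of any tool mentioned in this guide.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under this robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


BLOCK_AI = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

STAGING_LEFTOVER = """\
User-agent: *
Disallow: /
"""

print(is_allowed(BLOCK_AI, "GPTBot", "https://example.com/pricing"))      # False
print(is_allowed(BLOCK_AI, "Googlebot", "https://example.com/pricing"))   # True
print(is_allowed(STAGING_LEFTOVER, "Googlebot", "https://example.com/"))  # False
```

One caveat: the standard-library parser matches plain path prefixes, so wildcard rules such as those in pattern 4 may not be evaluated the way Googlebot evaluates them. For those, confirm the result in Search Console.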

JavaScript Rendering: Why AI Crawlers See Less Than Googlebot

Modern crawlers fall into two camps: those that execute JavaScript and those that don't. This determines whether a single-page React/Vue/Angular app is visible to them. See the client-side rendering SEO guide for the full picture.

Crawler | Executes JavaScript? | Implication
Googlebot | Yes (headless Chromium, two-pass) | JS sites can rank, but with render delay; SSR is still preferred
Bingbot | Partial (Edge-based) | Less reliable than Googlebot for heavy JS
GPTBot (OpenAI / ChatGPT) | No | Sees raw HTML only; CSR sites are invisible
ClaudeBot (Anthropic) | No | Same: needs HTML-rendered content
PerplexityBot | No | Same

The fix: server-side render (SSR) or statically generate (SSG) the pages you want crawled. In Next.js this is the default for the App Router. Avoid hiding primary content behind useEffect fetches.
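A rough way to spot a client-side-rendered shell is to count the words a non-JS crawler would see in the raw HTML, ignoring anything inside script and style tags. The sketch below is a heuristic under that assumption: visible_words is a hypothetical helper, and what counts as "too few" words is for you to decide per template.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of non-rendered tags."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_words(html):
    """Count words present in raw HTML without executing any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


csr_shell = (
    '<html><body><div id="root"></div>'
    '<script>window.__DATA__ = {"title": "Blue widgets"}</script></body></html>'
)
ssr_page = "<html><body><h1>Blue widgets</h1><p>In stock today.</p></body></html>"

print(visible_words(csr_shell))  # 0
print(visible_words(ssr_page))   # 5
```

A page that scores near zero here but looks complete in a browser is a likely candidate for SSR or SSG.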

The 7 Mistakes That Break Crawlability

  1. Staging robots.txt deployed to production. Disallow: / survives a release and the entire site disappears. Always diff robots.txt before deploy.
  2. Orphan pages. A page exists at a URL but nothing links to it and it's not in the sitemap. Googlebot never finds it.
  3. Soft 404s. The server returns 200 OK with a "Page not found" body. Google de-indexes these. Return real 404 or 410. See HTTP status codes for SEO.
  4. Critical content rendered client-side only. The product price, the article body, the H1 — all injected by JS after page load. AI crawlers see nothing.
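  5. Redirect loops and long chains. A → B → C → A never resolves. Googlebot follows up to about 10 hops before giving up, and other crawlers may stop sooner, so redirect straight to the final URL.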
  6. Blocking CSS/JS in robots.txt. Google needs to render the page to evaluate it. Blocking /static/ or /_next/ breaks rendering.
  7. Slow server. If the server takes 8s to respond, Googlebot reduces crawl rate and many URLs simply don't get crawled.
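Redirect problems are easy to detect once you have a map of which URL redirects where, for example exported from a crawl. The sketch below assumes a hypothetical redirects dict (source to target) and a max_hops limit chosen for illustration; trace_redirects is not a function from any tool named in this guide.

```python
def trace_redirects(start, redirects, max_hops=10):
    """Follow `redirects` (dict of URL -> target) starting at `start`.

    Returns (chain, outcome), where outcome is "ok", "loop", or "too_long".
    `max_hops` is an illustrative limit, not a documented crawler setting.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in redirects:
        current = redirects[current]
        chain.append(current)
        if current in seen:
            return chain, "loop"
        seen.add(current)
        if len(chain) - 1 > max_hops:
            return chain, "too_long"
    return chain, "ok"


loop = {"/a": "/b", "/b": "/c", "/c": "/a"}
print(trace_redirects("/a", loop))
# (['/a', '/b', '/c', '/a'], 'loop')

print(trace_redirects("/old", {"/old": "/new"}))
# (['/old', '/new'], 'ok')
```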

How to Audit Crawlability

A repeatable playbook with the exact tool names and reports.

  • Google Search Console > Pages. The buckets to inspect: Discovered – currently not indexed (Google found the URL but hasn't crawled it — usually a quality or budget signal), Crawled – currently not indexed (crawled but Google chose not to index), Blocked by robots.txt, Page with redirect, Soft 404, Server error (5xx).
  • Google Search Console > Settings > Crawl Stats. Shows crawl requests per day, average response time, and the host status. A response time spike or a host status that reports problems is the early warning.
  • URL Inspection tool. Paste any URL — GSC tells you if it's indexed, the canonical Google chose, and the rendered HTML Googlebot saw. Run Test live URL to see the current state.
  • Screaming Frog SEO Spider. Crawl your domain with Respect robots.txt off and on, compare. Check the Response Codes tab for 4xx/5xx, the Directives tab for noindex pages, and the Orphan URLs report (after connecting a sitemap and GA).
  • Server logs. Filter by user-agent Googlebot, GPTBot, ClaudeBot. Pages bots never visit are invisible regardless of what other tools say.
  • Greadme's crawler. Multi-page audit that flags blocked URLs, broken internal links, JS-only content, and missing sitemap entries in one pass.
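The server-log step can be approximated with a short script. The sketch below assumes access logs in the common "combined" format, where the user-agent is the last quoted field; the regex, the BOTS tuple, and the function names are illustrative, and the IP addresses in the sample come from documentation ranges. It matches on the user-agent string only, which can be spoofed, so it does not verify that a hit really came from the named bot.

```python
import re
from collections import Counter

BOTS = ("Googlebot", "GPTBot", "ClaudeBot")

# Request path, status code, and the final quoted field (the user-agent).
LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) .*"(?P<ua>[^"]*)"$'
)


def bot_hits(lines):
    """Count requests per (bot, path) from combined-format log lines."""
    hits = Counter()
    for line in lines:
        match = LINE.search(line.strip())
        if not match:
            continue
        ua = match.group("ua").lower()
        for bot in BOTS:
            if bot.lower() in ua:
                hits[(bot, match.group("path"))] += 1
    return hits


def unvisited(known_paths, hits):
    """Paths from `known_paths` that no tracked bot requested."""
    visited = {path for (_bot, path) in hits}
    return sorted(set(known_paths) - visited)


log = [
    '192.0.2.1 - - [10/Jan/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5123 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Jan/2026:10:00:05 +0000] "GET /blog/ HTTP/1.1" 200 9001 '
    '"-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [10/Jan/2026:10:00:09 +0000] "GET /pricing HTTP/1.1" 200 5123 '
    '"-" "Mozilla/5.0 (Macintosh)"',
]
hits = bot_hits(log)
print(unvisited(["/pricing", "/blog/", "/landing"], hits))  # ['/landing']
```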

Crawl Budget (Only Read This If You Have ≥ 10,000 URLs)

Crawl budget is the number of URLs Google is willing to crawl on your site in a given period. It's set by two factors: crawl capacity (how much load your server can take) and crawl demand (how much Google wants your content). For sites under ~10,000 URLs, this is rarely a constraint, because Google can usually crawl the whole site without difficulty. For large sites, the levers are: kill duplicate URLs (parameter URLs, filter combinations), return correct status codes (don't serve 200 on dead pages), keep the server fast, and prioritize high-value URLs in the sitemap.
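To see how much of a crawl is spent on parameter duplicates, one option is to group URLs that differ only in sorting, filtering, or tracking parameters. The sketch below is an assumption-laden example: IGNORED_PARAMS is a hypothetical list you would replace with the parameters that do not change content on your own site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list: parameters assumed not to change the page's content.
IGNORED_PARAMS = {"sort", "filter", "utm_source", "utm_medium", "utm_campaign"}


def canonical_form(url):
    """Drop ignored parameters, sort the rest, and remove any fragment."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def duplicate_groups(urls):
    """Map each canonical form to its variants, keeping only real duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_form(url)].append(url)
    return {canon: variants for canon, variants in groups.items() if len(variants) > 1}


crawled = [
    "https://example.com/shoes",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?sort=name&utm_source=mail",
    "https://example.com/shoes?color=red",
]
for canon, variants in duplicate_groups(crawled).items():
    print(canon, len(variants))
# https://example.com/shoes 3
```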

FAQ

How do I know if a page is crawlable?

Paste the URL into Google Search Console's URL Inspection tool and click Test live URL. It tells you if Googlebot can fetch and render the page right now.

What's the difference between crawlable and indexable?

Crawlable = the bot can fetch the URL. Indexable = Google is allowed to and chooses to add it to the index. noindex blocks indexing without blocking crawling.

Should I block AI crawlers like GPTBot?

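That's a content-licensing decision, not an SEO one. Blocking GPTBot does not affect Google rankings, but it opts your content out of OpenAI's model training. Citations in ChatGPT's search answers are fetched by separate OpenAI user agents (OAI-SearchBot and ChatGPT-User), which need their own rules if you want to block or allow them. Many publishers leave these open for visibility.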

Why does Google report "Discovered – currently not indexed"?

Google knows the URL exists (from a sitemap or link) but has chosen not to crawl it yet. Usually means quality or authority is low, or crawl budget is constrained. Improve internal linking and content quality.

Does robots.txt stop a page from being indexed?

No. robots.txt stops crawling, not indexing. A URL blocked in robots.txt can still appear in search results (without snippet) if linked externally. To prevent indexing, use <meta name="robots" content="noindex"> — and the page must be crawlable for Google to see that tag.

Do I need a sitemap if my internal linking is good?

For small sites, no. For sites over a few hundred URLs, yes — it speeds up discovery of new content and surfaces orphan or weakly-linked pages.
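If you do need one, a basic sitemap is simple to generate. The sketch below builds a minimal sitemap.xml with Python's standard library; build_sitemap is a hypothetical helper, and it emits only the required <loc> element for each URL.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap.xml document as a string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & in the URL text.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_sitemap(["https://example.com/", "https://example.com/blog/"]))
```

A single sitemap file is limited to 50,000 URLs, so larger sites split the list across several files and reference them from a sitemap index.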

How often does Googlebot crawl my site?

Varies from minutes (news sites, frequent updates) to weeks (small static sites). Check Crawl Stats in GSC for actuals. Frequency follows quality and update rate.

Can I force Google to recrawl a page?

Use the URL Inspection tool and click Request indexing. Quota is limited (~10/day). For bulk, resubmit the sitemap.

Conclusion

Crawlability is the unglamorous foundation of SEO and AI visibility. Get the basics right: every important URL is internally linked, listed in sitemap.xml, returns 200 OK, contains its content in raw HTML, and is not accidentally blocked in robots.txt. Audit with Search Console's Pages report and Screaming Frog. Everything else — keywords, schema, backlinks — only pays off once crawlers can actually reach the page.