Site Crawlability: The Complete SEO Guide (2026)
What Is Crawlability?
Crawlability is whether automated bots — Googlebot, Bingbot, GPTBot (ChatGPT), ClaudeBot (Anthropic), PerplexityBot — can discover and fetch the URLs on your site. It is the prerequisite for indexing, which is the prerequisite for ranking. If a bot cannot reach a URL, the page does not exist as far as that engine is concerned. Crawlability is decided by four things: internal links, your robots.txt, the HTTP status the server returns, and whether your content is rendered in HTML or only after JavaScript executes.
Key Facts (TL;DR)
- Crawlability ≠ indexability. A page can be crawlable but blocked from indexing via noindex, or indexable in theory but never crawled because nothing links to it.
- Orphan pages don't exist. A page with zero internal links and no sitemap entry is invisible to Googlebot. Internal linking is the primary discovery mechanism.
- AI crawlers usually don't execute JavaScript. GPTBot and ClaudeBot fetch raw HTML and stop. If your content only appears after client-side React renders, it is invisible to them.
- Googlebot does render JS via a two-pass process (Web Rendering Service), but with delay and a render budget. Server-side or static rendering is faster and more reliable.
- robots.txt is the #1 self-inflicted wound. A single Disallow: / left over from staging takes an entire site out of Google overnight. Always check yours at /robots.txt.
- Crawl budget only matters at scale. Google has stated sites under ~10,000 URLs effectively never have a crawl budget problem. Below that, focus on discoverability and rendering.
The Crawlability Checklist
Six requirements must be met for a URL to be reliably crawled by Google and modern AI search bots.
| Requirement | What it means | How to verify |
|---|---|---|
| Discoverable | Linked from at least one other crawlable page or in sitemap.xml | Screaming Frog "Orphan URLs" report; GSC Pages report |
| Allowed by robots.txt | Not matched by a Disallow: rule for the relevant user-agent | GSC robots.txt report (the standalone robots.txt Tester has been retired); curl https://site/robots.txt |
| Returns 200 OK | HTTP 200, not 4xx, 5xx, or a soft 404 | curl -I https://site/page; GSC URL Inspection |
| HTML contains content | Main content present in initial HTML, not only after JS hydration | View source (Cmd+U); curl https://site/page \| grep "keyword" |
| No noindex meta | If you want it indexed, no <meta name="robots" content="noindex"> | View source; GSC URL Inspection |
| Reasonable response time | Server responds in < 1s; doesn't time out under bot load | GSC Crawl Stats report; server logs |
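The "No noindex meta" row can be checked programmatically once you have a page's HTML. The following is a minimal sketch using only the Python standard library; the names RobotsMetaParser and has_noindex are illustrative, and it assumes the directive lives in a generic <meta name="robots"> tag. It does not look at the X-Robots-Tag HTTP header or at bot-specific meta names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise to "".
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            # The content attribute may hold several comma-separated directives.
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html: str) -> bool:
    """Return True if the HTML carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "noindex" in parser.directives
```

Feed it the output of a plain HTTP fetch (the same bytes curl would return) rather than a browser-rendered DOM, so the result reflects what a non-rendering crawler sees.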
How Crawlers Actually Find Your Pages
There are three discovery channels. A crawlable site uses all three.
- Internal links. A crawler lands on one URL, parses every <a href>, and queues those URLs. Pages that are not linked are not found. Navigation, footers, contextual links inside articles, and breadcrumbs all count.
- XML sitemap. A list of URLs at /sitemap.xml, submitted in Google Search Console and referenced from robots.txt. This is your "here's everything" list, especially important for large sites or pages with weak internal linking.
- External links. Backlinks from other sites bring crawlers in to URLs they had not seen before.
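One way to see how the first two channels overlap is to compare the URLs listed in the sitemap with the URLs that internal links actually reach. The sketch below assumes you already have the sitemap XML as a string and a set of internally linked URLs (for example, from a crawl export); the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> set:
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def sitemap_only(sitemap: set, linked: set) -> set:
    """URLs that appear in the sitemap but are reached by no internal link."""
    return sitemap - linked


def missing_from_sitemap(sitemap: set, linked: set) -> set:
    """URLs that are internally linked but absent from the sitemap."""
    return linked - sitemap
```

URLs returned by sitemap_only depend entirely on the sitemap for discovery, which makes them the first candidates for new internal links.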
robots.txt: The 5 Patterns You Need
robots.txt lives at the root (https://example.com/robots.txt) and tells bots what they may fetch. It is a request, not a security mechanism — pages disallowed in robots.txt can still be indexed if linked externally. For full guidance see the complete robots.txt guide.
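To check how a set of rules applies to a particular bot, Python's standard urllib.robotparser can evaluate a robots.txt body offline. This is a rough sketch: the helper name is_allowed and the sample rules are illustrative, and the standard-library parser uses simple prefix matching, so it does not interpret the * and $ wildcards the way Google does. Treat it as a sanity check, not a replica of Googlebot's matcher.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt body for one user agent and one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules: block one AI crawler entirely, block /admin/ for everyone else.
SAMPLE_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""
```

With these rules, is_allowed(SAMPLE_RULES, "GPTBot", "https://example.com/blog/") is False, while the same URL is allowed for Googlebot because it falls through to the * group.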
# 1. Standard: allow everything, point at sitemap
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
# 2. Block admin areas only
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Allow: /
Sitemap: https://example.com/sitemap.xml
# 3. Block AI crawlers but allow Google
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: *
Allow: /
# 4. Block search-result and faceted URLs
User-agent: *
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
# 5. THE KILLER: never deploy this to production
User-agent: *
Disallow: /
JavaScript Rendering: Why AI Crawlers See Less Than Googlebot
Modern crawlers fall into two camps: those that execute JavaScript and those that don't. This determines whether a single-page React/Vue/Angular app is visible to them. See the client-side rendering SEO guide for the full picture.
| Crawler | Executes JavaScript? | Implication |
|---|---|---|
| Googlebot | Yes (headless Chromium, two-pass) | JS sites can rank, but with render delay; SSR is still preferred |
| Bingbot | Partial (Edge-based) | Less reliable than Googlebot for heavy JS |
| GPTBot (OpenAI / ChatGPT) | No | Sees raw HTML only — CSR sites are invisible |
| ClaudeBot (Anthropic) | No | Same — needs HTML-rendered content |
| PerplexityBot | No | Same |
The fix: server-side render (SSR) or statically generate (SSG) the pages you want crawled. In Next.js this is the default for the App Router. Avoid hiding primary content behind useEffect fetches.
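A quick way to approximate what a non-rendering crawler sees is to look for a key phrase in the raw HTML while ignoring anything inside script or style tags. The sketch below is an assumption-laden approximation (the names VisibleTextParser and raw_html_contains are illustrative), not a model of any specific crawler.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside <script>, <style> and <template>."""

    SKIPPED_TAGS = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def raw_html_contains(html: str, phrase: str) -> bool:
    """True if the phrase appears in text a non-JS crawler could read."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return phrase.lower() in " ".join(parser.chunks).lower()
```

If the phrase only shows up inside a JSON blob in a script tag, or not at all, the page is likely relying on client-side rendering for that content.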
The 7 Mistakes That Break Crawlability
- Staging robots.txt deployed to production. Disallow: / survives a release and the entire site disappears. Always diff robots.txt before deploy.
- Orphan pages. A page exists at a URL but nothing links to it and it's not in the sitemap. Googlebot never finds it.
- Soft 404s. The server returns 200 OK with a "Page not found" body. Google de-indexes these. Return real 404 or 410. See HTTP status codes for SEO.
- Critical content rendered client-side only. The product price, the article body, the H1: all injected by JS after page load. AI crawlers see nothing.
- Redirect loops and long chains. A → B → C → A never resolves, and crawlers follow only a limited number of hops before giving up (Google documents up to 10 for Googlebot; other crawlers may stop sooner).
- Blocking CSS/JS in robots.txt. Google needs to render the page to evaluate it. Blocking /static/ or /_next/ breaks rendering.
- Slow server. If the server takes 8s to respond, Googlebot reduces crawl rate and many URLs simply don't get crawled.
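Redirect loops and overlong chains from the list above are easy to detect offline if you have a redirect map, for instance from a crawl export. This is a minimal sketch: the function name trace_redirects, the dict-based map, and the default hop limit of 5 are all assumptions chosen for illustration, so adjust the limit to whichever crawler you care about.

```python
def trace_redirects(start: str, redirects: dict, max_hops: int = 5):
    """Follow a source -> target redirect map starting from `start`.

    Returns (last_url, status), where status is one of
    "ok", "loop", or "too_many_hops".
    """
    seen = [start]
    current = start
    while current in redirects:
        current = redirects[current]
        if current in seen:
            # We came back to a URL already visited: the chain never resolves.
            return current, "loop"
        seen.append(current)
        hops = len(seen) - 1
        if hops > max_hops:
            return current, "too_many_hops"
    return current, "ok"
```

Running it over every redirecting URL in a crawl gives a short list of chains to flatten into a single hop.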
How to Audit Crawlability
A repeatable playbook with the exact tool names and reports.
- Google Search Console > Pages. The buckets to inspect: Discovered – currently not indexed (Google found the URL but hasn't crawled it — usually a quality or budget signal), Crawled – currently not indexed (crawled but Google chose not to index), Blocked by robots.txt, Page with redirect, Soft 404, Server error (5xx).
- Google Search Console > Settings > Crawl Stats. Shows crawl requests per day, average response time, and the host status. A response time spike or host status "Has problems" is the early warning.
- URL Inspection tool. Paste any URL — GSC tells you if it's indexed, the canonical Google chose, and the rendered HTML Googlebot saw. Run Test live URL to see the current state.
- Screaming Frog SEO Spider. Crawl your domain with Respect robots.txt off and on, compare. Check the Response Codes tab for 4xx/5xx, the Directives tab for noindex pages, and the Orphan URLs report (after connecting a sitemap and GA).
- Server logs. Filter by user-agent Googlebot, GPTBot, ClaudeBot. Pages bots never visit are invisible regardless of what other tools say.
- Greadme's crawler. Multi-page audit that flags blocked URLs, broken internal links, JS-only content, and missing sitemap entries in one pass.
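The server-log step can be sketched in a few lines. The example below assumes access logs in the common "combined" format (request line in quotes, user agent as the last quoted field); the regular expression and function names are illustrative and may need adjusting for your server. It matches on the user-agent string only, which can be spoofed, so it does not verify that a hit really came from the named bot.

```python
import re

BOTS = ("Googlebot", "GPTBot", "ClaudeBot")

# Assumes combined log format: ... "GET /path HTTP/1.1" 200 123 "referer" "user agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} .*"(?P<ua>[^"]*)"\s*$'
)


def bot_hits(log_lines) -> dict:
    """Map each bot name to the set of paths it requested."""
    hits = {bot: set() for bot in BOTS}
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        user_agent = match.group("ua").lower()
        for bot in BOTS:
            if bot.lower() in user_agent:
                hits[bot].add(match.group("path"))
    return hits


def never_visited(known_paths: set, hits: dict, bot: str) -> set:
    """Paths you expect to be crawled that the given bot never requested."""
    return known_paths - hits[bot]
```

Comparing never_visited against the paths in your sitemap gives a concrete list of pages a given bot has not fetched in the log window.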
Crawl Budget (Only Read This If You Have ≥ 10,000 URLs)
Crawl budget is the number of URLs Google is willing to crawl on your site in a given period. It's set by two factors: crawl capacity (how much load your server can take) and crawl demand (how much Google wants your content). For sites under ~10,000 URLs, this is irrelevant: Google can comfortably crawl the whole site. For large sites, the levers are: kill duplicate URLs (parameter URLs, filter combinations), return correct status codes (don't serve 200 on dead pages), keep the server fast, and prioritize high-value URLs in the sitemap.
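To estimate how many crawlable URLs are really duplicates created by parameters, one approach is to strip the parameters that only reorder or filter a listing and group what remains. The sketch below assumes sort and filter are such parameters on your site (a hypothetical choice; substitute your own), and the function names are illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters change presentation, not the underlying content.
IGNORED_PARAMS = {"sort", "filter"}


def canonical_form(url: str) -> str:
    """Drop ignored query parameters and sort the rest for stable comparison."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    kept.sort()
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def duplicate_groups(urls) -> dict:
    """Group URLs that collapse to the same canonical form (groups of 2+ only)."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_form(url), []).append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}
```

Large groups point at the parameter combinations worth blocking in robots.txt or consolidating with canonical tags.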
FAQ
How do I know if a page is crawlable?
Paste the URL into Google Search Console's URL Inspection tool and click Test live URL. It tells you if Googlebot can fetch and render the page right now.
What's the difference between crawlable and indexable?
Crawlable = the bot can fetch the URL. Indexable = Google is allowed to and chooses to add it to the index. noindex blocks indexing without blocking crawling.
Should I block AI crawlers like GPTBot?
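That's a content-licensing decision, not an SEO one. Blocking GPTBot does not affect Google rankings, but it opts your content out of OpenAI's model training. GPTBot is OpenAI's training crawler; citations in ChatGPT search are fetched by separate user agents (OAI-SearchBot, ChatGPT-User), so blocking GPTBot alone does not remove you from them. Many publishers leave it open for visibility.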
Why does Google report "Discovered – currently not indexed"?
Google knows the URL exists (from a sitemap or link) but has chosen not to crawl it yet. Usually means quality or authority is low, or crawl budget is constrained. Improve internal linking and content quality.
Does robots.txt stop a page from being indexed?
No. robots.txt stops crawling, not indexing. A URL blocked in robots.txt can still appear in search results (without snippet) if linked externally. To prevent indexing, use <meta name="robots" content="noindex"> — and the page must be crawlable for Google to see that tag.
Do I need a sitemap if my internal linking is good?
For small sites, no. For sites over a few hundred URLs, yes — it speeds up discovery of new content and surfaces orphan or weakly-linked pages.
How often does Googlebot crawl my site?
Varies from minutes (news sites, frequent updates) to weeks (small static sites). Check Crawl Stats in GSC for actuals. Frequency follows quality and update rate.
Can I force Google to recrawl a page?
Use the URL Inspection tool and click Request indexing. Quota is limited (~10/day). For bulk, resubmit the sitemap.
Conclusion
Crawlability is the unglamorous foundation of SEO and AI visibility. Get the basics right: every important URL is internally linked, listed in sitemap.xml, returns 200 OK, contains its content in raw HTML, and is not accidentally blocked in robots.txt. Audit with Search Console's Pages report and Screaming Frog. Everything else — keywords, schema, backlinks — only pays off once crawlers can actually reach the page.
