Crawlability is whether automated bots — Googlebot, Bingbot, GPTBot (ChatGPT), ClaudeBot (Anthropic), PerplexityBot — can discover and fetch the URLs on your site. It is the prerequisite for indexing, which is the prerequisite for ranking. If a bot cannot reach a URL, the page does not exist as far as that engine is concerned. Crawlability is decided by four things: internal links, your robots.txt, the HTTP status the server returns, and whether your content is rendered in HTML or only after JavaScript executes.
Crawlability and indexability are separate conditions: a page can be crawlable but kept out of the index by noindex, or indexable in theory but never crawled because nothing links to it.

robots.txt is the #1 self-inflicted wound. A single Disallow: / left over from staging can take an entire site out of Google's results. Always check yours at /robots.txt.

Six requirements must be met for a URL to be reliably crawled by Google and modern AI search bots.
| Requirement | What it means | How to verify |
|---|---|---|
| Discoverable | Linked from at least one other crawlable page or in sitemap.xml | Screaming Frog "Orphan URLs" report; GSC Pages report |
| Allowed by robots.txt | Not matched by a Disallow: rule for the relevant user-agent | GSC robots.txt report; curl https://site/robots.txt |
| Returns 200 OK | HTTP 200, not 4xx, 5xx, or a soft 404 | curl -I https://site/page; GSC URL Inspection |
| HTML contains content | Main content present in initial HTML, not only after JS hydration | View source (Cmd+U); curl https://site/page \| grep "keyword" |
| No noindex meta | If you want it indexed, no <meta name="robots" content="noindex"> | View source; GSC URL Inspection |
| Reasonable response time | Server responds in < 1s; doesn't time out under bot load | GSC Crawl Stats report; server logs |
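Two of the requirements in the table, the status code and the absence of a noindex meta tag, can be checked mechanically. The following is a minimal sketch that assumes the page has already been fetched; `check_page` and `RobotsMetaParser` are hypothetical names introduced for illustration. Soft 404s still need a manual look, because they return 200.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def check_page(status, html):
    """Return a list of crawlability problems for one fetched page."""
    problems = []
    if status != 200:
        problems.append(f"status {status}, expected 200")
    parser = RobotsMetaParser()
    parser.feed(html)
    if "noindex" in parser.directives:
        problems.append("noindex present")
    return problems


# Made-up sample page, not taken from a real site.
sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(check_page(200, sample))  # ['noindex present']
```

An empty list means the page passed both checks; the other rows of the table (discoverability, robots.txt, response time) need their own tools.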
There are three discovery channels. A crawlable site uses all three.
Internal links: the crawler parses each page it fetches, extracts every <a href>, and queues those URLs. Pages that are not linked are not found. Navigation, footers, contextual links inside articles, and breadcrumbs all count.

XML sitemap: a file at /sitemap.xml, submitted in Google Search Console and referenced from robots.txt. This is your "here's everything" list, especially important for large sites or pages with weak internal linking.

robots.txt lives at the root (https://example.com/robots.txt) and tells bots what they may fetch. It is a request, not a security mechanism: pages disallowed in robots.txt can still be indexed if linked externally. For full guidance see the complete robots.txt guide.
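Whether a rule set lets a given bot fetch a given URL can be checked offline before deploying. This is a minimal sketch using Python's standard `urllib.robotparser`; the `RULES` text and the `allowed` helper are illustrative assumptions, not taken from any real site. This parser applies the first matching rule and treats `*` inside a path literally, so its answers can differ from Google's for wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: block GPTBot everywhere, block /admin/ for everyone else.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
Allow: /
"""


def allowed(rules, user_agent, url):
    """Return True if the robots.txt text in `rules` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(RULES, "GPTBot", "https://example.com/blog/post"))      # False
print(allowed(RULES, "Googlebot", "https://example.com/blog/post"))   # True
print(allowed(RULES, "Googlebot", "https://example.com/admin/users")) # False

# A leftover staging file blocks the homepage for every bot.
STAGING = "User-agent: *\nDisallow: /\n"
print(allowed(STAGING, "Googlebot", "https://example.com/"))          # False
```

Running the last check against the file about to be deployed is one way to catch a leftover Disallow: / before release.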
```
# 1. Standard: allow everything, point at sitemap
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml

# 2. Block admin areas only
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Allow: /
Sitemap: https://example.com/sitemap.xml

# 3. Block AI crawlers but allow Google
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: *
Allow: /

# 4. Block search-result and faceted URLs
User-agent: *
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=

# 5. THE KILLER: never deploy this to production
User-agent: *
Disallow: /
```

Modern crawlers fall into two camps: those that execute JavaScript and those that don't. This determines whether a single-page React/Vue/Angular app is visible to them. See the client-side rendering SEO guide for the full picture.
| Crawler | Executes JavaScript? | Implication |
|---|---|---|
| Googlebot | Yes (headless Chromium, two-pass) | JS sites can rank, but with render delay; SSR is still preferred |
| Bingbot | Partial (Edge-based) | Less reliable than Googlebot for heavy JS |
| GPTBot (OpenAI / ChatGPT) | No | Sees raw HTML only — CSR sites are invisible |
| ClaudeBot (Anthropic) | No | Same — needs HTML-rendered content |
| PerplexityBot | No | Same |
The fix: server-side render (SSR) or statically generate (SSG) the pages you want crawled. In Next.js this is the default for the App Router. Avoid hiding primary content behind useEffect fetches.
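To see what a non-JavaScript crawler sees, look for a key phrase in the raw HTML while ignoring anything inside script tags. The sketch below works on HTML strings that have already been fetched; `VisibleText`, `visible_without_js` and both sample pages are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style>, i.e. text readable without JS."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_without_js(raw_html, phrase):
    """Return True if `phrase` appears in the text of the unrendered HTML."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    # Collapse whitespace so line breaks in the markup do not hide a match.
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()


# Server-rendered page: the heading is in the HTML itself.
SSR_HTML = '<html><body><div id="root"><h1>Crawlability guide</h1></div></body></html>'
# Client-rendered page: the heading only exists inside a script.
CSR_HTML = '<html><body><div id="root"></div><script>render("Crawlability guide")</script></body></html>'

print(visible_without_js(SSR_HTML, "crawlability guide"))  # True
print(visible_without_js(CSR_HTML, "crawlability guide"))  # False
```

A False result for a page's main heading or first paragraph suggests the content is invisible to the crawlers in the table that do not execute JavaScript.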
Staging robots.txt deployed to production. Disallow: / survives a release and the entire site disappears. Always diff robots.txt before deploy.

Blocking CSS and JavaScript in robots.txt. Google needs to render the page to evaluate it. Blocking /static/ or /_next/ breaks rendering.

A repeatable playbook with the exact tool names and reports.
Check server logs for Googlebot, GPTBot, ClaudeBot. Pages bots never visit are invisible regardless of what other tools say.

Crawl budget is the number of URLs Google is willing to crawl on your site in a given period. It's set by two factors: crawl capacity (how much load your server can take) and crawl demand (how much Google wants your content). For sites under ~10,000 URLs, this is rarely a limiting factor: Google can typically crawl the whole site without difficulty. For large sites, the levers are: kill duplicate URLs (parameter URLs, filter combinations), return correct status codes (don't serve 200 on dead pages), keep the server fast, and prioritize high-value URLs in the sitemap.
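Server logs show both which bots visit and where their requests go, including parameter URLs that waste crawl budget. The sketch below assumes access logs in the common "combined" format and matches bots by user-agent substring only (user-agent strings can be spoofed). `paths_by_bot`, `never_visited` and the sample lines are hypothetical, with made-up data and documentation-range IP addresses.

```python
import re
from collections import Counter

# Bot tokens named in this guide; matched case-insensitively in the user-agent.
BOTS = ("Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "PerplexityBot")

# Combined log format: "METHOD path protocol" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def paths_by_bot(lines):
    """Map each bot seen in the log to a Counter of the paths it requested."""
    result = {}
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        for bot in BOTS:
            if bot.lower() in agent:
                result.setdefault(bot, Counter())[match.group("path")] += 1
                break
    return result


def never_visited(lines, bot, important_paths):
    """Return the important paths that `bot` never requested (query strings ignored)."""
    seen = {path.split("?", 1)[0] for path in paths_by_bot(lines).get(bot, {})}
    return sorted(set(important_paths) - seen)


SAMPLE = [
    '192.0.2.1 - - [10/Jan/2025:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [10/Jan/2025:10:00:05 +0000] "GET /products?sort=price HTTP/1.1" 200 901 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 - - [10/Jan/2025:10:01:00 +0000] "GET /blog/post HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '192.0.2.3 - - [10/Jan/2025:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 777 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

print(never_visited(SAMPLE, "Googlebot", ["/blog/post", "/pricing"]))  # ['/pricing']
```

In this sample, the Googlebot request for /products?sort=price is the kind of parameter URL the crawl-budget levers above aim to eliminate.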
Paste the URL into Google Search Console's URL Inspection tool and click Test live URL. It tells you if Googlebot can fetch and render the page right now.
Crawlable = the bot can fetch the URL. Indexable = Google is allowed to and chooses to add it to the index. noindex blocks indexing without blocking crawling.
Whether to block AI crawlers such as GPTBot is a content-licensing decision, not an SEO one. Blocking GPTBot does not affect Google rankings, but it removes your content from ChatGPT's training and citations. Many sites leave it open for visibility.
A "Discovered - currently not indexed" status in Search Console means Google knows the URL exists (from a sitemap or link) but has chosen not to crawl it yet. It usually means quality or authority is low, or crawl budget is constrained. Improve internal linking and content quality.
Does robots.txt stop a page from being indexed?

No. robots.txt stops crawling, not indexing. A URL blocked in robots.txt can still appear in search results (without a snippet) if linked externally. To prevent indexing, use <meta name="robots" content="noindex">, and keep the page crawlable so that Google can see that tag.
A sitemap is optional for small sites. For sites over a few hundred URLs it is worth having: it speeds up discovery of new content and surfaces orphan or weakly-linked pages.
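One way a sitemap surfaces orphan pages is by comparison: any URL listed in the sitemap that no crawled page links to is an orphan candidate. The sketch below assumes the set of internally linked URLs is already available, for example from a crawler export; `sitemap_urls`, `orphans` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap.xml document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def orphans(xml_text, linked_urls):
    """Return sitemap URLs that no internal link points to, sorted."""
    return sorted(sitemap_urls(xml_text) - set(linked_urls))


# Made-up sitemap and link set for illustration.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post</loc></url>
  <url><loc>https://example.com/old-landing-page</loc></url>
</urlset>
"""
LINKED = {"https://example.com/", "https://example.com/blog/post"}

print(orphans(SITEMAP, LINKED))  # ['https://example.com/old-landing-page']
```

Each URL returned is a candidate for a new internal link or, if the page is obsolete, for removal from the sitemap.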
Crawl frequency varies from minutes (news sites, frequent updates) to weeks (small static sites). Check Crawl Stats in GSC for the actual figures. Frequency follows quality and update rate.
To ask Google to recrawl a single URL, use the URL Inspection tool and click Request indexing. The quota is limited (~10/day). For bulk changes, resubmit the sitemap.
Crawlability is the unglamorous foundation of SEO and AI visibility. Get the basics right: every important URL is internally linked, listed in sitemap.xml, returns 200 OK, contains its content in raw HTML, and is not accidentally blocked in robots.txt. Audit with Search Console's Pages report and Screaming Frog. Everything else — keywords, schema, backlinks — only pays off once crawlers can actually reach the page.