Crawl Scan is a Greadme feature that systematically discovers and audits every page on your website by following internal links — the same way a search engine bot does. You enter one URL; it returns a site-wide report covering broken links, missing alt text, meta tag gaps, heading-structure issues, and template-level problems that only appear when you analyze the whole site.
The crawler runs as a four-step pipeline. Understanding it helps you interpret results and configure scans for maximum value.
You provide a starting URL — typically your homepage — and configure the crawl settings. The crawler fetches that page and extracts every internal link it finds.
The crawler follows each discovered internal link, visiting new pages and finding more links. This recursive process continues until it reaches the configured page limit or has visited every discoverable page. No sitemap is required — the crawler systematically follows links to find content.
As the crawler visits each page, it runs checks on the content, meta tags, headings, images, and outbound links. Every issue is recorded with its severity, location, and type.
Once the crawl finishes, all findings are aggregated into a comprehensive report showing site-wide patterns, issue counts, and individual page details.
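The discovery step described above can be sketched as a breadth-first traversal with a page cap. This is a minimal illustration, not Greadme's actual implementation: the `crawl` function, the `get_links` callback, and the in-memory `SITE` map are all hypothetical stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque

def crawl(start, get_links, max_pages=100):
    """Breadth-first discovery of internal pages, capped at max_pages."""
    visited = []            # pages in the order they were fetched
    seen = {start}          # URLs already queued, so nothing is fetched twice
    queue = deque([start])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical in-memory site standing in for real HTTP fetches.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/blog", "/about"],
}

def site_links(url):
    return SITE.get(url, [])

pages = crawl("/", site_links, max_pages=4)
```

With a cap of 4, the traversal stops before reaching `/blog/post-2`, which mirrors how a low page limit can leave deeper pages out of a report.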
Before crawling, the scanner reads your site's robots.txt file and respects its directives. If certain paths are disallowed, the crawler skips them — just like Googlebot would. The results include a full robots.txt analysis showing what restrictions are in place. For more on configuring this file correctly, see our guide on setting up robots.txt.
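To see how robots.txt directives translate into allow or skip decisions, here is a small sketch using Python's standard `urllib.robotparser`. The sample rules and the example.com URLs are assumptions made for illustration; only the `GreadmeBot` name comes from this guide, and the code does not claim to reproduce how Crawl Scan evaluates the file internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

def load_rules(robots_text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

AGENT = "GreadmeBot"
rules = load_rules(ROBOTS_TXT)

allowed = rules.can_fetch(AGENT, "https://example.com/blog/")        # path not disallowed
blocked = rules.can_fetch(AGENT, "https://example.com/admin/users")  # under /admin/
delay = rules.crawl_delay(AGENT)                                     # seconds between requests
```

Running your own robots.txt through a check like this is a quick way to confirm that the paths you care about are not disallowed for crawlers.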
Each discovered page is evaluated against a set of SEO and content quality checks. The table below summarizes every category and why each check matters.
| Category | Specific check | Why it matters |
|---|---|---|
| Images | Missing alt text | Inaccessible to screen readers and invisible to search engines |
| | Generic alt text (e.g. "image1.jpg") | Provides no accessibility or SEO value |
| | Outdated image format (PNG/JPEG vs WebP) | Modern formats deliver significant file-size savings and faster page loads |
| Meta tags | Missing or incorrect title tag | The single most important on-page SEO element — appears in SERPs, tabs, and shares |
| | Missing or incorrect meta description | Without it you lose control of your search-result snippet |
| | Missing canonical tag | Search engines may index duplicate URLs and dilute SEO authority |
| Social tags | Missing Open Graph (og:title, og:description, og:image) | Shares on Facebook/LinkedIn show generic, unappealing previews |
| | Missing Twitter Card tags | X/Twitter shares lose engaging previews and click-through |
| Headings & content | Heading hierarchy (missing H1, multiple H1s, skipped levels) | Affects both how search engines parse the page and how assistive technology builds the accessibility tree |
| | Content length / thin content | Very-low-content pages can hurt rankings |
| Link health | Broken internal links (404 pages) | Detects 404s and reports the "linked from" pages so you can fix the source |
For a deeper look at fixing 404s specifically, see our guide on how to find and fix broken links.
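A few of the per-page checks from the table above can be approximated with Python's standard `html.parser`. The sketch below is illustrative only: the `PageAudit` class, the `audit` function, and the issue strings are hypothetical, and it covers just missing alt attributes, the title tag, the meta description, and heading hierarchy.

```python
from html.parser import HTMLParser

class PageAudit(HTMLParser):
    """Collects the raw facts needed for a handful of on-page checks."""

    def __init__(self):
        super().__init__()
        self.issues = []
        self.headings = []          # heading levels in document order
        self.has_title = False
        self.has_description = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            # An empty alt="" is valid for decorative images, so only a
            # completely absent attribute is flagged here.
            self.issues.append(f"missing alt text: {attrs.get('src', '?')}")
        elif tag == "title":
            self.has_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.has_description = True
        elif len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.headings.append(int(tag[1]))

def audit(html):
    page = PageAudit()
    page.feed(html)
    page.close()
    issues = list(page.issues)
    if not page.has_title:
        issues.append("missing title tag")
    if not page.has_description:
        issues.append("missing meta description")
    h1_count = page.headings.count(1)
    if h1_count == 0:
        issues.append("missing H1")
    elif h1_count > 1:
        issues.append("multiple H1s")
    # A jump of more than one level (e.g. H1 straight to H3) is a skipped level.
    for prev, cur in zip(page.headings, page.headings[1:]):
        if cur > prev + 1:
            issues.append(f"skipped heading level: H{prev} to H{cur}")
    return issues

SAMPLE = """<html><head><title>Home</title></head><body>
<h1>Welcome</h1><h3>Details</h3>
<img src="hero.png"><img src="logo.png" alt="Company logo">
</body></html>"""

issues = audit(SAMPLE)
```

For the sample page this reports the image without an alt attribute, the missing meta description, and the jump from H1 to H3.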
The results are designed to give you both a bird's-eye view of your site's health and the ability to drill down into specific issues on specific pages.
At the top of your results, six key metrics summarize the entire crawl:
The results include a dedicated section for your robots.txt configuration, showing:
It's surprisingly common for sites to accidentally block important content in robots.txt. If Crawl Scan flags critical paths as disallowed, review your robots.txt to make sure you're not unintentionally hiding content from search engines.
With potentially hundreds of pages in your results, filtering is essential. The results include:
Clicking any page opens a detail view with:
When the crawler discovers a 404, it doesn't just tell you the page is broken — it tells you which other pages link to it. Fixing a 404 isn't about the dead page itself (it's already gone); it's about updating every link that points to it. This data saves the detective work of tracking down broken-link sources manually.
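The "linked from" idea amounts to inverting the link graph for URLs that returned a 404. Here is a minimal sketch under assumed inputs: `LINKS` (page to outbound internal links) and `STATUS` (URL to HTTP status) are hypothetical structures, not the format Crawl Scan uses.

```python
from collections import defaultdict

def linked_from(links, status):
    """Map each broken URL to the sorted list of pages that link to it."""
    sources = defaultdict(set)
    for page, targets in links.items():
        for target in targets:
            if status.get(target) == 404:
                sources[target].add(page)
    return {url: sorted(pages) for url, pages in sources.items()}

# Hypothetical crawl data.
LINKS = {
    "/": ["/pricing", "/old-offer"],
    "/blog": ["/old-offer"],
    "/pricing": ["/team"],
}
STATUS = {"/": 200, "/blog": 200, "/pricing": 200, "/old-offer": 404, "/team": 404}

report = linked_from(LINKS, STATUS)

# Broken URLs ordered by how many pages point at them, most-linked first.
worst_first = sorted(report, key=lambda url: len(report[url]), reverse=True)
```

Sorting by the number of source pages puts the 404s that affect the most pages at the top of the fix list.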
Crawl Scan offers three configuration options that let you tailor the analysis to your site:
Control how many pages the crawler will analyze. Options range from 50 to 500. Smaller sites need a lower limit; larger sites should bump the cap to ensure coverage. Start at 100 if you're unsure — you can always re-run with a higher limit.
Crawl depth determines how many link-clicks deep the crawler will go from your starting page. A depth of 3 means the crawler follows links up to three levels from the starting URL — usually enough to discover most pages on a well-structured site.
Choose whether the crawler should also follow links to subdomains (like blog.example.com or shop.example.com). Enable this if your site uses subdomains for sections you want included in the audit.
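The depth and subdomain settings can be pictured as two small filters applied during discovery. The functions below are a hedged sketch: `in_scope`, `pages_within_depth`, and the `CHAIN` data are hypothetical, and treating `www.example.com` as a subdomain of `example.com` is an assumption of this example rather than documented Crawl Scan behavior.

```python
from collections import deque
from urllib.parse import urlsplit

def in_scope(url, site_host, include_subdomains=False):
    """Return True if url belongs to site_host (optionally to its subdomains)."""
    host = (urlsplit(url).hostname or "").lower()
    if host == site_host:
        return True
    return include_subdomains and host.endswith("." + site_host)

def pages_within_depth(start, get_links, max_depth=3):
    """Return {url: click depth} for every page reachable within max_depth clicks."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue                     # do not follow links beyond the limit
        for link in get_links(url):
            if link not in depth:
                depth[link] = depth[url] + 1
                queue.append(link)
    return depth

# Hypothetical site where each page links only to the next one.
CHAIN = {"/": ["/a"], "/a": ["/b"], "/b": ["/c"], "/c": ["/d"]}
reachable = pages_within_depth("/", lambda url: CHAIN.get(url, []), max_depth=3)
```

In this chain, `/d` sits four clicks from the start page, so a depth of 3 never reaches it. Pages buried that deep are a hint that internal linking could be flatter.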
Certain patterns only become visible when you analyze an entire site rather than individual pages. Here are the most common site-wide issues Crawl Scan uncovers.
Pattern: the same issue appears on dozens or hundreds of pages.
What it means: when you see the same issue (missing OG tags, duplicate H1 patterns) across many pages, it's usually caused by a shared template or layout component, not by individual page content. Fixing the template fixes every affected page at once.
Example: 150 of 200 pages are missing twitter:image tags because the base template doesn't include Twitter Card meta.
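One way to spot a template-level problem in exported results is to count how many pages share each issue and flag anything above a chosen share of the site. The function name, the 50% threshold, and the sample data below are assumptions for illustration only.

```python
from collections import Counter

def template_level_issues(page_issues, threshold=0.5):
    """Return {issue: page count} for issues found on at least `threshold` of pages."""
    total = len(page_issues)
    if total == 0:
        return {}
    # set() ensures an issue is counted once per page, even if it repeats there.
    counts = Counter(issue for issues in page_issues.values() for issue in set(issues))
    return {issue: n for issue, n in counts.items() if n / total >= threshold}

# Hypothetical per-page findings.
PAGE_ISSUES = {
    "/": ["missing twitter:image"],
    "/about": ["missing twitter:image", "missing alt text"],
    "/blog": ["missing twitter:image"],
    "/contact": [],
}

suspects = template_level_issues(PAGE_ISSUES)
```

Here the Twitter Card issue appears on three of four pages and is flagged, while the single missing alt text is treated as a page-level fix.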
Pattern: important pages that aren't linked from anywhere.
What it means: if the crawler can't find a page by following links, search engines probably can't either. Pages that exist but aren't linked from your navigation or content are effectively invisible.
How to detect: compare the pages the crawler found with your sitemap or CMS page list. Pages in the CMS that weren't found during the crawl are likely orphaned.
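The comparison described above is a set difference between the URLs you know exist and the URLs the crawl reached. A minimal sketch, assuming you already have both lists; the `normalize` rule (ignoring trailing slashes) is a simplification chosen for this example, and the URLs are made up.

```python
def normalize(url):
    """Treat URLs that differ only by a trailing slash as the same page."""
    return url.rstrip("/") or "/"

def find_orphans(known_urls, crawled_urls):
    """Return known pages (from a sitemap or CMS export) that the crawl never reached."""
    crawled = {normalize(url) for url in crawled_urls}
    return sorted({normalize(url) for url in known_urls} - crawled)

# Hypothetical inputs: a CMS page list and the URLs found during a crawl.
KNOWN = [
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/landing/spring-sale/",
]
CRAWLED = ["https://example.com", "https://example.com/pricing/"]

orphans = find_orphans(KNOWN, CRAWLED)
```

Note that a page can also be missing from the crawl because the page limit or depth setting cut discovery short, so it is worth confirming those settings before treating every result as a true orphan.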
Pattern: multiple pages linking to the same 404 URL.
What it means: when a page is deleted or its URL changes without a redirect, every page that linked to it now has a broken link. The crawler's "linked from" data reveals these chains.
Priority: fix 404s with the most "linked from" pages first — they cause the most broken user experiences.
Pattern: newer pages are well-optimized while older pages have many issues.
What it means: as teams learn and improve their practices, newer content is often better optimized — but older content doesn't improve on its own. The crawl reveals which sections of your site have been left behind.
For larger sites, working with results inside the browser may not be enough. Crawl Scan supports exporting your complete results to CSV with full Unicode support for international content. The export includes every page URL with HTTP status code, page titles, issue counts and types per page, and detailed issue descriptions.
You can also generate shareable links to your crawl results, allowing teammates or clients to browse the full interactive report without needing a Greadme account.
Run your first crawl with a generous page limit to understand the overall scope. Then use the issue-type filters to focus on one category at a time — fix all missing alt text first, then meta descriptions, then heading structure. This is more efficient than fixing pages one by one.
If you see the same issue across many pages, find the shared template or component causing it. A single template fix can resolve issues on hundreds of pages at once. Always look for patterns before diving into individual fixes.
Websites change constantly — new content is added, old pages are deleted, plugins update, templates evolve. Monthly crawls catch new issues early before they compound.
The most effective workflow combines both scan types: use Crawl Scan to identify which pages have issues, then run Deep Scan on your most important pages to get the full 100+ parameter analysis including performance metrics, schema validation, and AI-powered recommendations.
Broken pages hurt both user experience and SEO. Use the "linked from" data to fix them systematically — restore the content, set up redirects, or update the source links.
Greadme's crawler is designed to behave responsibly toward the sites it analyzes:
If your firewall or bot protection blocks the crawl, you can allowlist GreadmeBot in your WAF or firewall rules. The crawler uses a clearly identifiable user-agent string, so it's easy to distinguish from malicious bots. See the Greadme bot documentation page for specific allowlisting instructions.
Most crawls finish in a few minutes. Time scales with the number of pages and your server's response time — a 100-page crawl typically completes in 1–3 minutes, while a 500-page crawl on a slower server may take 10–15 minutes. The total crawl time is reported in your summary statistics.
No. Crawl Scan discovers pages by following internal links, just like a search engine bot. A sitemap can help search engines, but it's not required for Crawl Scan to find your content. If a page isn't reachable through internal links, it's an orphan — and that's a finding worth knowing about.
Crawl Scan analyzes many pages broadly to surface site-wide and template-level issues. Deep Scan analyzes a single page in depth, running 190+ parameters including Lighthouse performance metrics, full schema validation, and AI-powered recommendations. Use Crawl Scan to find which pages need attention; use Deep Scan to fix your highest-value pages thoroughly.
It shouldn't. The crawler analyzes pages in controlled batches and honors any crawl-delay set in your robots.txt. For most production sites the load is comparable to a routine search-engine crawl. If your server is unusually small or under load, lower the max-pages setting or set a crawl-delay in robots.txt.
Crawl Scan analyzes the HTML returned by the server. If your site relies heavily on client-side rendering and ships a near-empty initial HTML, important content and links may not be visible to the crawler — which is the same problem Googlebot has historically had with such sites. Server-side rendering or static generation gives the crawler (and search engines) the most complete picture.
The current per-crawl cap is 500 pages. For very large sites, run multiple crawls starting from different sections (blog, product hub, docs root, etc.) using crawl depth and include-subdomains to control scope. Combine the exported CSVs to build a full picture, and use the patterns surfaced in each batch to prioritize template-level fixes that ripple across thousands of pages at once.
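Combining several exports can be as simple as merging rows keyed by URL so that pages reached by more than one crawl appear once. The sketch below uses Python's standard `csv` module. The column names (`url`, `status`, `issues`) are hypothetical, so check the header row of your own export before adapting it.

```python
import csv
import io

def merge_exports(sources, key="url"):
    """Merge CSV exports, keeping one row per URL (later files win on duplicates)."""
    merged = {}
    for source in sources:
        for row in csv.DictReader(source):
            merged[row[key]] = row
    return list(merged.values())

# Hypothetical exports from two crawls that both reached the homepage.
BLOG_CSV = "url,status,issues\nhttps://example.com/blog,200,2\nhttps://example.com/,200,0\n"
DOCS_CSV = "url,status,issues\nhttps://example.com/docs,200,1\nhttps://example.com/,200,0\n"

rows = merge_exports([io.StringIO(BLOG_CSV), io.StringIO(DOCS_CSV)])
urls = [row["url"] for row in rows]
```

With real files, pass handles opened as `open(path, newline="", encoding="utf-8-sig")`, which also tolerates a byte-order mark if the export includes one.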
Crawl Scan respects robots.txt, so the crawler will only visit paths the site owner allows for bots. It will identify itself as GreadmeBot. Crawling a competitor's public pages for research is generally permitted, but you should review the target site's terms of service and avoid running aggressive or repeated scans against sites you don't own.
Monthly is a good baseline for most sites. Re-crawl after major releases, template changes, content migrations, or domain moves. A broken link that exists for a week is a minor inconvenience; one that persists for six months is an SEO problem.
Site-wide problems stay hidden when you only audit one page at a time. Crawl Scan replaces guesswork with a complete map of your site — every discoverable page checked for the SEO, accessibility, and link-health issues that compound over time. Use it to surface template-level patterns and broken-link chains, then drill into specific pages with Deep Scan to fix what matters most.