Crawl Scan in Greadme: Site-Wide SEO Audit Tool
What is Crawl Scan?
Crawl Scan is a Greadme feature that systematically discovers and audits every page on your website by following internal links — the same way a search engine bot does. You enter one URL; it returns a site-wide report covering broken links, missing alt text, meta tag gaps, heading-structure issues, and template-level problems that only appear when you analyze the whole site.
Key Facts (TL;DR)
- Auto-discovers pages by following internal links — no sitemap required
- Respects robots.txt and identifies as GreadmeBot for easy allowlisting
- Configurable: max pages 50–500, crawl depth, optional include-subdomains
- Per-page checks: image alt text, title/description/canonical/OG/Twitter tags, heading hierarchy, content length, broken internal links
- "Linked from" data on every 404 — find every page pointing to a dead URL
- CSV export with full Unicode support and shareable result links
How Crawl Scan works
The crawler runs as a four-step pipeline. Understanding it helps you interpret results and configure scans for maximum value.
Step 1: Starting the crawl
You provide a starting URL — typically your homepage — and configure the crawl settings. The crawler fetches that page and extracts every internal link it finds.
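The link-extraction step can be sketched with Python's standard library. This is a minimal illustration, not Greadme's implementation: the `LinkCollector` and `extract_internal_links` names are hypothetical, and "internal" is assumed here to mean "same host as the page being parsed".

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_internal_links(page_url, html):
    """Return absolute, fragment-free links that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        is_web_link = parsed.scheme in ("http", "https")
        if is_web_link and parsed.netloc == host and absolute not in links:
            links.append(absolute)
    return links


sample = (
    '<a href="/about">About</a> '
    '<a href="https://other.example/x">Elsewhere</a> '
    '<a href="/about#team">Team</a>'
)
print(extract_internal_links("https://example.com/", sample))
# ['https://example.com/about']
```

The external link is dropped, and the two links to `/about` collapse into one because the fragment is removed before de-duplication.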
Step 2: Discovering pages
The crawler follows each discovered internal link, visiting new pages and finding more links. This recursive process continues until it reaches the configured page limit or crawl depth, or has visited every discoverable page. No sitemap is required: the crawler finds content by following links.
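The discovery loop can be illustrated with a small breadth-first traversal. The traversal order Greadme actually uses is not documented here, so treat breadth-first as an assumption; `discover_pages` and the `get_links` callback are hypothetical names, and the toy `site` dictionary stands in for real HTTP fetches.

```python
from collections import deque


def discover_pages(start_url, get_links, max_pages):
    """Visit pages in the order they are found, stopping at max_pages."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# A toy site: each key links to the pages in its list.
site = {
    "/": ["/blog", "/about"],
    "/blog": ["/blog/post-1", "/"],
    "/about": [],
    "/blog/post-1": ["/blog"],
}
print(discover_pages("/", lambda url: site.get(url, []), max_pages=3))
# ['/', '/blog', '/about']
```

With `max_pages=3` the crawl stops before reaching `/blog/post-1`, which is how "total pages found" and "pages analyzed" can differ in a report.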
Step 3: Analyzing each page
As the crawler visits each page, it runs checks on the content, meta tags, headings, images, and internal links. Every issue is recorded with its severity, location, and type.
Step 4: Aggregating results
Once the crawl finishes, all findings are aggregated into a comprehensive report showing site-wide patterns, issue counts, and individual page details.
robots.txt awareness
Before crawling, the scanner reads your site's robots.txt file and respects its directives. If certain paths are disallowed, the crawler skips them — just like Googlebot would. The results include a full robots.txt analysis showing what restrictions are in place. For more on configuring this file correctly, see our guide on setting up robots.txt.
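To see how a robots.txt-aware crawler decides what to skip, you can reproduce the check with Python's built-in `urllib.robotparser`. The robots.txt content below is an invented example, and using `GreadmeBot` as the user-agent token is an assumption based on the name the crawler identifies with.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GreadmeBot", "https://example.com/blog/"))        # True
print(parser.can_fetch("GreadmeBot", "https://example.com/admin/users"))  # False
print(parser.crawl_delay("GreadmeBot"))                                   # 2
```

Running the same check against your live robots.txt is a quick way to confirm that a path flagged as disallowed really is blocked for all bots.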
What Crawl Scan checks per page
Each discovered page is evaluated against a set of SEO and content quality checks. The table below summarizes every category and why each check matters.
| Category | Specific check | Why it matters |
|---|---|---|
| Images | Missing alt text | Inaccessible to screen readers and invisible to search engines |
| Images | Generic alt text (e.g. "image1.jpg") | Provides no accessibility or SEO value |
| Images | Outdated image format (PNG/JPEG vs WebP) | Modern formats deliver significant file-size savings and faster page loads |
| Meta tags | Missing or incorrect title tag | The single most important on-page SEO element; it appears in SERPs, tabs, and shares |
| Meta tags | Missing or incorrect meta description | Without it you lose control of your search-result snippet |
| Meta tags | Missing canonical tag | Search engines may index duplicate URLs and dilute SEO authority |
| Social tags | Missing Open Graph (og:title, og:description, og:image) | Shares on Facebook/LinkedIn show generic, unappealing previews |
| Social tags | Missing Twitter Card tags | X/Twitter shares lose engaging previews and click-through |
| Headings & content | Heading hierarchy (missing H1, multiple H1s, skipped levels) | Affects both SEO crawling and the accessibility tree |
| Headings & content | Content length / thin content | Very-low-content pages can hurt rankings |
| Link health | Broken internal links (404 pages) | Detects 404s and reports the "linked from" pages so you can fix the source |
For a deeper look at fixing 404s specifically, see our guide on how to find and fix broken links.
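As an illustration of what a heading-hierarchy check involves, here is a minimal sketch covering the three cases named in the table: missing H1, multiple H1s, and skipped levels. The `heading_issues` function and its message strings are hypothetical; Greadme's actual rules and wording may differ.

```python
def heading_issues(levels):
    """Check heading levels given in document order (1 for H1, 2 for H2, ...)."""
    issues = []
    h1_count = levels.count(1)
    if h1_count == 0:
        issues.append("missing H1")
    elif h1_count > 1:
        issues.append("multiple H1s")
    # A heading may go at most one level deeper than the heading before it.
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            issues.append(f"skipped level: H{previous} -> H{current}")
    return issues


print(heading_issues([1, 2, 4, 2, 3]))  # ['skipped level: H2 -> H4']
print(heading_issues([2, 3]))           # ['missing H1']
```

Moving back up the hierarchy (H4 to H2) is not flagged, since only jumps to a deeper level can skip one.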
Reading your Crawl Scan results
The results are designed to give you both a bird's-eye view of your site's health and the ability to drill down into specific issues on specific pages.
Summary statistics
At the top of your results, six key metrics summarize the entire crawl:
- Total pages found — how many pages the crawler discovered by following links
- Pages analyzed — how many pages were checked (may differ from pages found if limits are reached)
- Total issues — the combined count of all errors and warnings
- 404 pages — broken or dead pages discovered
- Clean pages — pages with zero issues
- Crawl time — how long the analysis took, in seconds
robots.txt analysis
The results include a dedicated section for your robots.txt configuration, showing:
- Whether a robots.txt file was found
- Disallowed paths and directories
- Crawl-delay settings
- Any restrictions that may be preventing search engines from accessing important content
Check your robots.txt restrictions
It's surprisingly common for sites to accidentally block important content in robots.txt. If Crawl Scan flags critical paths as disallowed, review your robots.txt to make sure you're not unintentionally hiding content from search engines.
Filtering and sorting
With potentially hundreds of pages in your results, filtering is essential. The results include:
- Severity filters — view all pages, or filter by errors, warnings, passed checks, or 404 status
- Issue-type filters — focus on a single issue (e.g. "missing alt text") across every affected page
- URL search — search by URL or page title
- Sorting — sort by issue count, URL, or status code
Page detail view
Clicking any page opens a detail view with:
- All issues found on the page, organized by severity
- The complete heading structure (H1–H6 hierarchy)
- Image information with alt-text status and format details
- For 404 pages: a list of every page that links to this broken URL
The power of "linked from" data
When the crawler discovers a 404, it doesn't just tell you the page is broken — it tells you which other pages link to it. Fixing a 404 isn't about the dead page itself (it's already gone); it's about updating every link that points to it. This data saves the detective work of tracking down broken-link sources manually.
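Conceptually, "linked from" data is a reverse index: instead of mapping each page to the links it contains, it maps each broken URL to the pages that contain a link to it. The sketch below assumes the crawl has already produced two hypothetical structures, `links_by_page` and `status_by_url`; both names and the sample data are illustrative only.

```python
from collections import defaultdict


def build_linked_from(links_by_page, status_by_url):
    """Map every 404 URL to the list of pages that link to it."""
    linked_from = defaultdict(list)
    for source, targets in links_by_page.items():
        for target in targets:
            is_broken = status_by_url.get(target) == 404
            if is_broken and source not in linked_from[target]:
                linked_from[target].append(source)
    return dict(linked_from)


links_by_page = {
    "/": ["/pricing", "/old-offer"],
    "/blog": ["/old-offer", "/gone"],
    "/pricing": ["/old-offer"],
}
status_by_url = {
    "/": 200, "/blog": 200, "/pricing": 200, "/old-offer": 404, "/gone": 404,
}

report = build_linked_from(links_by_page, status_by_url)
# List the most-referenced broken URLs first.
for url, sources in sorted(report.items(), key=lambda item: len(item[1]), reverse=True):
    print(url, sources)
# /old-offer ['/', '/blog', '/pricing']
# /gone ['/blog']
```

Sorting by the number of source pages puts the 404s that affect the most pages at the top of the fix list.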
Configuring your crawl
Crawl Scan offers three configuration options that let you tailor the analysis to your site:
Maximum pages
Control how many pages the crawler will analyze. Options range from 50 to 500. Smaller sites can use a lower limit; larger sites should raise the cap to improve coverage. Start at 100 if you're unsure, since you can always re-run with a higher limit.
Crawl depth
Crawl depth determines how many link-clicks deep the crawler will go from your starting page. A depth of 3 means the crawler follows links up to three levels from the starting URL — usually enough to discover most pages on a well-structured site.
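Depth can be read as the smallest number of clicks needed to reach a page from the starting URL. The sketch below computes that distance over a toy link graph and stops expanding once the limit is reached. The `page_depths` function and the `site` data are hypothetical, and Greadme's exact depth accounting is not specified here.

```python
from collections import deque


def page_depths(start_url, site, max_depth):
    """Map each reachable page to its click distance from start_url."""
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # links on this page would exceed the depth limit
        for link in site.get(url, []):
            if link not in depths:
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths


site = {
    "/": ["/docs"],
    "/docs": ["/docs/api"],
    "/docs/api": ["/docs/api/v1"],
    "/docs/api/v1": ["/docs/api/v1/users"],
}
print(page_depths("/", site, max_depth=3))
# {'/': 0, '/docs': 1, '/docs/api': 2, '/docs/api/v1': 3}
```

The page at `/docs/api/v1/users` sits four clicks from the start, so a depth of 3 never reaches it. Pages that only appear deep in the structure are good candidates for better internal linking.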
Include subdomains
Choose whether the crawler should also follow links to subdomains (like blog.example.com or shop.example.com). Enable this if your site uses subdomains for sections you want included in the audit.
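The effect of this option can be expressed as a simple hostname test. This is a sketch under the assumption that "subdomain" means any host ending in the starting URL's hostname; the `in_scope` function is hypothetical, and real crawlers often handle edge cases such as a `www.` starting host differently.

```python
from urllib.parse import urlparse


def in_scope(url, start_url, include_subdomains):
    """Decide whether url belongs to the site being crawled."""
    host = (urlparse(url).hostname or "").lower()
    root = (urlparse(start_url).hostname or "").lower()
    if host == root:
        return True
    # The leading dot keeps look-alike domains such as notexample.com out.
    return include_subdomains and host.endswith("." + root)


start = "https://example.com/"
print(in_scope("https://blog.example.com/post", start, include_subdomains=True))   # True
print(in_scope("https://blog.example.com/post", start, include_subdomains=False))  # False
print(in_scope("https://notexample.com/", start, include_subdomains=True))         # False
```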
Common site-wide issues only crawls reveal
Certain patterns only become visible when you analyze an entire site rather than individual pages. Here are the most common site-wide issues Crawl Scan uncovers.
Template-level problems
Pattern: the same issue appears on dozens or hundreds of pages.
What it means: when you see the same issue (missing OG tags, duplicate H1 patterns) across many pages, it's usually caused by a shared template or layout component, not by individual page content. Fixing the template fixes every affected page at once.
Example: 150 of 200 pages are missing twitter:image tags because the base template doesn't include Twitter Card meta.
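One way to spot template-level problems in exported data is to count how many pages share each issue and flag anything above a chosen share of the site. The sketch below is illustrative: `template_level_issues` is a hypothetical helper, and the 50% threshold is an arbitrary assumption rather than a Greadme setting.

```python
from collections import Counter


def template_level_issues(issues_by_page, threshold=0.5):
    """Return issue types that affect at least `threshold` of all pages."""
    total = len(issues_by_page)
    # Count each issue type once per page, even if it occurs several times.
    counts = Counter(
        issue for issues in issues_by_page.values() for issue in set(issues)
    )
    return {
        issue: count
        for issue, count in counts.items()
        if total and count / total >= threshold
    }


issues_by_page = {
    "/": ["missing twitter:image"],
    "/blog": ["missing twitter:image", "multiple H1s"],
    "/pricing": ["missing twitter:image"],
    "/about": [],
}
print(template_level_issues(issues_by_page))
# {'missing twitter:image': 3}
```

An issue on three of four pages points to the shared layout, while the single "multiple H1s" hit is more likely a page-level mistake.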
Orphaned pages
Pattern: important pages that aren't linked from anywhere.
What it means: if the crawler can't find a page by following links, search engines probably can't either. Pages that exist but aren't linked from your navigation or content are effectively invisible.
How to detect: compare the pages the crawler found with your sitemap or CMS page list. Pages in the CMS that weren't found during the crawl are likely orphaned.
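The comparison itself is a set difference. A minimal sketch, assuming you have a list of known URLs (from a sitemap or CMS export) and a list of crawled URLs; `find_orphans` is a hypothetical helper, and the trailing-slash normalization is a simplification you may need to adapt to your URL scheme.

```python
def find_orphans(known_urls, crawled_urls):
    """Return known URLs that the crawl never reached, sorted alphabetically."""

    def normalize(url):
        # Treat /page and /page/ as the same URL.
        return url.rstrip("/") or "/"

    crawled = {normalize(url) for url in crawled_urls}
    return sorted(url for url in known_urls if normalize(url) not in crawled)


known = [
    "https://example.com/",
    "https://example.com/pricing",
    "https://example.com/landing/spring-sale/",
]
crawled = ["https://example.com", "https://example.com/pricing/"]
print(find_orphans(known, crawled))
# ['https://example.com/landing/spring-sale/']
```

Before treating a result as orphaned, check that it was not simply cut off by the page limit or crawl depth of that run.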
Broken-link chains
Pattern: multiple pages linking to the same 404 URL.
What it means: when a page is deleted or its URL changes without a redirect, every page that linked to it now has a broken link. The crawler's "linked from" data reveals these chains.
Priority: fix 404s with the most "linked from" pages first — they cause the most broken user experiences.
Inconsistent content quality
Pattern: newer pages are well-optimized while older pages have many issues.
What it means: as teams learn and improve their practices, newer content is often better optimized — but older content doesn't improve on its own. The crawl reveals which sections of your site have been left behind.
Exporting your crawl results
For larger sites, working with results inside the browser may not be enough. Crawl Scan supports exporting your complete results to CSV with full Unicode support for international content. The export includes every page URL with HTTP status code, page titles, issue counts and types per page, and detailed issue descriptions.
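For readers building their own reports from crawl data, the sketch below shows one common way to produce a Unicode-safe CSV in Python: encoding as UTF-8 with a byte-order mark so spreadsheet tools detect the encoding. The column names are hypothetical, and whether Greadme's export uses this exact encoding is an assumption.

```python
import csv
import io


def export_csv(rows):
    """Serialize page rows to CSV bytes, UTF-8 encoded with a leading BOM."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=["url", "status", "title", "issue_count"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8-sig")


rows = [
    {"url": "https://example.com/über-uns", "status": 200,
     "title": "Über uns", "issue_count": 2},
    {"url": "https://example.com/会社概要", "status": 404,
     "title": "会社概要", "issue_count": 1},
]
data = export_csv(rows)
print(data[:3])  # b'\xef\xbb\xbf'
```

When reading such a file back in Python, open it with `encoding="utf-8-sig"` so the BOM does not end up in the first column name.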
You can also generate shareable links to your crawl results, allowing teammates or clients to browse the full interactive report without needing a Greadme account.
Crawl Scan best practices
Start broad, then focus
Run your first crawl with a generous page limit to understand the overall scope. Then use the issue-type filters to focus on one category at a time — fix all missing alt text first, then meta descriptions, then heading structure. This is more efficient than fixing pages one by one.
Fix templates before individual pages
If you see the same issue across many pages, find the shared template or component causing it. A single template fix can resolve issues on hundreds of pages at once. Always look for patterns before diving into individual fixes.
Crawl monthly
Websites change constantly — new content is added, old pages are deleted, plugins update, templates evolve. Monthly crawls catch new issues early before they compound.
Combine with Deep Scan
The most effective workflow combines both scan types: use Crawl Scan to identify which pages have issues, then run Deep Scan on your most important pages to get the full 100+ parameter analysis including performance metrics, schema validation, and AI-powered recommendations.
Pay attention to 404s
Broken pages hurt both user experience and SEO. Use the "linked from" data to fix them systematically — restore the content, set up redirects, or update the source links.
Crawler behavior and the GreadmeBot allowlist
Greadme's crawler is designed to behave responsibly toward the sites it analyzes:
- robots.txt compliance — the crawler respects robots.txt directives and skips disallowed paths
- Identified user-agent — the crawler identifies itself as GreadmeBot with a link to documentation, so site owners can recognize and allowlist it
- Rate limiting — pages are analyzed in controlled batches to avoid overwhelming your server
- Crawl-delay respect — if your robots.txt specifies a crawl delay, the crawler honors it
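The batching behavior described in the list above can be pictured as a loop that pauses between groups of requests. This is only a sketch: the batch size, the delay, and the way Greadme combines batching with a robots.txt crawl delay are not documented here, so every parameter of the hypothetical `crawl_in_batches` function is an assumption.

```python
import time


def crawl_in_batches(urls, fetch, batch_size=5, delay_seconds=1.0, sleep=time.sleep):
    """Fetch urls in fixed-size batches, pausing between batches."""
    results = []
    for start in range(0, len(urls), batch_size):
        if start > 0:
            sleep(delay_seconds)  # pause between batches, never before the first
        for url in urls[start:start + batch_size]:
            results.append(fetch(url))
    return results


# Record the pauses instead of actually sleeping, to show the pacing.
pauses = []
pages = [f"/page-{n}" for n in range(7)]
fetched = crawl_in_batches(
    pages, fetch=str.upper, batch_size=3, delay_seconds=2, sleep=pauses.append
)
print(len(fetched), pauses)  # 7 [2, 2]
```

Seven pages in batches of three produce three batches and therefore two pauses. A larger crawl delay stretches the total crawl time but keeps the load on the server low.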
Allowlisting GreadmeBot
If your firewall or bot protection blocks the crawl, you can allowlist GreadmeBot in your WAF or firewall rules. The crawler uses a clearly identifiable user-agent string, so it's easy to distinguish from malicious bots. See the Greadme bot documentation page for specific allowlisting instructions.
FAQ
How long does a crawl take?
Most crawls finish in a few minutes. Time scales with the number of pages and your server's response time — a 100-page crawl typically completes in 1–3 minutes, while a 500-page crawl on a slower server may take 10–15 minutes. The total crawl time is reported in your summary statistics.
Do I need a sitemap?
No. Crawl Scan discovers pages by following internal links, just like a search engine bot. A sitemap can help search engines, but it's not required for Crawl Scan to find your content. If a page isn't reachable through internal links, it's an orphan — and that's a finding worth knowing about.
What's the difference between Crawl Scan and Deep Scan?
Crawl Scan analyzes many pages broadly to surface site-wide and template-level issues. Deep Scan analyzes a single page in depth, running 190+ parameters including Lighthouse performance metrics, full schema validation, and AI-powered recommendations. Use Crawl Scan to find which pages need attention; use Deep Scan to fix your highest-value pages thoroughly.
Will Crawl Scan slow down my server?
It shouldn't. The crawler analyzes pages in controlled batches and honors any crawl-delay set in your robots.txt. For most production sites the load is comparable to a routine search-engine crawl. If your server is unusually small or under load, lower the max-pages setting or set a crawl-delay in robots.txt.
How does Crawl Scan handle JavaScript-rendered content?
Crawl Scan analyzes the HTML returned by the server. If your site relies heavily on client-side rendering and ships a near-empty initial HTML, important content and links may not be visible to the crawler — which is the same problem Googlebot has historically had with such sites. Server-side rendering or static generation gives the crawler (and search engines) the most complete picture.
What if my site has 10,000+ pages?
The current per-crawl cap is 500 pages. For very large sites, run multiple crawls starting from different sections (blog, product hub, docs root, etc.) using crawl depth and include-subdomains to control scope. Combine the exported CSVs to build a full picture, and use the patterns surfaced in each batch to prioritize template-level fixes that ripple across thousands of pages at once.
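Combining several exports mostly means concatenating rows and dropping pages that appear in more than one crawl. A minimal sketch, assuming each export has a `url` column (a hypothetical name; check the header of your actual CSV) and has already been read into a string:

```python
import csv
import io


def merge_exports(csv_texts, key="url"):
    """Merge CSV exports, keeping the first row seen for each URL."""
    merged = {}
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            merged.setdefault(row[key], row)
    return list(merged.values())


blog_export = "url,status,issue_count\n/blog,200,2\n/,200,0\n"
docs_export = "url,status,issue_count\n/docs,200,1\n/,200,0\n"
rows = merge_exports([blog_export, docs_export])
print([row["url"] for row in rows])  # ['/blog', '/', '/docs']
```

The homepage shows up in both crawls but only once in the merged result. If the exported files start with a byte-order mark, read them with `encoding="utf-8-sig"` before merging.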
Can I crawl a competitor's site?
Crawl Scan respects robots.txt, so the crawler will only visit paths the site owner allows for bots. It will identify itself as GreadmeBot. Crawling a competitor's public pages for research is generally permitted, but you should review the target site's terms of service and avoid running aggressive or repeated scans against sites you don't own.
How often should I re-crawl my site?
Monthly is a good baseline for most sites. Re-crawl after major releases, template changes, content migrations, or domain moves. A broken link that exists for a week is a minor inconvenience; one that persists for six months is an SEO problem.
Conclusion
Site-wide problems stay hidden when you only audit one page at a time. Crawl Scan replaces guesswork with a complete map of your site — every discoverable page checked for the SEO, accessibility, and link-health issues that compound over time. Use it to surface template-level patterns and broken-link chains, then drill into specific pages with Deep Scan to fix what matters most.
