Crawl Scan in Greadme: Site-Wide SEO Audit Tool

Saar Twito, Founder & SEO Engineer
13 min read

Hi, I'm Saar - a software engineer, SEO specialist, and lecturer who loves building tools and teaching tech.

What is Crawl Scan?

Crawl Scan is a Greadme feature that systematically discovers and audits every page on your website by following internal links — the same way a search engine bot does. You enter one URL; it returns a site-wide report covering broken links, missing alt text, meta tag gaps, heading-structure issues, and template-level problems that only appear when you analyze the whole site.

Key Facts (TL;DR)

  • Auto-discovers pages by following internal links — no sitemap required
  • Respects robots.txt and identifies as GreadmeBot for easy allowlisting
  • Configurable: max pages 50–500, crawl depth, optional include-subdomains
  • Per-page checks: image alt text, title/description/canonical/OG/Twitter tags, heading hierarchy, content length, broken internal links
  • "Linked from" data on every 404 — find every page pointing to a dead URL
  • CSV export with full Unicode support and shareable result links

How Crawl Scan works

The crawler runs as a four-step pipeline. Understanding it helps you interpret results and configure scans for maximum value.

Step 1: Starting the crawl

You provide a starting URL — typically your homepage — and configure the crawl settings. The crawler fetches that page and extracts every internal link it finds.

Step 2: Discovering pages

The crawler follows each discovered internal link, visiting new pages and finding more links. This recursive process continues until it reaches the configured page limit or has visited every discoverable page. No sitemap is required — the crawler systematically follows links to find content.
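The discovery loop described above can be sketched as a breadth-first traversal. This is a minimal illustration, not Greadme's actual implementation: the `SITE` dictionary stands in for real HTTP fetching and link extraction, and the names `discover`, `max_pages`, and `max_depth` are assumptions chosen to mirror the settings described in this article.

```python
from collections import deque

# Hypothetical in-memory site: each URL maps to the internal links found on it.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/team"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/team": [],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/blog/post-1"],
}


def discover(start, links_for, max_pages=50, max_depth=3):
    """Breadth-first discovery. Returns {url: depth} in the order pages were found."""
    found = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = found[url]
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for link in links_for(url):
            if link in found:
                continue  # already discovered; avoids loops between pages
            if len(found) >= max_pages:
                return found  # page limit reached, stop discovering
            found[link] = depth + 1
            queue.append(link)
    return found


def site_links(url):
    """Stand-in for fetching a page and extracting its internal links."""
    return SITE.get(url, [])


pages = discover("/", site_links, max_pages=50, max_depth=2)
print(pages)
```

Breadth-first order means pages closest to the starting URL are found first, so when the page limit cuts the crawl short, the pages that get dropped are the deepest ones.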

Step 3: Analyzing each page

As the crawler visits each page, it runs checks on the content, meta tags, headings, images, and internal links. Every issue is recorded with its severity, location, and type.

Step 4: Aggregating results

Once the crawl finishes, all findings are aggregated into a comprehensive report showing site-wide patterns, issue counts, and individual page details.

robots.txt awareness

Before crawling, the scanner reads your site's robots.txt file and respects its directives. If certain paths are disallowed, the crawler skips them — just like Googlebot would. The results include a full robots.txt analysis showing what restrictions are in place. For more on configuring this file correctly, see our guide on setting up robots.txt.
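To see how a crawler can apply these rules, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content is a made-up example, and only the name GreadmeBot comes from this article; the crawler's full user-agent string is not shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "GreadmeBot"  # bot name from the article; the full UA string is an unknown


def allowed(url):
    """True if the rules permit this user-agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


print(allowed("https://example.com/blog/"))
print(allowed("https://example.com/admin/users"))
print(parser.crawl_delay(USER_AGENT))
```

Because the example file has no group specific to GreadmeBot, the `User-agent: *` group applies, which is how most bots fall back when they are not named explicitly.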

What Crawl Scan checks per page

Each discovered page is evaluated against a set of SEO and content quality checks. The table below summarizes every category and why each check matters.

Category | Specific check | Why it matters
-------- | -------------- | --------------
Images | Missing alt text | Inaccessible to screen readers and invisible to search engines
Images | Generic alt text (e.g. "image1.jpg") | Provides no accessibility or SEO value
Images | Outdated image format (PNG/JPEG vs WebP) | Modern formats deliver significant file-size savings and faster page loads
Meta tags | Missing or incorrect title tag | The single most important on-page SEO element; appears in SERPs, tabs, and shares
Meta tags | Missing or incorrect meta description | Without it you lose control of your search-result snippet
Meta tags | Missing canonical tag | Search engines may index duplicate URLs and dilute SEO authority
Social tags | Missing Open Graph (og:title, og:description, og:image) | Shares on Facebook/LinkedIn show generic, unappealing previews
Social tags | Missing Twitter Card tags | X/Twitter shares lose engaging previews and click-through
Headings & content | Heading hierarchy (missing H1, multiple H1s, skipped levels) | Affects both SEO crawling and the accessibility tree
Headings & content | Content length / thin content | Very-low-content pages can hurt rankings
Link health | Broken internal links (404 pages) | Detects 404s and reports the "linked from" pages so you can fix the source

For a deeper look at fixing 404s specifically, see our guide on how to find and fix broken links.
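A few of the per-page checks in the table can be sketched with Python's standard `html.parser`. This is an illustration under simplifying assumptions, not Greadme's rule set: the class name `PageAudit`, the function `audit`, and the issue labels are hypothetical, and only five of the checks are covered.

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collects the signals needed for a handful of the checks in the table."""

    def __init__(self):
        super().__init__()
        self.has_title = False
        self.has_description = False
        self.has_canonical = False
        self.h1_count = 0
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.has_title = True  # simplification: an empty <title> still counts
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.has_description = bool((attrs.get("content") or "").strip())
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.has_canonical = True
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "img" and "alt" not in attrs:
            # Only a missing attribute is flagged; alt="" is valid for decorative images.
            self.images_missing_alt += 1


def audit(html):
    """Return a list of issue labels (hypothetical wording) for one page."""
    page = PageAudit()
    page.feed(html)
    issues = []
    if not page.has_title:
        issues.append("missing title tag")
    if not page.has_description:
        issues.append("missing meta description")
    if not page.has_canonical:
        issues.append("missing canonical tag")
    if page.h1_count == 0:
        issues.append("missing H1")
    elif page.h1_count > 1:
        issues.append("multiple H1s")
    if page.images_missing_alt:
        issues.append(f"{page.images_missing_alt} image(s) missing alt text")
    return issues


SAMPLE = """<html><head><title>Pricing</title></head>
<body><h1>Pricing</h1><h1>Plans</h1><img src="a.png"></body></html>"""
print(audit(SAMPLE))
```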

Reading your Crawl Scan results

The results are designed to give you both a bird's-eye view of your site's health and the ability to drill down into specific issues on specific pages.

Summary statistics

At the top of your results, six key metrics summarize the entire crawl:

  • Total pages found — how many pages the crawler discovered by following links
  • Pages analyzed — how many pages were checked (may differ from pages found if limits are reached)
  • Total issues — the combined count of all errors and warnings
  • 404 pages — broken or dead pages discovered
  • Clean pages — pages with zero issues
  • Crawl time — how long the analysis took, in seconds

robots.txt analysis

The results include a dedicated section for your robots.txt configuration, showing:

  • Whether a robots.txt file was found
  • Disallowed paths and directories
  • Crawl-delay settings
  • Any restrictions that may be preventing search engines from accessing important content

Check your robots.txt restrictions

It's surprisingly common for sites to accidentally block important content in robots.txt. If Crawl Scan flags critical paths as disallowed, review your robots.txt to make sure you're not unintentionally hiding content from search engines.

Filtering and sorting

With potentially hundreds of pages in your results, filtering is essential. The results include:

  • Severity filters — view all pages, or filter by errors, warnings, passed checks, or 404 status
  • Issue-type filters — focus on a single issue (e.g. "missing alt text") across every affected page
  • URL search — search by URL or page title
  • Sorting — sort by issue count, URL, or status code

Page detail view

Clicking any page opens a detail view with:

  • All issues found on the page, organized by severity
  • The complete heading structure (H1–H6 hierarchy)
  • Image information with alt-text status and format details
  • For 404 pages: a list of every page that links to this broken URL

The power of "linked from" data

When the crawler discovers a 404, it doesn't just tell you the page is broken — it tells you which other pages link to it. Fixing a 404 isn't about the dead page itself (it's already gone); it's about updating every link that points to it. This data saves the detective work of tracking down broken-link sources manually.
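Conceptually, "linked from" data is a reverse index over the crawl's link graph. The sketch below assumes the crawl has already produced two hypothetical structures, `LINKS` (outgoing internal links per page) and `STATUS` (HTTP status per URL); the real report's internal format is not documented here.

```python
from collections import defaultdict

# Hypothetical crawl output: outgoing internal links per page, and each URL's status.
LINKS = {
    "/": ["/pricing", "/old-offer"],
    "/blog": ["/old-offer", "/pricing"],
    "/pricing": ["/"],
}
STATUS = {"/": 200, "/blog": 200, "/pricing": 200, "/old-offer": 404}


def linked_from(links, status):
    """Map each 404 URL to the sorted list of pages that link to it."""
    sources = defaultdict(set)
    for page, targets in links.items():
        for target in targets:
            if status.get(target) == 404:
                sources[target].add(page)
    return {url: sorted(pages) for url, pages in sources.items()}


report = linked_from(LINKS, STATUS)

# List the 404s with the most inbound links first, since they affect the most pages.
for url, pages in sorted(report.items(), key=lambda item: len(item[1]), reverse=True):
    print(url, "<-", pages)
```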

Configuring your crawl

Crawl Scan offers three configuration options that let you tailor the analysis to your site:

Maximum pages

Control how many pages the crawler will analyze, from 50 to 500. A low limit is enough for small sites; for larger sites, raise it so the crawl covers more of the site. If you're unsure, start at 100; you can always re-run with a higher limit.

Crawl depth

Crawl depth determines how many link-clicks deep the crawler will go from your starting page. A depth of 3 means the crawler follows links up to three levels from the starting URL — usually enough to discover most pages on a well-structured site.

Include subdomains

Choose whether the crawler should also follow links to subdomains (like blog.example.com or shop.example.com). Enable this if your site uses subdomains for sections you want included in the audit.
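One way to picture the include-subdomains switch is as a scope check applied to every discovered link. The function below is a simplified assumption of how such a check could work: it compares hostnames directly, so it expects the starting URL to use the bare domain (for example `example.com` rather than `www.example.com`). The name `in_scope` is hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(link, start_url, include_subdomains=False):
    """True if `link` belongs to the crawl started at `start_url`."""
    start_host = (urlsplit(start_url).hostname or "").lower()
    host = (urlsplit(link).hostname or "").lower()
    if not host or not start_host:
        return False  # relative or malformed URLs should be resolved before this check
    if host == start_host:
        return True
    # The leading dot prevents "notexample.com" from matching "example.com".
    return include_subdomains and host.endswith("." + start_host)


START = "https://example.com/"
print(in_scope("https://example.com/about", START))
print(in_scope("https://blog.example.com/post", START))
print(in_scope("https://blog.example.com/post", START, include_subdomains=True))
```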

Common site-wide issues only crawls reveal

Certain patterns only become visible when you analyze an entire site rather than individual pages. Here are the most common site-wide issues Crawl Scan uncovers.

Template-level problems

Pattern: the same issue appears on dozens or hundreds of pages.

What it means: when you see the same issue (missing OG tags, duplicate H1 patterns) across many pages, it's usually caused by a shared template or layout component, not by individual page content. Fixing the template fixes every affected page at once.

Example: 150 of 200 pages are missing twitter:image tags because the base template doesn't include Twitter Card meta.
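If you work from exported data, a simple way to spot template-level problems is to count how many pages share each issue and flag those above a threshold. The 50% threshold, the function name, and the issue labels below are illustrative assumptions, not values Crawl Scan uses.

```python
from collections import Counter


def template_level_issues(pages, threshold=0.5):
    """Return {issue: share_of_pages} for issues found on at least `threshold` of pages."""
    if not pages:
        return {}
    # set() ensures an issue repeated on one page is counted once for that page.
    counts = Counter(issue for issues in pages.values() for issue in set(issues))
    total = len(pages)
    return {issue: n / total for issue, n in counts.items() if n / total >= threshold}


# Hypothetical per-page issue lists.
PAGES = {
    "/": ["missing twitter:image"],
    "/about": ["missing twitter:image", "multiple H1s"],
    "/blog": ["missing twitter:image"],
    "/contact": [],
}
print(template_level_issues(PAGES))
```

An issue that appears on most pages is a strong hint to look at the shared layout first; an issue on one or two pages is more likely a content problem.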

Orphaned pages

Pattern: important pages that aren't linked from anywhere.

What it means: if the crawler can't find a page by following links, search engines probably can't either. Pages that exist but aren't linked from your navigation or content are effectively invisible.

How to detect: compare the pages the crawler found with your sitemap or CMS page list. Pages in the CMS that weren't found during the crawl are likely orphaned.
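The comparison amounts to a set difference. In this sketch, `known_urls` stands for whatever list you trust (sitemap or CMS export) and `crawled_urls` for the URLs in the crawl results; both names and the trailing-slash normalization are assumptions for illustration.

```python
def normalize(url):
    """Strip whitespace and a trailing slash so equivalent URLs compare equal."""
    return url.strip().rstrip("/")


def find_orphans(known_urls, crawled_urls):
    """Return known URLs that the crawl never reached, sorted alphabetically."""
    crawled = {normalize(u) for u in crawled_urls}
    known = {normalize(u) for u in known_urls}
    return sorted(known - crawled)


known = [
    "https://example.com/",
    "https://example.com/pricing/",
    "https://example.com/legacy-landing",
]
crawled = ["https://example.com", "https://example.com/pricing"]
print(find_orphans(known, crawled))
```

If the crawl stopped at its page limit, some pages on this list may simply not have been reached yet, so check the limit before treating every result as an orphan.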

Broken-link chains

Pattern: multiple pages linking to the same 404 URL.

What it means: when a page is deleted or its URL changes without a redirect, every page that linked to it now has a broken link. The crawler's "linked from" data reveals these chains.

Priority: fix 404s with the most "linked from" pages first — they cause the most broken user experiences.

Inconsistent content quality

Pattern: newer pages are well-optimized while older pages have many issues.

What it means: as teams learn and improve their practices, newer content is often better optimized — but older content doesn't improve on its own. The crawl reveals which sections of your site have been left behind.

Exporting your crawl results

For larger sites, working with results inside the browser may not be enough. Crawl Scan supports exporting your complete results to CSV with full Unicode support for international content. The export includes every page URL with HTTP status code, page titles, issue counts and types per page, and detailed issue descriptions.
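For readers building a similar export of their own, the sketch below shows one common way to produce a Unicode-safe CSV in Python. The column names are hypothetical and may not match the headers in Greadme's file; the `utf-8-sig` encoding is a general technique for helping spreadsheet applications detect UTF-8, not a statement about how Greadme encodes its export.

```python
import csv
import io

# Hypothetical column names; the actual export's headers may differ.
FIELDS = ["url", "status", "title", "issue_count", "issues"]


def export_csv_bytes(rows):
    """Serialize rows to CSV bytes with a UTF-8 byte-order mark."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        # Join the per-page issue list into a single cell.
        writer.writerow({**row, "issues": "; ".join(row["issues"])})
    # "utf-8-sig" prepends a BOM so tools such as Excel recognize the encoding.
    return buffer.getvalue().encode("utf-8-sig")


rows = [
    {
        "url": "https://example.com/日本語",
        "status": 200,
        "title": "こんにちは",
        "issue_count": 1,
        "issues": ["missing meta description"],
    },
]
data = export_csv_bytes(rows)
print(len(data), "bytes")
```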

You can also generate shareable links to your crawl results, allowing teammates or clients to browse the full interactive report without needing a Greadme account.

Crawl Scan best practices

Start broad, then focus

Run your first crawl with a generous page limit to understand the overall scope. Then use the issue-type filters to focus on one category at a time — fix all missing alt text first, then meta descriptions, then heading structure. This is more efficient than fixing pages one by one.

Fix templates before individual pages

If you see the same issue across many pages, find the shared template or component causing it. A single template fix can resolve issues on hundreds of pages at once. Always look for patterns before diving into individual fixes.

Crawl monthly

Websites change constantly — new content is added, old pages are deleted, plugins update, templates evolve. Monthly crawls catch new issues early before they compound.

Combine with Deep Scan

The most effective workflow combines both scan types: use Crawl Scan to identify which pages have issues, then run Deep Scan on your most important pages for the full single-page analysis, including performance metrics, schema validation, and AI-powered recommendations.

Pay attention to 404s

Broken pages hurt both user experience and SEO. Use the "linked from" data to fix them systematically — restore the content, set up redirects, or update the source links.

Crawler behavior and the GreadmeBot allowlist

Greadme's crawler is designed to behave responsibly toward the sites it analyzes:

  • robots.txt compliance — the crawler respects robots.txt directives and skips disallowed paths
  • Identified user-agent — the crawler identifies itself as GreadmeBot with a link to documentation, so site owners can recognize and allowlist it
  • Rate limiting — pages are analyzed in controlled batches to avoid overwhelming your server
  • Crawl-delay respect — if your robots.txt specifies a crawl delay, the crawler honors it
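The batching and crawl-delay behavior in the list above can be pictured with a small sketch. The batch size, the delay, and the function name are hypothetical; Greadme's actual values and concurrency model are not documented in this article, and pages here are fetched one after another for simplicity.

```python
import time


def crawl_in_batches(urls, fetch, batch_size=5, delay_seconds=1.0, sleep=time.sleep):
    """Fetch URLs in fixed-size batches, pausing between batches."""
    results = {}
    for start in range(0, len(urls), batch_size):
        if start:
            sleep(delay_seconds)  # pause between batches, not before the first one
        for url in urls[start:start + batch_size]:
            results[url] = fetch(url)
    return results


# Demo with a fake fetcher and a recorded sleep, so nothing actually waits.
pauses = []
statuses = crawl_in_batches(
    [f"/page-{i}" for i in range(12)],
    fetch=lambda url: 200,
    batch_size=5,
    delay_seconds=2.0,
    sleep=pauses.append,
)
print(len(statuses), "pages fetched with", len(pauses), "pauses")
```

When robots.txt sets a crawl-delay, a crawler would typically use that value as the minimum pause instead of its own default.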

Allowlisting GreadmeBot

If your firewall or bot protection blocks the crawl, you can allowlist GreadmeBot in your WAF or firewall rules. The crawler uses a clearly identifiable user-agent string, so it's easy to distinguish from malicious bots. See the Greadme bot documentation page for specific allowlisting instructions.

FAQ

How long does a crawl take?

Most crawls finish in a few minutes. Time scales with the number of pages and your server's response time — a 100-page crawl typically completes in 1–3 minutes, while a 500-page crawl on a slower server may take 10–15 minutes. The total crawl time is reported in your summary statistics.

Do I need a sitemap?

No. Crawl Scan discovers pages by following internal links, just like a search engine bot. A sitemap can help search engines, but it's not required for Crawl Scan to find your content. If a page isn't reachable through internal links, it's an orphan — and that's a finding worth knowing about.

What's the difference between Crawl Scan and Deep Scan?

Crawl Scan analyzes many pages broadly to surface site-wide and template-level issues. Deep Scan analyzes a single page in depth, running 190+ parameters including Lighthouse performance metrics, full schema validation, and AI-powered recommendations. Use Crawl Scan to find which pages need attention; use Deep Scan to fix your highest-value pages thoroughly.

Will Crawl Scan slow down my server?

It shouldn't. The crawler analyzes pages in controlled batches and honors any crawl-delay set in your robots.txt. For most production sites the load is comparable to a routine search-engine crawl. If your server is unusually small or under load, lower the max-pages setting or set a crawl-delay in robots.txt.

How does Crawl Scan handle JavaScript-rendered content?

Crawl Scan analyzes the HTML returned by the server. If your site relies heavily on client-side rendering and ships a near-empty initial HTML, important content and links may not be visible to the crawler — which is the same problem Googlebot has historically had with such sites. Server-side rendering or static generation gives the crawler (and search engines) the most complete picture.

What if my site has 10,000+ pages?

The current per-crawl cap is 500 pages. For very large sites, run multiple crawls starting from different sections (blog, product hub, docs root, etc.) using crawl depth and include-subdomains to control scope. Combine the exported CSVs to build a full picture, and use the patterns surfaced in each batch to prioritize template-level fixes that ripple across thousands of pages at once.
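Combining exports from several crawls is mostly a matter of de-duplicating by URL, since section crawls often overlap on shared pages such as the homepage. The sketch below assumes each export has a `url` column, which is a hypothetical header name; adjust it to match the real file.

```python
import csv
import io


def merge_exports(csv_texts, key="url"):
    """Merge several CSV exports, keeping the first row seen for each URL."""
    merged = {}
    for text in csv_texts:
        # Drop a leading byte-order mark if the file was decoded as plain UTF-8.
        reader = csv.DictReader(io.StringIO(text.lstrip("\ufeff")))
        for row in reader:
            merged.setdefault(row[key], row)
    return list(merged.values())


# Hypothetical exports from two section crawls that both include the homepage.
BLOG = "url,status,issue_count\nhttps://example.com/blog,200,2\nhttps://example.com/,200,0\n"
DOCS = "url,status,issue_count\nhttps://example.com/docs,200,1\nhttps://example.com/,200,0\n"

combined = merge_exports([BLOG, DOCS])
for row in combined:
    print(row["url"], row["issue_count"])
```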

Can I crawl a competitor's site?

Crawl Scan respects robots.txt, so the crawler will only visit paths the site owner allows for bots. It will identify itself as GreadmeBot. Crawling a competitor's public pages for research is generally permitted, but you should review the target site's terms of service and avoid running aggressive or repeated scans against sites you don't own.

How often should I re-crawl my site?

Monthly is a good baseline for most sites. Re-crawl after major releases, template changes, content migrations, or domain moves. A broken link that exists for a week is a minor inconvenience; one that persists for six months is an SEO problem.

Conclusion

Site-wide problems stay hidden when you only audit one page at a time. Crawl Scan replaces guesswork with a complete map of your site — every discoverable page checked for the SEO, accessibility, and link-health issues that compound over time. Use it to surface template-level patterns and broken-link chains, then drill into specific pages with Deep Scan to fix what matters most.