Crawl Scan: The Complete Guide to Full-Site Website Crawling and Analysis

13 min read

Why Auditing One Page Is Never Enough

Imagine you're a health inspector visiting a restaurant. You wouldn't just check the front counter and declare the place safe — you'd inspect the kitchen, the storage room, the refrigerators, the bathrooms, and every corner where problems might hide. A single clean surface tells you nothing about the overall condition.

Websites work the same way. Your homepage might be perfectly optimized, but what about the 200 other pages on your site? The blog post from 2021 with broken images? The product page with no meta description? The old landing page that returns a 404? Problems spread across a website are invisible when you only look at one page at a time.

Greadme's Crawl Scan solves this by automatically discovering and analyzing every page on your website. It follows internal links just like a search engine bot would, systematically checking each page for SEO issues, missing meta tags, accessibility problems, broken links, and more. You enter one URL and get a comprehensive health report for your entire site.

Crawl Scan at a Glance:

  • Automatic Page Discovery: Follows every internal link to find all pages — no sitemap required
  • Real-Time Progress: Watch the crawl happen live with running statistics
  • Comprehensive Issue Detection: Checks for missing alt text, meta tags, headings, broken links, and more
  • Site-Wide Perspective: See aggregated issue counts and patterns across your whole site

How Crawl Scan Works

Understanding the crawling process helps you interpret your results and configure your scans for maximum value.

Step 1: Starting the Crawl

You provide a starting URL — typically your homepage — and configure the crawl settings. The crawler begins by fetching that page and extracting every internal link it finds.

Step 2: Discovering Pages

The crawler follows each discovered internal link, visiting new pages and finding even more links. This recursive process continues until it reaches the configured page limit or has visited every discoverable page on your site. The crawler works with or without a sitemap — it systematically follows every internal link to find all content.
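To make the discovery step concrete, here is a minimal breadth-first crawler sketch in Python. It assumes the `requests` and `beautifulsoup4` packages are installed, and it illustrates the general link-following technique rather than Greadme's actual implementation; the `max_pages` cap mirrors the page limit described later in this guide.

```python
# Minimal sketch of breadth-first page discovery by following internal links.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def discover_pages(start_url: str, max_pages: int = 100) -> set[str]:
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    visited: set[str] = set()

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # unreachable pages are skipped, not fatal
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue

        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            # Only follow internal links on the same host.
            if urlparse(link).netloc == domain and link not in visited:
                queue.append(link)

    return visited
```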

Step 3: Analyzing Each Page

As the crawler visits each page, it performs a series of checks on the page's content, meta tags, headings, images, and links. Every issue found is recorded with its severity, location, and type.
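Conceptually, each page analysis is a function from fetched HTML to a list of findings. The sketch below shows a few representative checks; the issue names, severities, and rules are assumptions for illustration, not Greadme's exact checks.

```python
# Illustrative per-page checks in the spirit of Step 3.
from bs4 import BeautifulSoup

def analyze_page(url: str, html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    issues = []

    # Title present and non-empty?
    if not soup.title or not soup.title.string or not soup.title.string.strip():
        issues.append({"url": url, "type": "missing_title", "severity": "error"})

    # Meta description present?
    if not soup.find("meta", attrs={"name": "description"}):
        issues.append({"url": url, "type": "missing_meta_description", "severity": "warning"})

    # Exactly one H1 expected.
    if len(soup.find_all("h1")) != 1:
        issues.append({"url": url, "type": "h1_count", "severity": "warning"})

    # Every image should carry alt text.
    for img in soup.find_all("img"):
        if not img.get("alt", "").strip():
            issues.append({"url": url, "type": "missing_alt_text", "severity": "warning"})

    return issues
```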

Step 4: Aggregating Results

Once the crawl is complete, all findings are aggregated into a comprehensive report showing site-wide patterns, issue counts, and individual page details.

robots.txt Awareness

Before crawling, the scanner reads your site's robots.txt file and respects its directives. If certain paths are disallowed, the crawler will skip them — just like Googlebot would. The results include a full robots.txt analysis showing what restrictions are in place and which paths are blocked.
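If you want to replicate this courtesy in your own scripts, Python's standard library ships a robots.txt parser. A minimal sketch, where the user-agent string is illustrative:

```python
# Checking robots.txt before fetching, using only the standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("GreadmeBot", "https://example.com/private/page"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt; skipping")

# Honor any declared crawl delay (returns None if not specified).
delay = rp.crawl_delay("GreadmeBot")
print(delay)
```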

What Crawl Scan Checks on Every Page

Each page the crawler visits is evaluated against a comprehensive set of SEO and content quality checks. Here's what it looks for:

Image Analysis

Missing Alt Text

Images without alt text are inaccessible to screen readers and invisible to search engines. The crawler identifies every image missing alternative text across your entire site.

Generic Alt Text

Alt text like "image1.jpg" or "photo" provides no value. The crawler detects alt text that is too generic to be useful for accessibility or SEO purposes.
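A check like this usually comes down to simple heuristics. The sketch below flags alt text that is just a filename or a single stock word; the patterns and length cutoff are assumptions, and a production check would likely be more nuanced.

```python
# A rough heuristic for "generic" alt text.
import re

GENERIC_PATTERNS = re.compile(
    r"^(image|img|photo|picture|graphic|icon)?\d*$|\.(jpe?g|png|gif|webp)$",
    re.IGNORECASE,
)

def is_generic_alt(alt: str) -> bool:
    text = alt.strip()
    return bool(text) and (len(text) < 5 or GENERIC_PATTERNS.search(text) is not None)

print(is_generic_alt("image1.jpg"))  # True: a filename used as alt text
print(is_generic_alt("photo"))       # True: a single generic word
print(is_generic_alt("Golden retriever puppy chasing a ball"))  # False
```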

Image Format Optimization

The crawler identifies images served in older formats like PNG or JPEG that could be converted to modern formats like WebP for significant file size savings.

Meta Tag Checks

Missing or Incorrect Title Tags

The crawler flags pages without title tags, as well as titles that are too long or too short. The title tag is the single most important on-page SEO element: it appears in search results, browser tabs, and social shares.

Missing or Incorrect Meta Descriptions

Pages without meta descriptions lose control over their search result snippet. The crawler flags every page where the description is missing, too short, or too long.
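As a rough illustration of both length checks, the sketch below applies the commonly cited 30-60 character range for titles and 70-160 for descriptions. These are industry rules of thumb, not necessarily the exact thresholds the crawler uses.

```python
# Illustrative title and meta description length checks.
def check_title(title: str) -> str | None:
    length = len(title.strip())
    if length == 0:
        return "missing_title"
    if length < 30:
        return "title_too_short"
    if length > 60:
        return "title_too_long"
    return None  # passes

def check_description(description: str) -> str | None:
    length = len(description.strip())
    if length == 0:
        return "missing_meta_description"
    if length < 70:
        return "description_too_short"
    if length > 160:
        return "description_too_long"
    return None  # passes
```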

Missing Canonical Tags

Without canonical tags, search engines may index duplicate versions of your pages, diluting your SEO authority across multiple URLs.

Social Media Tags

Missing Open Graph Tags

Pages shared on Facebook, LinkedIn, or other platforms without OG tags display generic, unappealing previews. The crawler identifies pages missing og:title, og:description, og:image, and other essential OG tags.

Missing Twitter Card Tags

Similar to OG tags, Twitter Card tags control how your content appears when shared on X/Twitter. Missing tags mean missed opportunities for engaging social previews.
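Both social-tag checks reduce to verifying that certain `<meta>` elements exist. A sketch follows; the list of "essential" tags here is an assumption based on common practice (note that OG tags use the `property` attribute while Twitter Card tags use `name`).

```python
# Sketch of a social-tag presence check.
from bs4 import BeautifulSoup

ESSENTIAL_OG = ["og:title", "og:description", "og:image"]
ESSENTIAL_TWITTER = ["twitter:card", "twitter:title", "twitter:image"]

def missing_social_tags(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # Collect whatever each meta tag declares via property= or name=.
    present = {m.get("property") or m.get("name") for m in soup.find_all("meta")}
    return [tag for tag in ESSENTIAL_OG + ESSENTIAL_TWITTER if tag not in present]
```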

Content Structure

Heading Structure (H1-H6)

The crawler validates the heading hierarchy on every page — checking for missing H1 tags, multiple H1s, skipped heading levels, and other structural issues that affect both SEO and accessibility.
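A hierarchy check like this can be expressed compactly: collect the heading levels in document order, then verify the H1 count and look for skipped levels. A sketch with illustrative issue names:

```python
# Sketch of heading-hierarchy validation: exactly one H1, no skipped levels.
from bs4 import BeautifulSoup

def heading_issues(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    levels = [int(h.name[1]) for h in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])]
    issues = []

    h1_count = levels.count(1)
    if h1_count == 0:
        issues.append("missing_h1")
    elif h1_count > 1:
        issues.append("multiple_h1")

    for prev, curr in zip(levels, levels[1:]):
        if curr > prev + 1:  # e.g. jumping from H2 straight to H4
            issues.append(f"skipped_level_h{prev}_to_h{curr}")

    return issues
```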

Content Length

Pages with very little content may be flagged as thin content, which can negatively impact search rankings. The crawler identifies pages that may need more substantial content.

Link Health

Broken Internal Links (404 Pages)

The crawler detects pages returning 404 errors and, critically, identifies which other pages are linking to them. This lets you find and fix the broken links at their source, not just discover that dead pages exist.
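This works because a crawler sees each link before it sees the 404: while crawling, it can record every (source page, target URL) pair, so when a target turns out to be dead, its sources are already known. A sketch of that bookkeeping, with illustrative data structures:

```python
# Sketch of recording "linked from" data during a crawl.
from collections import defaultdict

linked_from: dict[str, set[str]] = defaultdict(set)

def record_link(source_page: str, target_url: str) -> None:
    linked_from[target_url].add(source_page)

def broken_link_report(status_codes: dict[str, int]) -> list[tuple[str, set[str]]]:
    # List 404s sorted by how many pages link to them, worst offenders first.
    dead = [(url, linked_from[url]) for url, code in status_codes.items() if code == 404]
    return sorted(dead, key=lambda item: len(item[1]), reverse=True)
```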

Reading Your Crawl Scan Results

The results are designed to give you both a bird's-eye view of your site's health and the ability to drill down into specific issues on specific pages.

Summary Statistics

At the top of your results, you'll see six key metrics that summarize the entire crawl:

  • Total Pages Found: How many pages the crawler discovered by following links
  • Pages Analyzed: How many pages were checked (may differ from pages found if limits are reached)
  • Total Issues: The combined count of all errors and warnings found
  • 404 Pages: How many broken/dead pages were discovered
  • Clean Pages: Pages with zero issues — your "healthy" content
  • Crawl Time: How long the analysis took in seconds

robots.txt Analysis

The results include a dedicated section for your robots.txt configuration, showing:

  • Whether a robots.txt file was found
  • Disallowed paths and directories
  • Crawl delay settings
  • Any restrictions that may be preventing search engines from accessing important content

Check Your robots.txt Restrictions

It's surprisingly common for websites to accidentally block important content in robots.txt. If the Crawl Scan shows that critical pages or directories are disallowed, review your robots.txt file to ensure you're not unintentionally hiding content from search engines.

Filtering and Sorting

With potentially hundreds of pages in your results, effective filtering is essential. The Crawl Scan results include:

  • Severity filters: View all pages, or filter to show only pages with errors, warnings, passed checks, or 404 status
  • Issue type filters: Focus on specific issues like "missing alt text" or "missing H1" to see every page with that particular problem
  • URL search: Search for specific pages by URL or title
  • Sorting options: Sort by issue count, URL, or status code to prioritize your review

Page Detail View

Clicking on any page in your results opens a detailed view showing:

  • All issues found on that page, organized by severity
  • The complete heading structure (H1-H6 hierarchy)
  • Image information with alt text status and format details
  • For 404 pages: a list of all pages that link to this broken URL

The Power of "Linked From" Data

When the crawler discovers a 404 page, it doesn't just tell you the page is broken — it tells you which other pages link to it. This is critical because fixing a 404 isn't about the dead page itself (it's already gone). It's about finding and updating every link that points to it. This "linked from" data saves you the detective work of tracking down broken link sources manually.

Configuring Your Crawl

Crawl Scan offers several configuration options that let you tailor the analysis to your needs:

Maximum Pages

Control how many pages the crawler will analyze. Options range from 50 to 500 pages. For smaller sites, a lower limit is sufficient. For larger sites, increase the limit to ensure comprehensive coverage. Start with 100 pages if you're unsure — you can always run a follow-up crawl with a higher limit.

Crawl Depth

Crawl depth determines how many link-clicks deep the crawler will go from your starting page. A depth of 3 means the crawler will follow links up to three levels away from the starting URL. This is usually sufficient to discover most pages on a well-structured site.

Include Subdomains

Choose whether the crawler should also follow links to subdomains (like blog.example.com or shop.example.com). Enable this if your site uses subdomains for different sections that you want included in the audit.
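Taken together, these options might look like the following configuration object. The field names and defaults are hypothetical, chosen to mirror the descriptions above rather than Greadme's actual API:

```python
# A hypothetical crawl configuration mirroring the options described above.
from dataclasses import dataclass

@dataclass
class CrawlConfig:
    start_url: str
    max_pages: int = 100         # 50-500, per the options above
    max_depth: int = 3           # link-clicks away from the starting URL
    include_subdomains: bool = False

config = CrawlConfig("https://example.com", max_pages=200, include_subdomains=True)
```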

Common Site-Wide Issues Crawl Scan Reveals

Certain patterns only become visible when you analyze an entire site rather than individual pages. Here are the most common site-wide issues Crawl Scan uncovers:

Template-Level Problems

Pattern: The same issue appears on dozens or hundreds of pages

What it means: When you see the same issue (like missing OG tags or duplicate H1 patterns) across many pages, it's usually caused by a template or layout component, not by individual page content. Fixing the template fixes every affected page at once.

Example: 150 out of 200 pages are missing twitter:image tags because the site's base template doesn't include Twitter Card meta tags.
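Detecting this pattern from raw crawl output is straightforward: count how many distinct pages each issue type affects and flag anything above a chosen share of the site. A sketch, with an arbitrary 50% threshold:

```python
# Sketch of spotting template-level problems from a flat list of issues.
from collections import Counter

def template_level_issues(issues: list[dict], total_pages: int,
                          threshold: float = 0.5) -> list[str]:
    pages_per_issue = Counter()
    seen = set()
    for issue in issues:
        key = (issue["type"], issue["url"])
        if key not in seen:  # count each page at most once per issue type
            seen.add(key)
            pages_per_issue[issue["type"]] += 1
    # Issue types affecting a large share of pages likely live in a template.
    return [t for t, n in pages_per_issue.items() if n / total_pages >= threshold]
```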

Orphaned Pages

Pattern: Important pages that aren't linked from anywhere

What it means: If the crawler can't find a page by following links, search engines probably can't either. Pages that exist but aren't linked from your navigation or content are effectively invisible.

How to detect: Compare the pages the crawler found with your sitemap or CMS page list. Any pages in your CMS that weren't found during the crawl are likely orphaned.
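Here is one way to automate that comparison, using only the standard library plus the `discover_pages` sketch from earlier. The sitemap URL is an example, and the code assumes a standard sitemap.org XML sitemap:

```python
# Sketch of orphan detection: sitemap URLs minus crawl-discovered URLs.
import urllib.request
import xml.etree.ElementTree as ET

def sitemap_urls(sitemap_url: str) -> set[str]:
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return {loc.text.strip() for loc in tree.findall(".//sm:loc", ns)}

crawled = discover_pages("https://example.com")  # from the earlier sketch
orphans = sitemap_urls("https://example.com/sitemap.xml") - crawled
print(f"{len(orphans)} pages in the sitemap were never reached by links")
```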

Broken Link Chains

Pattern: Multiple pages linking to the same 404 URL

What it means: When a page is deleted or its URL changes without a redirect, every page that linked to it now has a broken link. The crawler's "linked from" data reveals these chains, showing you which active pages are directing users to dead ends.

Priority: Fix 404s with the most "linked from" pages first — they're causing the most broken user experiences.

Inconsistent Content Quality

Pattern: Newer pages are well-optimized while older pages have many issues

What it means: As teams learn and improve their practices, newer content is often better optimized. But older content doesn't improve on its own. The crawl reveals which sections of your site have been "left behind" and need attention.

Exporting Your Crawl Results

For larger sites, working through crawl results in the browser alone may not be practical. Crawl Scan supports exporting your complete results to CSV format with full Unicode support for international content.

The export includes:

  • Every page URL with its HTTP status code
  • Page titles
  • Issue counts and types per page
  • Detailed issue descriptions

This is particularly valuable for development teams that work from spreadsheets or project management tools. Import the CSV into your favorite tool, assign issues to team members, and track fixes systematically.
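As one example of that workflow, the sketch below tallies the most common issue types from an exported CSV. The column name (`issue_types`, semicolon-separated) is an assumption about the export format, so adjust it to match your actual file; the `utf-8-sig` encoding tolerates a byte-order mark, which Unicode-friendly CSV exports often include.

```python
# Sketch of triaging an exported crawl CSV with the standard library.
import csv
from collections import Counter

issue_counts = Counter()
with open("crawl-results.csv", newline="", encoding="utf-8-sig") as f:
    for row in csv.DictReader(f):
        for issue_type in row.get("issue_types", "").split(";"):
            if issue_type:
                issue_counts[issue_type] += 1

# Print the ten most widespread issue types, biggest first.
for issue_type, count in issue_counts.most_common(10):
    print(f"{count:4d}  {issue_type}")
```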

You can also generate shareable links to your crawl results, allowing team members or clients to browse the full interactive results without needing a Greadme account.

Crawl Scan Best Practices

Start Broad, Then Focus

Run your first crawl with a generous page limit to understand the overall scope of issues. Then use the issue type filters to focus on one category at a time — fix all missing alt text first, then move to meta descriptions, then heading structure. This systematic approach is more efficient than fixing pages one by one.

Fix Templates Before Individual Pages

If you see the same issue across many pages, find the shared template or component that's causing it. A single template fix can resolve issues on hundreds of pages simultaneously. Always look for patterns before diving into individual page fixes.

Crawl Monthly

Websites change constantly — new content is added, old pages are deleted, plugins are updated, and templates evolve. Running a monthly crawl ensures you catch new issues early before they compound. A broken link that exists for a week is a minor inconvenience; one that persists for six months is an SEO problem.

Combine with Deep Scans

The most effective audit workflow combines both scan types: use Crawl Scan to identify which pages have issues across your site, then use Deep Scan on your most important pages to get the full 100+ parameter analysis including performance metrics, schema validation, and AI-powered recommendations.

Pay Attention to 404s

Broken pages are one of the most damaging issues for both user experience and SEO. Every 404 page is a dead end for users and a wasted opportunity for search engines. Use the "linked from" data to fix these systematically — either by restoring the content, setting up redirects, or updating the links on pages that point to them.

Understanding the Crawler's Behavior

Greadme's crawler is designed to behave responsibly and respectfully toward the websites it analyzes:

  • robots.txt compliance: The crawler respects robots.txt directives and will not access paths that are disallowed
  • Identified user-agent: The crawler identifies itself as GreadmeBot with a link to documentation, making it easy for site owners to recognize and allowlist
  • Rate limiting: Pages are analyzed in controlled batches to avoid overwhelming your server with requests
  • Crawl delay respect: If your robots.txt specifies a crawl delay, the crawler honors it
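The list above describes polite crawling; the sketch below shows what that behavior amounts to in practice: an identifying user-agent, a minimum pause between requests, and deference to any robots.txt crawl delay. The user-agent string and delay values here are illustrative, not Greadme's actual values.

```python
# Sketch of polite fetching with an identifying user-agent and rate limiting.
import time
import requests

USER_AGENT = "GreadmeBot (+https://example.com/bot-docs)"  # illustrative UA

def polite_fetch(urls: list[str], crawl_delay: float | None,
                 default_delay: float = 0.5):
    # Use the robots.txt delay if declared, otherwise a sensible floor.
    delay = max(crawl_delay or 0, default_delay)
    for url in urls:
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
        yield url, resp
        time.sleep(delay)  # pause between requests so we never hammer the server
```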

Allowlisting GreadmeBot

If your site's firewall or bot protection blocks the crawl, you can allowlist GreadmeBot by adding it to your WAF or firewall rules. The crawler uses a clearly identifiable user-agent string, making it easy to distinguish from malicious bots. Visit the Greadme bot documentation page for specific allowlisting instructions.

Conclusion: See Your Entire Website Clearly

The most dangerous website problems are the ones you don't know about. A homepage that looks perfect might coexist with dozens of broken links, hundreds of images missing alt text, and content pages that search engines can't properly understand. Without crawling your entire site, these issues remain invisible — silently eroding your SEO, accessibility, and user experience.

Crawl Scan transforms website maintenance from a guessing game into a data-driven process. By automatically discovering every page and systematically checking each one for issues, it gives you the complete picture that individual page audits can never provide.

The pattern-level insights are particularly powerful. When you can see that 60% of your pages are missing Open Graph tags, you know it's a template issue. When you discover that a deleted product page has 15 other pages linking to it, you know exactly where to focus your fixes. These site-wide patterns are only visible through systematic crawling.

Start with a crawl. Understand the landscape. Fix the biggest issues first. Then keep crawling regularly to maintain the health of your site as it grows and changes. Your website is alive — it needs regular check-ups, not just one-time fixes.

Ready to discover what's hiding across your website?

Run a Crawl Scan to automatically discover and analyze every page on your site. Find missing alt tags, broken links, SEO issues, and more — all in one comprehensive report with real-time progress tracking.

Start Your First Crawl Scan