
Fix Crawl Errors and Improve Crawlability: A Technical Guide for AI Visibility

Fix crawl errors and improve crawlability with this technical guide. Identify 404s, soft 404s, and server errors using GSC and Screaming Frog. Learn how to prioritize fixes, resolve redirect chains, and optimize crawl budget so AI crawlers can actually reach your content and cite it.

Liam Dunne
Growth marketer and B2B demand specialist with expertise in AI search optimisation - I've worked with 50+ firms, scaled some to 8-figure ARR, and managed $400k+/mo budgets.
February 24, 2026
12 mins

Updated February 24, 2026

TL;DR: Crawl errors don't just hurt your Google rankings; they block AI retrieval models from ingesting your content entirely. Use Google Search Console's Page Indexing report and a crawler like Screaming Frog to identify 404s, soft 404s, and 5xx server errors. Fix server errors and robots.txt blocks first, then address redirect chains and duplicate content. A technically clean site is the non-negotiable foundation for AI citation rates: when GPTBot or ClaudeBot hits a wall, they move on to a competitor who doesn't have one.

Crawlability is the bedrock of Answer Engine Optimization (AEO). When search bots and AI agents encounter friction (404s, redirect loops, server timeouts), they stop crawling. Your content becomes invisible to Google, ChatGPT, Perplexity, and every AI platform your buyers use to shortlist vendors.

This guide covers how to identify, prioritize, and fix these errors systematically. It's written for technical SEO managers and developers who need specific steps, not a primer on what SEO is.


Why crawlability failures kill your AI share of voice

Crawlability vs. indexability: what's the difference?

Technical teams often use these two terms interchangeably, but they describe two different gates your content must pass through. Crawlability is about discovery: can a bot access the page through links, sitemaps, or external references? Indexability is about inclusion: once crawled, can the page be stored and surfaced in a search index?

A page can be crawlable but not indexable (for example, if it carries a noindex meta tag). It can also be uncrawlable, which means it never reaches the indexability question at all. Both failures produce the same outcome: your content doesn't appear. But the fix for each is completely different, which is why diagnosing the correct failure mode first matters.
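To make the two gates concrete, here is a minimal Python sketch that reports which gate a page fails. The `diagnose` helper is illustrative, not a production audit tool, and its noindex detection is a naive string check; a real audit should parse the DOM.

```python
from urllib import robotparser

def diagnose(robots_txt: str, url: str, html: str) -> str:
    """Report which gate a page fails: crawling, indexing, or neither."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    if not rp.can_fetch("*", url):
        # Gate 1: the bot never fetches the page, so indexability is moot
        return "not crawlable (robots.txt)"
    if 'name="robots"' in html and "noindex" in html:
        # Gate 2: fetched, but excluded from the index
        # (naive string check for illustration only)
        return "crawlable but not indexable (noindex)"
    return "crawlable and indexable"
```

Calling `diagnose` with a robots.txt that disallows `/admin/` and a URL under that path returns the first label without ever looking at the HTML, which mirrors the point above: an uncrawlable page never reaches the indexability question.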

How crawl budget affects AI citation speed

Google defines crawl budget as the set of URLs that Googlebot can and wants to crawl, determined by crawl capacity (how fast your server responds) and crawl demand (how popular and fresh your content is). For most sites under a few thousand regularly updated pages, Googlebot manages this automatically. For larger SaaS sites with documentation, changelog pages, multiple product lines, and dynamic URL parameters, crawl budget becomes the speed limit on how fast new content reaches the index.

The AI angle makes this urgent. Google confirms that 5xx errors and server timeouts cause crawl rate to drop. For AI systems running RAG (Retrieval-Augmented Generation) pipelines, the problem compounds: if your site throws errors when an AI crawler fetches it, the crawler doesn't wait. It indexes a competitor's cleaner site instead.

Think of technical crawl health as ingestion insurance. You cannot win AI share of voice if the bots cannot reach your content. For context on how different AI platforms handle retrieval, our comparison of Google AI Overviews vs. ChatGPT vs. Perplexity explains where those retrieval differences matter most.


How to diagnose crawl errors using GSC and Screaming Frog

You need two data sources: historical data from GSC (what has Google already encountered?) and live crawl data from a tool like Screaming Frog (what does your site look like right now?). Neither alone gives you the full picture.

Identifying soft 404s and server errors in Search Console

The Page Indexing report in Google Search Console is your starting point. Navigate to "Indexing" in the left rail, then click "Pages." You'll see a breakdown of indexed pages, excluded pages, and pages with errors. Filter by error type to isolate your highest-priority issues.

Review this report weekly for active sites and at least monthly for slower-moving ones. Active review is critical because GSC displays issues for 90 days after they are resolved, so gaps in monitoring can leave you working from stale or incomplete error data. For individual URLs, use the URL Inspection tool: paste any URL and GSC returns its last crawl date, crawl status, indexing status, and specific errors detected.

Understanding soft 404s vs. hard 404s:

A hard 404 returns the correct HTTP status code (404 Not Found), telling Googlebot definitively that the page doesn't exist. A soft 404 returns a 200 OK status while displaying error content or nothing at all. Google interprets these as pages that should return a 404, but the server disagrees.

Soft 404s consume crawl budget on valueless pages because the server signals success and Googlebot keeps returning. A large volume of soft 404s means fewer of your valid pages get crawled in each cycle. Fix soft 404s before standard 404s when prioritizing your audit.
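A quick heuristic for separating the two in an audit script, as a sketch: the `classify_response` helper and its error-marker list are illustrative assumptions, fed by status codes and response bodies you've already fetched.

```python
def classify_response(status: int, body: str) -> str:
    """Heuristic soft-404 detector for responses you've already fetched."""
    error_markers = ("page not found", "no longer exists", "404")
    if status == 404:
        return "hard 404"
    if status == 200:
        text = body.lower()
        # A 200 with near-empty or error-flavored content is a soft-404 candidate
        if len(text.strip()) < 100 or any(m in text for m in error_markers):
            return "possible soft 404"
        return "ok"
    return f"other ({status})"
```

Anything flagged as a possible soft 404 should be fixed to return a real 404 (or 410), so the server stops signaling success for valueless pages.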

Spotting redirect chains and orphaned pages with crawlers

After running a Screaming Frog crawl, navigate to the "Response Codes" tab and filter by "Client Error (4xx)" to surface broken internal links.

For redirect chains, the Redirect Chains report in Screaming Frog maps every hop in a chain, the total number of redirects, and whether any resolve into loops. Configure the crawler by going to Configuration > Spider > Advanced and enabling "Always Follow Redirects." Set the max redirects to at least 5 to capture longer chains that penalize crawl efficiency.

Orphaned pages (those with no internal links pointing to them) are found by cross-referencing your sitemap URLs against the crawler's discovered URL list. If a URL appears in your sitemap but has zero inbound internal links, bots navigating by following links cannot discover it. This connects directly to internal linking strategy: every page you want cited needs a clear discovery path.
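The cross-reference itself is a simple set operation. A sketch, assuming you've exported your sitemap URLs and per-URL inbound internal link counts from your crawler; `find_orphans` is a hypothetical helper name.

```python
def find_orphans(sitemap_urls, inlink_counts):
    """URLs present in the sitemap but with zero inbound internal links.

    sitemap_urls: iterable of URL strings from your XML sitemap
    inlink_counts: dict mapping URL -> number of inbound internal links
    """
    return sorted(u for u in sitemap_urls if inlink_counts.get(u, 0) == 0)
```

Every URL this returns needs either an internal link added from a relevant page or removal from the sitemap if it genuinely shouldn't exist.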


Step-by-step fixes for high-impact crawl issues

Priority framework: what to fix first

Fix crawl issues in this order (for complete context, see our technical SEO audit checklist), because each category affects crawl capacity differently:

  1. Unintentional robots.txt blocks - these halt crawling entirely and take minutes to fix
  2. 5xx server errors - these directly reduce your crawl rate limit
  3. High-value 404s - pages with backlinks or historical traffic losing equity
  4. Redirect chains and loops - these waste crawl budget on every request
  5. Soft 404s - budget wasters, but lower urgency than server errors

Resolving 5xx server errors and connectivity issues

Problem: 5xx errors (500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable) signal that your server failed to deliver a response. They don't just affect individual pages: they tell Googlebot your entire server is unreliable.

Impact: Google explicitly confirms that a significant number of 5xx responses causes crawl rate to drop. For AI crawlers, the effect is more immediate: they have none of the retry logic or domain-level trust that Googlebot builds over time. An AI crawler hitting consistent 5xx errors abandons the path.

Quick fix: Check your server logs for the specific error code. 503 errors often indicate an overloaded server during peak traffic, which you can address with caching, load balancing, or CDN configuration. 500 errors point to application-level code failures, so check recent deployments.

Long-term approach: Monitor TTFB (Time To First Byte) continuously. Google recommends keeping server response time below 200 milliseconds, with 100ms as the ideal target. The GSC Crawl Stats report shows your average server response time, making it easy to spot degradation before it compounds into a crawl rate reduction.
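A small sketch for turning TTFB samples (pulled from server logs or a monitoring agent) into the thresholds above. The `ttfb_status` helper and its labels are illustrative assumptions, not a standard API.

```python
def ttfb_status(samples_ms):
    """Average TTFB samples and grade them against Google's guidance
    (<200 ms recommended; sustained 500 ms+ risks a crawl rate reduction)."""
    avg = sum(samples_ms) / len(samples_ms)
    if avg < 200:
        return avg, "healthy"
    if avg < 500:
        return avg, "degrading: investigate before crawl rate drops"
    return avg, "critical: expect reduced crawl rate"
```

Run this against a rolling window of samples so a single slow request doesn't trip the alert, but a sustained degradation does.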

Preventive measures: Set up uptime monitoring (Pingdom, Better Uptime, or similar) with alerting for 5xx spikes. Review the GSC Crawl Stats report alongside your error data weekly.

Resolving 404s: when to redirect vs. when to let die

Problem: A 404 Not Found response tells bots the page doesn't exist. Your job is determining which fix is appropriate, not whether to fix it at all.

Impact: 404s on pages with significant backlinks bleed link equity. 404s on previously high-traffic pages create dead ends in your crawl graph, reducing the number of reachable pages from a given starting point.

Implementation steps:

  1. 301 Redirect: Use when you have equivalent or highly related content on a different URL. A 301 passes link equity to the destination and is the correct signal when content has moved permanently.
  2. 410 Gone: Use when content is permanently removed and no replacement exists. A 410 tells Googlebot to deindex faster than a standard 404, which speeds up crawl budget recovery by removing the URL from the crawl queue.
  3. Leave as 404: Correct for pages that were never important and have no backlinks. Return a true 404 rather than redirecting to an unrelated page, since redirecting creates false content-equivalence signals.

Prioritization rule: Run your 404 list through a backlink checker. Prioritize 301 redirects for URLs with at least one referring domain. Leave the rest as clean 404s with a helpful custom 404 page.
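The decision rules above can be sketched as a small triage function. This is a hedged illustration: `triage_404` and its inputs are hypothetical, and the referring-domain counts would come from your backlink checker export.

```python
def triage_404(referring_domains: int, has_equivalent_page: bool,
               permanently_removed: bool = False) -> str:
    """Pick the right response for a dead URL, per the rules above."""
    if has_equivalent_page:
        return "301 redirect to the equivalent URL"   # content moved: pass equity
    if referring_domains >= 1:
        return "301 redirect to the closest related page"  # preserve link equity
    if permanently_removed:
        return "410 Gone"                             # deindexes faster than 404
    return "clean 404"                                # never important, no links
```

Map this over your exported 404 list and you get the redirect worklist and the leave-alone list in one pass.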

Preventive measures: Before migrating or deleting any page, check its backlink profile and internal link count. Add both redirect implementation and internal link updates to your page deletion workflow as required steps.

Fixing robots.txt blocks and sitemap discrepancies

Problem: A single Disallow: rule in robots.txt can block an entire section of your site from crawling. Unlike other crawl errors, this failure is silent: bots comply without showing an error in GSC for the blocked URLs.

Impact: Blocked resources are completely invisible to AI crawlers. AI crawling research confirms that if you see unexplained zero-crawl areas, a robots.txt misconfiguration is often the cause.

Quick fix: Open your robots.txt at yourdomain.com/robots.txt and verify it follows this structure:

# Example of a clean robots.txt
User-agent: *
Disallow: /admin/
Disallow: /checkout/
Allow: /

# Reference your sitemap
Sitemap: https://www.yourdomain.com/sitemap.xml

Implementation steps:

  1. Reference your XML sitemap as an absolute URL within robots.txt. Google, Bing, and other engines use this to discover all canonical URLs.
  2. Never Disallow CSS or JavaScript files required for page rendering. Blocking these prevents Googlebot from understanding your page layout.
  3. Test every Disallow rule with the robots.txt report in GSC (which replaced the standalone robots.txt Tester) before deploying to production.
  4. Your XML sitemap should contain only canonical, 200-status, indexable URLs. Remove any redirected, noindexed, or 404 URLs. Google's limit is 50,000 URLs or 50MB per sitemap file.

Preventive measures: Add robots.txt validation to your CI/CD pipeline so any deployment modifying this file triggers an automated test against a list of must-crawl URLs before it goes live.
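One way to sketch that CI check with Python's standard-library urllib.robotparser. The MUST_CRAWL paths here are placeholder examples you'd replace with your own priority URLs.

```python
from urllib import robotparser

# Hypothetical must-crawl paths; replace with your own priority URL list
MUST_CRAWL = ["/", "/pricing", "/docs/getting-started"]

def validate_robots(robots_txt: str, must_crawl=MUST_CRAWL):
    """Return the must-crawl paths a Googlebot-style crawler would be
    blocked from under the given robots.txt content."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [p for p in must_crawl if not rp.can_fetch("Googlebot", p)]
```

In CI, fail the deployment whenever this returns a non-empty list, so an accidental broad Disallow never reaches production.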


Optimizing crawl budget for indexation speed

Managing parameters, pagination, and duplicate content

URL parameters drain more crawl budget than any other single factor on SaaS sites. Session IDs, sort orders, filter combinations, and tracking parameters (?utm_source=, ?ref=) multiply your URL count by hundreds without adding unique content.

Fix this with canonical tags pointing parameter variants back to the preferred URL. For faceted navigation (common in SaaS product directories), use Disallow in robots.txt to block specific parameter patterns, being precise rather than broad. For pagination, ensure each paginated page canonicalizes to itself rather than competing with the root page.
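As an illustration of parameter handling, a small sketch that collapses tracking-parameter variants onto a single canonical URL. The `TRACKING` set and `canonical_url` helper are assumptions; extend the set to match your analytics stack. Note it also drops fragments, which crawlers ignore anyway.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative tracking parameters; extend for your stack
TRACKING = {"utm_source", "utm_medium", "utm_campaign", "ref", "sessionid"}

def canonical_url(url: str) -> str:
    """Drop tracking params so parameter variants collapse to one URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in TRACKING]
    # Rebuild without tracking params and without the fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

Meaningful parameters (like a sort order you want indexed separately) survive; pure tracking noise collapses, which is exactly what your canonical tags should express.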

Handling JavaScript rendering for modern crawlers

JavaScript rendering is where most B2B SaaS sites lose their AI visibility, and the fix requires architectural changes, not quick patches.

Googlebot processes JavaScript in two waves: it crawls the initial HTML first, then returns to render JavaScript when resources are available. Because of this delay, JavaScript-dependent content can take significantly longer to be indexed, and in some cases critical content rendered client-side gets missed entirely.

For AI crawlers, there is no second wave. GPTBot, ClaudeBot, and Perplexity's crawler do not execute JavaScript. Vercel and MERJ tracked over half a billion GPTBot fetches and found zero evidence of JavaScript execution, even when GPTBot downloaded JS files. ClaudeBot, Meta's ExternalAgent, and ByteDance's Bytespider behave identically.

Any content your buyers would encounter in an AI answer that currently renders client-side is invisible to AI retrieval systems. This includes product feature descriptions, dynamic FAQ sections, pricing tables built in React, and documentation loaded via API.

Implementation steps:

  1. Audit which content on your key commercial and product pages loads via client-side JavaScript vs. appearing in the initial HTML response.
  2. Move critical answer content (what you do, who you serve, key differentiators) into the static HTML using server-side rendering (SSR) or static site generation.
  3. For React/Next.js stacks, enabling SSR for product and landing pages is the highest-ROI technical change for AI visibility.

Test what AI crawlers see by running this command:

curl -A "GPTBot" https://yoursite.com/your-page

What's in that raw HTML is all they get.
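You can script the same check: feed the raw HTML from that curl command to a helper like the sketch below (the `missing_from_static_html` name is hypothetical) and list the phrases an AI crawler would never see.

```python
def missing_from_static_html(raw_html: str, required_phrases) -> list:
    """Given the raw (un-rendered) HTML of a page, return the key phrases
    that never appear in it -- i.e., content invisible to non-JS crawlers."""
    text = raw_html.lower()
    return [p for p in required_phrases if p.lower() not in text]
```

Run it against your key commercial pages with the phrases you'd expect an AI answer to quote (product name, category, differentiators); any phrase it returns needs to move into the server-rendered HTML.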

This connects directly to the GEO vs. SEO distinction: what works for Google rankings doesn't automatically translate to AI citation. SSR is one of the clearest technical illustrations of that gap.

Mobile-friendliness and mobile-first indexing

Google uses your mobile site as the primary version for indexing and ranking. If your mobile experience has significantly less content than desktop (collapsed sections, hidden tabs, deferred images), Google indexes a stripped-down version. This reduces content available for AI ingestion.

Audit mobile usability with Lighthouse or Chrome DevTools (GSC's standalone Mobile Usability report has been retired) to catch pages failing mobile requirements. Fix viewport configuration, font sizing, and tap target spacing at the template level, not page-by-page.


The AI difference: structuring for LLM retrieval

Making your site crawlable is necessary but not sufficient for AI citations. The CITABLE framework identifies the technical foundation as "C - Clear entity and structure." Your content structure must work for RAG systems, not just traditional indexing.

RAG systems retrieve content in passages, not whole pages. They look for discrete, well-formed HTML blocks that answer a specific question, and they need those blocks in clean, accessible HTML from the initial page load.

Structural requirements for LLM retrieval:

  • Use standard HTML tags (<p>, <li>, <table>, <h2>, <h3>) for all answer content. Never hide key information behind JavaScript-loaded tabs or accordion elements requiring user interaction.
  • Open every major section with a BLUF (Bottom Line Up Front) sentence in a <p> tag that directly answers the section's question. This is the passage a RAG system is most likely to extract.
  • Keep answer blocks between 200 and 400 words. Shorter blocks are easier for retrieval systems to match to a specific query.
  • Include FAQ schema on pages that directly answer buyer questions. Structured data tells AI systems what a question-answer pair looks like, making extraction more reliable.
  • Ensure your entity data (company name, product names, integrations, pricing range, customer types) appears consistently in static HTML across your most important pages.
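For the FAQ schema point above, a sketch that generates FAQPage JSON-LD from question-answer pairs. The `faq_jsonld` helper is an illustrative assumption; its output belongs inside a script tag with type application/ld+json on the page.

```python
import json

def faq_jsonld(pairs):
    """Build FAQPage structured data (schema.org) from (question, answer)
    string pairs, returned as a JSON-LD string."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {"@type": "Question", "name": q,
             "acceptedAnswer": {"@type": "Answer", "text": a}}
            for q, a in pairs
        ],
    }, indent=2)
```

Because the markup is generated from the same source as the visible Q&A content, the schema and the on-page answers can't drift apart, which is a requirement for the markup to remain valid.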

Our B2B SaaS case study on 6x AI-referred trials shows how technical accessibility and content structure work together to produce measurable citation rates. For context on why traditional SEO agencies miss this entirely, the 7 mistakes SEO agencies make on AI visibility is a useful companion read.

Our AI Visibility Audit at Discovered Labs starts by mapping which of your key product and category pages AI crawlers can actually reach and parse. We cross-reference GSC error data with live AI bot behavior to identify gaps that standard SEO tools miss, particularly the JavaScript rendering gap that silently blocks entire content sections from AI retrieval. You get a prioritized fix list tied to citation potential, not just Google rankings.


Maintenance checklist: keeping your site crawl-healthy

Crawl health isn't a one-time fix. New pages, CMS updates, redirect accumulation, and JavaScript framework changes create technical debt continuously. Build these checks into your regular workflow:

Monthly checks:

  • Review the GSC Page Indexing report for new error spikes
  • Run a full site crawl to catch new 404s, chains, and orphaned pages
  • Verify robots.txt for unintended changes (especially after CMS or platform updates)
  • Check server response times in the GSC Crawl Stats report and investigate any TTFB above 500ms
  • Validate that your XML sitemap contains only canonical, 200-status URLs
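The sitemap check in that last bullet is easy to automate once you have a crawl export. A sketch, assuming records of (URL, status code, canonical target, noindex flag); `sitemap_offenders` is a hypothetical helper.

```python
def sitemap_offenders(entries):
    """Flag sitemap entries that violate the only-canonical-200 rule.

    entries: iterable of (url, status, canonical, noindex) tuples,
    e.g. exported from a Screaming Frog crawl of your sitemap URLs.
    """
    bad = []
    for url, status, canonical, noindex in entries:
        if status != 200:
            bad.append((url, f"status {status}"))
        elif noindex:
            bad.append((url, "noindex"))
        elif canonical and canonical != url:
            bad.append((url, f"canonicalizes to {canonical}"))
    return bad
```

An empty result means every sitemap entry is a canonical, 200-status, indexable URL; anything else is a URL to remove or fix before the next crawl cycle.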

Quarterly checks:

  • Audit JavaScript-rendered content on key commercial pages for AI crawler visibility
  • Review internal linking to ensure priority pages have at least 3-5 inbound internal links
  • Check mobile usability for new template changes
  • Test your top 10 URLs with a raw HTTP fetch to confirm static HTML completeness

Set up email alerts in GSC settings for "New errors in Page Indexing." Getting an alert the day a 5xx spike begins beats discovering it three weeks later in a monthly review.

For ongoing AI visibility monitoring alongside crawl health, the best tools to monitor your brand in AI answers gives you a practical stack that complements GSC data with citation tracking across ChatGPT, Claude, and Perplexity.


Where to go from here

Technical crawlability is the infrastructure layer of everything else you build. A 404 on a key product page means you lose a citation from ChatGPT. A JavaScript-only FAQ section stays invisible to every AI crawler your buyers interact with. A misconfigured robots.txt locks your front door when prospects come knocking.

Fix the technical foundation first, then invest in content structure and third-party validation. The brands that show up in AI answers are the ones that have cleared both hurdles. For a broader view on how B2B SaaS companies get recommended by AI search engines, including both the technical and content dimensions, that guide covers the full picture.

If you want to know whether your technical fixes have translated to actual AI citation rates, request an AI Visibility Audit from Discovered Labs. We'll show you exactly where your site stands across ChatGPT, Claude, and Perplexity, how your citation rate compares to your top three competitors, and which technical and content gaps to address first for results within 30 days.


FAQs

How often does Googlebot crawl my site?
There's no fixed schedule: crawl frequency ranges from multiple times per day for high-authority sites to once every several weeks for smaller ones, depending on authority, update frequency, and server health. Check your site's crawl frequency in the Crawl Stats report in Google Search Console, which shows the average pages crawled per day.

What is the difference between a soft 404 and a 404?
A hard 404 returns an HTTP status code of 404 Not Found, correctly signaling to bots that the page doesn't exist. A soft 404 returns a 200 OK status while displaying error content or an empty page, so bots keep returning and consuming crawl budget on a page that delivers no value.

Do 301 redirect chains hurt SEO?
Yes, redirect chains add latency to each request and consume crawl budget, and each additional hop can result in some link equity loss across the chain. Fix chains by updating every redirect in the sequence to point directly to the final destination URL.

How does site speed affect crawl budget?
Server response time directly determines how many requests Googlebot sends per day: Google recommends a TTFB below 200ms, and servers responding at 500ms or above cause Googlebot to reduce crawl rate. For AI crawlers, slow response times and timeouts cause them to abandon the crawl path entirely with no retry.


Key terms glossary

Crawl budget: The set of URLs that Googlebot can and wants to crawl within a given time period, determined by crawl capacity (server speed and health) and crawl demand (content popularity and freshness). Relevant for any site where the speed of indexation affects competitive positioning.

Soft 404: A page that returns a 200 OK HTTP status but displays error content or no content at all. Google identifies these as pages that should return a 404, and they waste crawl budget by being repeatedly fetched without delivering value.

Orphaned page: A page that exists on your site but has no internal links pointing to it. Bots that navigate by following links (including AI crawlers) cannot discover orphaned pages unless they appear in a sitemap, making them effectively invisible from an organic crawl path.

RAG (Retrieval-Augmented Generation): The system architecture AI answer engines like ChatGPT and Perplexity use to retrieve external content and incorporate it into generated responses. If a page cannot be crawled or its content is hidden in JavaScript, it cannot contribute to RAG retrieval and will not appear in AI-generated answers.

