Updated February 27, 2026
TL;DR: A technical SEO audit checks whether bots can access, render, and understand your site. Fix crawlability and indexation first, because content optimization is wasted if bots cannot reach your pages. Work through site architecture, Core Web Vitals, and structured data in that order. Use the ICE framework to prioritize by revenue impact, not generic warning counts. And in 2026, blocking AI crawlers like GPTBot or ClaudeBot is effectively blocking your own pipeline:
47% of B2B buyers now use AI specifically for vendor discovery and shortlisting.
Your site might rank on page one of Google for 40 target keywords and still be completely invisible to ChatGPT, Claude, and Perplexity when a prospect asks for a vendor recommendation. That gap is almost always a technical problem, not a content problem. Technical debt, blocked AI bots, poorly structured pages, and missing schema mean your content never gets retrieved, no matter how well-written it is.
This guide walks you through a complete four-phase technical SEO audit built for the dual-engine reality of 2026: Google's crawler and the AI retrieval systems reshaping B2B buying. Each phase includes a prioritized checklist, specific fixes, and the diagnostic tools you need to execute the work immediately.
Why technical SEO is the foundation of AI visibility
Search behavior in B2B has shifted faster than most teams anticipated. 89% of B2B buyers have adopted generative AI, and nearly half now use it specifically for market research and vendor discovery, with Forrester projecting AI-generated traffic to exceed 20% of total organic traffic by the end of 2025. The implication is direct: if AI systems cannot crawl, render, or understand your pages, your brand does not appear in those answers.
Traditional SEO optimized for one crawler (Googlebot) and one goal (SERP ranking). The modern audit must account for a second audience: LLM retrieval systems that pull passage-level content from your pages to synthesize answers.
Think of it this way. Google ranks pages. AI systems retrieve passages. Two very different mechanics, but the same prerequisite: the bot must be able to reach and read the page in the first place.
Technical errors that frustrate this process include:
- Blocked crawlers: robots.txt rules that accidentally block AI bots like GPTBot or ClaudeBot, preventing citation entirely.
- JavaScript rendering failures: Content that only loads after client-side JS execution is invisible to bots that do not fully render pages.
- Missing entity markup: Pages with no schema give AI systems no structured signal about what the content is, who published it, or how to attribute it.
- Dirty sitemaps: Sitemaps containing 404s or redirects waste crawl budget and reduce indexation confidence across both traditional and AI crawlers.
Our competitive technical SEO audit guide covers how to benchmark your infrastructure against competitors specifically for AEO gaps. This pillar gives you the full execution framework to act on those findings.
The essential audit toolkit
You do not need a dozen tools. You need the right four or five, each covering a distinct diagnostic layer.
| Tool | Best for | Price | Key features for this audit |
|---|---|---|---|
| Screaming Frog SEO Spider | Deep site crawls, custom extraction | £199/year | Structured data validation, crawl depth mapping, List Mode for sitemaps |
| Google Search Console | Google's live view of crawl and index status | Free | Core Web Vitals field data, Page Indexing report, URL Inspection |
| Ahrefs Site Audit | All-in-one with backlink context | $129+/month | Crawl comparison, orphan page detection, redirect chains |
| Sitebulb | Visual crawl reports, prioritized issue flags | $35+/month | Hint-based fix priority, rendering snapshot comparison |
| Google Rich Results Test | Schema validation and rich result eligibility | Free | Entity recognition confirmation, structured data error flagging |
Start with Google Search Console because it reflects Google's actual view of your site, not a simulation. Then use Screaming Frog for deeper on-page analysis. The Rich Results Test is non-negotiable for the schema work in Phase 4.
Phase 1: Crawlability and indexation
You must fix crawlability before touching anything else. If a bot cannot reach a page, no optimization on that page matters. Resolve every issue in Phase 1 before moving to subsequent phases.
How to identify and fix crawl errors
Open GSC and navigate to Indexing > Pages. This report shows you the full split between indexed and non-indexed pages, along with the reasons Google excluded each URL.
The three error types to prioritize:
- 404 (Not Found): A 404 on a high-value page (pricing, product, or solution pages) directly threatens revenue. Google's HTTP status code documentation confirms persistent 404s get removed from the index. Crawl your site with Screaming Frog, filter for 4xx responses, and check the "Inlinks" tab to identify which pages are sending users and bots to dead ends. Fix by either restoring the content or implementing a 301 redirect to the closest relevant live page.
- 5xx (Server Errors): These are the highest-urgency errors in any audit. Server errors tell Google your site is unreliable, which reduces your crawl budget and can result in page removal from the index within days. Address server stability, check CDN configuration, and monitor via GSC's Server Errors filter. Treat any 5xx on a revenue-critical page as an emergency requiring same-day resolution.
- Soft 404s: A soft 404 sends a 200 status code but displays no meaningful content. They waste crawl budget and can confuse search engines because the page appears "found" but contains nothing useful. Use Screaming Frog to flag pages with thin content (under 200 words) returning 200 status codes, then either restore real content, return an accurate 404, or implement a 301 redirect.
Checklist: Crawl errors
- Pull the Page Indexing report from GSC and export all excluded URLs
- Crawl the site with Screaming Frog and filter for 4xx and 5xx responses
- Identify 404s on pages with inbound internal links and fix with 301s or content restoration
- Resolve all 5xx errors immediately (within 24 hours for revenue-critical pages)
- Flag soft 404s using content-length filters and rebuild, redirect, or return accurate 404s
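The triage logic above can be sketched as a small script. This is an illustrative sketch, not part of any tool's API: the function name, thresholds, and sample crawl data are assumptions, with the 200-word soft-404 floor taken from the guidance above.

```python
# Classify crawl results into audit actions: 5xx = emergency, 404 = restore or
# redirect, 200 with thin content = likely soft 404, healthy 200 = OK.

def triage_url(status_code: int, word_count: int, thin_threshold: int = 200) -> str:
    """Return the recommended audit action for a crawled URL."""
    if 500 <= status_code <= 599:
        return "fix-server-error"   # same-day fix on revenue-critical pages
    if status_code == 404:
        return "restore-or-301"     # restore content or 301 to closest live page
    if status_code == 200 and word_count < thin_threshold:
        return "soft-404"           # returns 200 but carries no meaningful content
    if status_code == 200:
        return "ok"
    return "review"                 # 3xx and anything else: inspect manually

# Hypothetical rows exported from a crawl: (URL, status code, word count)
crawl_results = [
    ("/pricing", 503, 0),
    ("/old-feature", 404, 0),
    ("/tag/misc", 200, 45),
    ("/blog/guide", 200, 1800),
]
for url, status, words in crawl_results:
    print(url, "->", triage_url(status, words))
```

Feed it the status codes and word counts from a Screaming Frog export to get a first-pass action list before manual review.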
Optimizing robots.txt for search and AI bots
Your robots.txt file is where a critical and often overlooked mistake lives: accidentally blocking the AI crawlers that determine whether your brand appears in AI-generated answers. For a B2B SaaS company that wants AI citations, the correct approach is to allow all major AI bots access to your public-facing content.
The major AI crawler agents you need to know and explicitly allow are:
| Crawler | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training and search citations |
| OAI-SearchBot | OpenAI | ChatGPT search functionality |
| ChatGPT-User | OpenAI | User-triggered browsing |
| ClaudeBot | Anthropic | Chat citation fetch |
| anthropic-ai | Anthropic | Model training |
| PerplexityBot | Perplexity | Index building |
| Google-Extended | Google | Gemini and Vertex AI training |
| CCBot | Common Crawl (open web archive) | Dataset collection |
You can verify the official GPTBot user-agent string directly from OpenAI's documentation: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0; +https://openai.com/gptbot.
The correct robots.txt syntax, per Google's official robots.txt specification, follows this pattern:
```
# Allow Googlebot full access
User-agent: Googlebot
Allow: /

# Allow OpenAI crawlers
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Allow Anthropic crawlers
User-agent: ClaudeBot
Allow: /

# Allow Perplexity
User-agent: PerplexityBot
Allow: /

# Allow Google AI training
User-agent: Google-Extended
Allow: /
```
You can still restrict specific sections if needed, such as blocking all bots from /admin/ or staging environments, while keeping all public-facing content fully accessible to AI crawlers.
Checklist: robots.txt
- Audit your current robots.txt and verify each user-agent directive
- Confirm GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot, and Google-Extended are not blocked
- Add explicit Allow: / rules for each AI bot if none currently exist
- Restrict only non-public paths (admin panels, staging) for all bots
- Use GSC's robots.txt report to verify your live file behaves as expected
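You can also spot-check bot access programmatically with Python's standard-library robots.txt parser. A minimal sketch, assuming an in-memory robots.txt string and a placeholder domain for illustration; in practice, point the parser at your live /robots.txt.

```python
# Check whether the major AI crawlers can fetch a given URL under your
# robots.txt rules, using the stdlib urllib.robotparser.
from urllib import robotparser

# Example rules: everyone is blocked from /admin/, GPTBot gets an explicit allow.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Allow: /
"""

AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in AI_BOTS:
    allowed = parser.can_fetch(bot, "https://yoursite.com/pricing")
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

Bots without their own User-agent group fall back to the `*` rules, so a stray `Disallow: /` under `User-agent: *` blocks every AI crawler at once.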
Managing XML sitemaps and crawl budget
A clean XML sitemap acts as a direct crawl signal. A dirty one, containing 404s, redirect chains, or noindexed URLs, signals low site quality and wastes crawl budget across both Google and AI crawlers.
As Screaming Frog's sitemap audit guide explains, any URL returning a non-200 status code in your sitemap should be removed. The process:
- Download your XML sitemap files and load the URL list into Screaming Frog's List Mode
- Filter for anything that is not a clean 200 status code
- Remove 404s, 301/302 redirects, noindexed pages, and canonicalized URLs pointing elsewhere
- Ensure only canonical, indexable, live pages appear in the final sitemap
Sites with large content libraries should consider splitting sitemaps by content type (blog posts, product pages, help articles) and updating each on a rolling schedule to keep crawl signals fresh.
Checklist: XML sitemaps
- Download all sitemap files referenced in robots.txt
- Crawl sitemap URLs in Screaming Frog List Mode and filter for non-200 responses
- Remove all non-indexable, redirected, or erroring URLs from the sitemap
- Confirm the sitemap is submitted in GSC and shows a "Success" status
- Set a monthly sitemap review cadence to catch newly broken or redirected pages
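The first step of the sitemap check, pulling the URL list out of the XML, can be scripted with the standard library. A sketch using an in-memory sample sitemap and a placeholder domain; fetch your live sitemap.xml in practice, then status-check each URL (or hand the list to Screaming Frog List Mode).

```python
# Extract the <loc> URLs from an XML sitemap so they can be status-checked.
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yoursite.com/</loc></url>
  <url><loc>https://yoursite.com/pricing</loc></url>
  <url><loc>https://yoursite.com/old-page</loc></url>
</urlset>"""

# Sitemap elements live in the sitemaps.org namespace, so queries must use it.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP_XML)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(f"{len(urls)} URLs found")
```

Any URL in that list returning a redirect or error should be removed from the sitemap rather than left for crawlers to discover.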
Phase 2: Site architecture and internal linking
Once bots can access your pages, they need to discover them efficiently. Site architecture determines how link equity flows through the site and how quickly bots reach every important asset.
Flattening site structure for better discovery
A flat architecture means any page is reachable within three clicks from the homepage. This matters for both Google and AI retrieval systems. Deep page hierarchies bury content, reduce link equity distribution, and create long crawl paths that bots may abandon before reaching your most important pages.
For B2B SaaS, the highest-value pages (pricing, solution pages, comparison pages) should sit no more than one level below the homepage. Blog posts and help articles can sit at two levels. Any content sitting at four or more levels deep receives less crawl priority and should be restructured or elevated with internal links from higher-authority pages to reduce depth to level 2 or 3.
Checklist: Site structure
- Crawl the site and check the "Crawl Depth" column in Screaming Frog for all indexable URLs
- Flag any revenue-critical pages sitting deeper than level 3
- Restructure URL hierarchies or add contextual internal links to reduce depth
- Confirm primary navigation includes direct links to your highest-priority pages
Fixing orphan pages and broken links
Orphan pages sit isolated with no internal links pointing to them from the rest of your site. They may still be indexed if they appeared in your sitemap historically, but without internal links they receive no PageRank and represent missed context for AI systems trying to build a coherent picture of your content.
To find orphan pages: crawl your site with Screaming Frog, then in the "Internal" tab filter for a blank "Crawl Depth" value. Screaming Frog assigns no crawl depth to URLs it cannot discover through internal links during the crawl, making them orphans. The fix is straightforward: add contextual internal links from related, high-authority pages. If an orphan page has no natural parent in your content structure, it is likely a candidate for consolidation or redirection.
Checklist: Internal linking
- Identify orphan pages via Screaming Frog with the blank crawl depth filter
- Add at minimum two internal links from contextually relevant pages to each orphan
- Run a broken links report (filter for 4xx in Screaming Frog's "Links" export) and fix or redirect each instance
- Review anchor text distribution to confirm descriptive, varied anchor text throughout
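Crawl depth and orphan detection are a breadth-first search over the internal-link graph, which is what Screaming Frog does under the hood. A minimal sketch with a hand-made example graph; in practice the graph comes from a crawl export, and the sitemap supplies the full page inventory.

```python
# BFS from the homepage over an internal-link graph: pages never reached
# are orphans, and the BFS level is each page's crawl depth.
from collections import deque

links = {  # page -> pages it links to (illustrative data)
    "/": ["/pricing", "/blog"],
    "/pricing": ["/demo"],
    "/blog": ["/blog/guide"],
    "/blog/guide": [],
    "/demo": [],
}
# Full inventory, e.g. from the sitemap; includes a page nothing links to.
all_pages = set(links) | {"/orphan-case-study"}

depth = {"/": 0}
queue = deque(["/"])
while queue:
    page = queue.popleft()
    for target in links.get(page, []):
        if target not in depth:          # first discovery sets the depth
            depth[target] = depth[page] + 1
            queue.append(target)

orphans = all_pages - depth.keys()
print("depths:", depth)
print("orphans:", orphans)
```

Pages at depth 4 or more, and anything in the orphan set, are the candidates for new internal links or consolidation.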
Phase 3: Core Web Vitals and page experience
Google uses Core Web Vitals as page experience signals, and the same performance characteristics affect how reliably bots parse your pages. Pages that load slowly or block rendering can be partially or incorrectly parsed by crawlers, including AI retrieval systems that may time out before a page fully loads.
Improving LCP, INP, and CLS
Google evaluates Core Web Vitals at the 75th percentile of page visits, so at least 75% of visits must meet these "Good" benchmarks, as confirmed by web.dev's Vitals documentation:
| Metric | Good | Needs improvement | Poor |
|---|---|---|---|
| Largest Contentful Paint (LCP) | ≤2.5s | 2.5s - 4.0s | >4.0s |
| Interaction to Next Paint (INP) | ≤200ms | 200ms - 500ms | >500ms |
| Cumulative Layout Shift (CLS) | ≤0.1 | 0.1 - 0.25 | >0.25 |
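The thresholds above translate directly into a rating function, a sketch of the same bucketing logic PageSpeed Insights applies to field data (the function and its names are illustrative, not an official API):

```python
# Rate Core Web Vitals field values against the published thresholds.
# Units match the table: seconds for LCP, milliseconds for INP,
# a unitless score for CLS.

THRESHOLDS = {           # metric: (good_max, needs_improvement_max)
    "LCP": (2.5, 4.0),   # seconds
    "INP": (200, 500),   # milliseconds
    "CLS": (0.1, 0.25),  # layout-shift score
}

def rate(metric: str, value: float) -> str:
    good_max, ni_max = THRESHOLDS[metric]
    if value <= good_max:
        return "good"
    if value <= ni_max:
        return "needs improvement"
    return "poor"

print(rate("LCP", 2.1))   # good
print(rate("INP", 350))   # needs improvement
print(rate("CLS", 0.3))   # poor
```

Apply it to the 75th-percentile values from GSC or the CrUX dataset; a page counts as passing only when all three metrics rate "good".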
High-impact fixes for each metric:
LCP (Largest Contentful Paint):
According to the 2026 Core Web Vitals optimization guide from Digital Applied, these four changes consistently move LCP scores the most:
- Add fetchpriority="high" to your hero image <img> tag to preload it early in the request chain
- Convert images to WebP format and implement responsive srcset attributes
- Reduce time to first byte (TTFB, the delay before your server sends the first piece of data) with caching layers and a CDN
- Eliminate render-blocking CSS and JavaScript in the <head> by inlining critical CSS and deferring non-critical scripts
INP (Interaction to Next Paint):
- Break up long JavaScript tasks using scheduler.yield() to keep the main thread responsive
- Defer non-critical third-party scripts (chat widgets, analytics tags) until after first paint
- Apply code splitting so only the JavaScript needed for the current page loads on initial render
CLS (Cumulative Layout Shift):
As Wallaroo Media's Core Web Vitals guide explains, nearly all layout shift comes from three sources you can fix at the template level: images and videos without explicit dimensions, custom fonts with invisible-text intervals during load, and dynamic content (ads, lazy-loaded components) that pushes existing content down when it appears.
- Add explicit width and height attributes to all images, videos, and iframes
- Use font-display: swap for custom fonts to prevent invisible-text intervals during load
- Reserve space in CSS for dynamic content, ad slots, and lazy-loaded components before they appear
Checklist: Core Web Vitals
- Pull CWV field data from GSC (Experience > Core Web Vitals) for both mobile and desktop
- Use PageSpeed Insights to identify the LCP element and verify its loading priority
- Run Lighthouse on your top 10 revenue-driving pages and export the audit report
- Apply fetchpriority="high" to hero images and preload critical fonts sitewide
- Audit third-party scripts and defer or remove those with low business value
- Add explicit dimensions to all images and iframes across page templates
Mobile usability and rendering
For B2B SaaS companies running React, Angular, or Vue applications, client-side rendering (CSR, where JavaScript builds the page in the browser rather than on the server) is the most common source of AI visibility failure. Pages that require JavaScript execution to render content return an empty HTML skeleton to bots that do not run a full browser, and many AI crawlers fall into this category.
The solution is server-side rendering (SSR) or dynamic rendering: serving pre-rendered HTML to bots while the JavaScript application handles the user experience in the browser. This is the single highest-impact fix for teams running CSR-heavy frameworks.
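The routing decision behind dynamic rendering reduces to a user-agent check: known bots get pre-rendered HTML, browsers get the client-side app. A minimal sketch under stated assumptions; the bot token list is illustrative and incomplete, and the function would sit inside your actual server or edge middleware rather than stand alone.

```python
# Decide whether a request should receive pre-rendered HTML based on its
# User-Agent header (the core of a dynamic-rendering setup).

BOT_TOKENS = (  # lowercase substrings of known crawler user agents
    "googlebot", "gptbot", "oai-searchbot", "claudebot",
    "perplexitybot", "bingbot",
)

def wants_prerendered(user_agent: str) -> bool:
    """True if the request should be routed to the pre-rendered HTML."""
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)

print(wants_prerendered("Mozilla/5.0 AppleWebKit/537.36; compatible; GPTBot/1.0"))
print(wants_prerendered("Mozilla/5.0 (Macintosh) Safari/605.1.15"))
```

Note that Google treats dynamic rendering as a workaround rather than a long-term solution; full SSR or static generation avoids maintaining two render paths.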
Additionally, confirm mobile usability by:
- Running Lighthouse's mobile audit on key landing pages (GSC retired its standalone Mobile Usability report in late 2023)
- Testing key landing pages using Chrome DevTools' device emulation to confirm text is legible without horizontal scrolling
- Ensuring tap targets (buttons, links) are at minimum 48x48px to pass Google's mobile-friendliness evaluation
Checklist: Mobile and rendering
- Confirm whether your site uses CSR, SSR, or static site generation (SSG)
- If CSR is in use, implement SSR or dynamic rendering for at minimum your product, pricing, and solution pages
- Run Lighthouse mobile audits for usability errors and prioritize fixes on revenue-critical templates
- Verify that <meta name="viewport"> is present on every page template
- Test rendering of key pages using GSC's URL Inspection tool (which fetches as Googlebot) and compare the rendered screenshot to the live browser view
Phase 4: Structured data and entity markup
Structured data is the single biggest differentiator between sites that get cited by AI systems and sites that do not. Schema markup creates machine-readable definitions of what your content is about, who published it, and what relationships exist between concepts. AI retrieval systems use these signals to extract self-contained, credible content blocks for citation.
This work maps directly to two components of Discovered Labs' CITABLE framework: "E" (Entity graph and schema), which focuses on defining explicit entity relationships in copy and markup, and "B" (Block-structured for RAG), which organizes content into clean, citable passages of 200-400 words using tables, ordered lists, and FAQs that retrieval systems can extract and surface.
Implementing Schema.org for B2B SaaS
Organization schema is foundational for any B2B SaaS site. It defines your company's identity, contact information, social profiles, and logo in a machine-readable format, giving AI models a reliable source of truth about who you are. Implement it sitewide in the <head> of your homepage and global template.
Priority schema types for B2B SaaS:
- Organization: Include name, url, logo, sameAs (LinkedIn, G2, Crunchbase profiles), and contactPoint. This establishes your entity identity across all AI knowledge bases.
- SoftwareApplication or WebApplication: Apply to product pages. As Dan Taylor's SaaS schema guide notes, WebApplication includes browserRequirements, which distinguishes cloud-based SaaS from installed software. Include offers (pricing tiers), applicationCategory, and aggregateRating if reviews are present.
- FAQPage: Apply to solution pages, comparison pages, and blog posts containing Q&A sections. FAQPage schema increases SERP real estate and gives AI systems clearly structured question-answer pairs to extract directly as cited passages.
- HowTo: Apply to guides and tutorial content. Each step should include name, text, and optionally image. This directly supports passage retrieval because each step functions as a discrete, citable unit.
- Article or BlogPosting: Apply to all editorial content. Include author, datePublished, dateModified, publisher, and headline. The dateModified field is particularly important for AI systems that weight recency when selecting citation sources.
Place the JSON-LD script in the <head> section of your homepage template, after the opening <head> tag and before any other schema markup. Here is a minimum viable example for Organization schema:
```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Your Company Name",
  "url": "https://yoursite.com",
  "logo": "https://yoursite.com/logo.png",
  "sameAs": [
    "https://linkedin.com/company/yourcompany",
    "https://g2.com/products/yourcompany"
  ],
  "contactPoint": {
    "@type": "ContactPoint",
    "contactType": "sales",
    "email": "sales@yoursite.com"
  }
}
```
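Before shipping Organization markup, it is worth a quick programmatic completeness check. A sketch, not a substitute for Google's Rich Results Test: the function name and property list are assumptions drawn from the guidance in this section.

```python
# Confirm an Organization JSON-LD block parses and carries the recommended
# properties before it goes into the template <head>.
import json

RECOMMENDED = {"@context", "@type", "name", "url", "logo", "sameAs", "contactPoint"}

def check_organization(jsonld: str) -> set:
    """Return the recommended properties missing from an Organization block."""
    data = json.loads(jsonld)  # raises ValueError if the JSON is malformed
    if data.get("@type") != "Organization":
        raise ValueError("not an Organization block")
    return RECOMMENDED - data.keys()

# An intentionally incomplete example block.
snippet = '{"@context": "https://schema.org", "@type": "Organization", "name": "X", "url": "https://x.com"}'
print(sorted(check_organization(snippet)))  # ['contactPoint', 'logo', 'sameAs']
```

Run it in CI against rendered page templates to catch a schema block that a redesign silently broke.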
Checklist: Structured data
- Implement Organization schema sitewide on the homepage and global template <head>
- Add SoftwareApplication or WebApplication schema to all product and feature pages
- Add FAQPage schema to any page containing Q&A sections
- Add HowTo schema to all guide and tutorial content
- Add Article/BlogPosting schema with dateModified to all editorial content
- Confirm all schema includes author and publisher entities for authorship signals
Testing rich results and entity recognition
After implementation, validate every schema type using Google's Rich Results Test. Enter your URL and confirm:
- No errors (red items block rich result eligibility and reduce AI confidence signals)
- No warnings (yellow items reduce the completeness of entity definitions)
- All required properties are present for each schema type
Run the test on a representative sample: homepage, one product page, one blog post, and one FAQ page. Then check GSC under Enhancements to see sitewide schema coverage and any manual actions against your structured data.
Checklist: Schema testing
- Run Rich Results Test on the homepage, product pages, and top-traffic blog posts
- Resolve all errors and address high-priority warnings
- Check GSC Enhancements tab for sitewide schema coverage and error counts
- Re-test after any template changes that could break schema inheritance
- Schedule schema validation as part of any site redesign or CMS migration QA process
How to prioritize audit findings for business impact
Every technical audit produces a long list of issues. Most in-house teams cannot fix everything at once, and not every fix moves the revenue needle. The only rational approach is prioritization by impact, not by issue count or alphabetical order of audit warnings.
The ICE framework for technical SEO
The ICE framework scores each technical fix on three dimensions. As Novos' ICE scoring guide for SEO explains, you score each issue on:
- Impact (1-10): How much will fixing this issue improve revenue-relevant outcomes? Consider pages affected, potential traffic uplift, and pipeline impact.
- Confidence (1-10): How certain are you that the fix will produce the predicted result, based on historical data or benchmarks?
- Ease (1-10): How easy is implementation? Score 10 for a one-line robots.txt change, score 1 for a full SSR migration.
The formula multiplies the three scores:
ICE Score = Impact × Confidence × Ease
Because Ease is scored high for easy fixes, multiplying all three terms pushes quick, high-impact, high-confidence wins to the top of the queue. Use the scores alongside your sprint capacity to make the final call: a high-impact, low-ease fix may still belong in the next planning cycle rather than the current one.
Example scoring applied to common technical issues:
| Issue | Impact | Confidence | Ease | Priority tier |
|---|---|---|---|---|
| GPTBot blocked in robots.txt | 10 | 9 | 9 | Immediate |
| 5xx errors on pricing page | 9 | 9 | 8 | Immediate |
| Missing Organization schema | 9 | 8 | 7 | This sprint |
| CSR rendering failure on product pages | 8 | 8 | 3 | Plan (high effort) |
| Orphan blog posts from 2021 | 3 | 5 | 7 | Backlog |
| Alt text on archive images | 2 | 4 | 8 | Backlog |
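Scoring and ranking a finding list takes a few lines. A sketch using the multiplicative form (Impact × Confidence × Ease), under which easier fixes score higher, with example values from the table above:

```python
# Rank audit findings by ICE score: (name, impact, confidence, ease),
# each scored 1-10, higher ease meaning easier to implement.

issues = [
    ("GPTBot blocked in robots.txt", 10, 9, 9),
    ("5xx errors on pricing page", 9, 9, 8),
    ("Missing Organization schema", 9, 8, 7),
    ("CSR rendering failure on product pages", 8, 8, 3),
    ("Orphan blog posts from 2021", 3, 5, 7),
]

def ice(row):
    _, impact, confidence, ease = row
    return impact * confidence * ease

ranked = sorted(issues, key=ice, reverse=True)
for row in ranked:
    print(f"{ice(row):4d}  {row[0]}")
```

Keep the raw ranking as an input, not a verdict: a low-ease, high-impact item like the SSR migration still deserves a planned slot even though quick wins outscore it.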
Separating revenue-critical fixes from housekeeping
The distinction between revenue-critical and housekeeping fixes is straightforward: does the broken element sit on a page that drives demos, trials, or pipeline? If yes, it belongs in the next sprint. If no, it belongs in the backlog.
Revenue-critical pages for most B2B SaaS sites:
- Homepage
- Pricing page
- Product and solution pages
- Comparison and "vs." pages
- High-intent blog posts (buying guides, category reviews)
- Demo or trial landing pages
Fixing a broken canonical tag on a 2019 "what is X" blog post is housekeeping. Fixing a 5xx error on your pricing page is an emergency. Audit tools do not make this distinction for you, so you have to apply it manually, using the ICE framework as your scoring method and pipeline impact as your north star.
Download: Use the Technical SEO Audit Template to log every finding, assign ICE scores, and track remediation progress with your development team.
How Discovered Labs automates technical health
Technical audits give you a snapshot. Maintaining health requires continuous monitoring as your site changes daily and AI bot behavior evolves.
Our Managed Service Model audits for AI bot access alongside traditional crawl health, ensuring GPTBot, ClaudeBot, and PerplexityBot maintain clean access to your content. We implement schema automatically on every published piece using the CITABLE framework, specifically the "E" component (Entity graph and schema) and "B" component (Block-structured for RAG, with 200-400 word sections, tables, and FAQs).
Our AI Visibility Reports connect technical health directly to pipeline. You get a weekly view of your share of voice in AI answers, broken down by query cluster and compared against your top three competitors. This shows how a robots.txt fix or schema implementation translates into an increased citation rate and AI-referred MQLs tracked in your CRM.
If you need a baseline before executing this audit, request an AI Visibility Audit. We identify which technical gaps suppress your citations and provide a ranked remediation list tied to business impact.
Frequently asked questions about technical audits
How often should I run a technical SEO audit?
Run a full four-phase audit quarterly. For crawl errors and Core Web Vitals, set up weekly automated monitoring in GSC (using the email alerts feature) and Screaming Frog's scheduled crawl tool. Any 5xx error should trigger same-day response regardless of audit cadence.
Does technical SEO directly impact AI citations?
Yes. If AI bots like GPTBot or ClaudeBot cannot crawl your pages (blocked in robots.txt, CSR rendering failures, 5xx errors), they cannot retrieve or cite your content. Structured data adds an additional layer: pages with complete entity markup are significantly easier for AI retrieval systems to parse, attribute, and surface in generated answers.
What is the single most critical technical fix?
Indexation. If a bot cannot index your page, that page does not exist for that system. Start with the Phase 1 checklist to diagnose and fix exclusion issues in under an hour, and confirm AI crawlers are not blocked in robots.txt.
What is the difference between a 301 and a 302 redirect, and when does it matter?
A 301 is a permanent redirect and passes link equity to the destination URL, signaling to bots that the content has moved for good. A 302 is temporary and does not reliably transfer link equity. Use 301s for any content consolidation, URL restructuring, or outdated page removal.
How do I know if my site has a JavaScript rendering problem?
Use GSC's URL Inspection tool and click "Test Live URL." Compare the rendered screenshot to what you see in a browser. If the bot's version shows blank sections or missing content blocks, your site has a rendering problem that requires SSR or dynamic rendering to resolve.
Key technical SEO terminology
Crawl budget: The number of URLs a bot crawls on your site within a given timeframe, determined by crawl rate limits and crawl demand. Wasting budget on 404s, redirects, or blocked pages reduces the number of valuable pages that get indexed.
Canonical tag: An HTML element (<link rel="canonical">) that tells bots which version of a page is the master copy. Essential for managing duplicate content across paginated series, URL parameters, and HTTP/HTTPS variations.
Rendering: The process by which a browser or bot converts HTML, CSS, and JavaScript into a viewable page. Bots that skip JavaScript execution see only the raw HTML, missing any content loaded client-side.
Entity markup: Schema.org properties that define real-world things (organizations, products, people, events) within your content. Entity markup helps AI systems understand relationships between concepts and attribute content correctly during retrieval.
Passage retrieval: A mechanism used by search and AI systems to extract and surface specific sections of a page in response to a query, rather than ranking the entire page. Structured content with clear headings, short paragraphs, and explicit schema is significantly more likely to get retrieved as a passage and cited.
Share of voice (AI): The percentage of relevant AI-generated answers that mention or cite your brand, measured across a defined set of buyer-intent queries. It is the AI equivalent of SERP market share, and the metric that connects technical health to pipeline in the modern audit.
Ready to turn this checklist into measurable pipeline? Request your AI Visibility Audit to see which technical gaps are suppressing your citations right now. Month-to-month terms, no long-term contracts required.