Updated February 25, 2026
TL;DR: Robots.txt controls which pages crawlers can access (crawling), while meta robots tags control which pages get indexed (visibility). These two tools work at different layers of the same pipeline, and conflating them causes one of the most damaging and commonly missed technical SEO errors: "Indexed, though blocked by robots.txt." For teams optimizing for AI search, the stakes are higher. Blocking AI citation crawlers like OAI-SearchBot or PerplexityBot in your robots.txt file tells the AI platforms your prospects use for vendor research to go away. Precise configuration of both files is the first step in any credible AEO strategy, and one misconfigured line can cost you weeks of pipeline.
A single misconfigured line in your robots.txt file can block the AI crawlers your prospects rely on for vendor shortlists. We've seen B2B SaaS teams lose 6-8 weeks of AI visibility because a developer left a staging-era Disallow directive live after launch, and the standard SEO crawl report showed no errors. If your CEO is asking why competitors appear in ChatGPT answers while your product doesn't, this is the first place to check.
This guide covers the exact syntax for robots.txt and meta robots tags, how the two interact and where they conflict, the specific user agents for today's major AI crawlers, and how to test your configuration before you break something in production. If your team is working to get your B2B SaaS recommended by AI search engines, the technical foundation here is what makes everything else work.
These two tools control different layers of the crawl-and-index pipeline. Treat them as interchangeable and you cause real indexation problems.
| Attribute | robots.txt | Meta robots tag |
| --- | --- | --- |
| Function | Controls crawling (access) | Controls indexation (visibility) |
| Scope | Entire site, by path | Per-page, applied to specific URLs |
| Strength | Directive only - legitimate crawlers comply, others may not | When Googlebot reads the tag, Google drops the page from results |
| Example | `Disallow: /admin/` | `<meta name="robots" content="noindex">` |
As SISTRIX explains, "a robots meta tag with the noindex instruction reliably prevents pages from appearing in search results," while robots.txt is intended to manage crawling traffic, not indexation.
This distinction creates a common and costly trap. A page blocked in robots.txt cannot be read by Googlebot, which means the noindex directive you placed in that page's HTML is also invisible to the crawler. The result is a page that's neither crawled nor reliably deindexed, often surfacing in Google Search Console as "Indexed, though blocked by robots.txt."
Google's official guidance is clear: for a noindex rule to work, "the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler."
Think of it this way: robots.txt is the perimeter fence, and meta robots tags are the signage inside the building. Lock the gate, and the visitor never reads the sign.
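This failure mode can be checked programmatically. Below is a minimal diagnostic sketch in Python, using the standard library's urllib.robotparser (which implements the original robots exclusion rules, not Google's wildcard extensions) and a deliberately naive regex for the meta tag; a real audit should use a proper HTML parser:

```python
import re
import urllib.robotparser

def noindex_unreachable(robots_txt: str, url: str, html: str,
                        agent: str = "Googlebot") -> bool:
    """True when a page carries a noindex directive the crawler can never
    read, because robots.txt blocks the URL (the 'locked gate' case)."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    blocked = not rp.can_fetch(agent, url)
    # Naive check for <meta name="robots" ... noindex ...> in the page head.
    has_noindex = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.IGNORECASE
    ) is not None
    return blocked and has_noindex

# A blocked path plus an in-page noindex is exactly the trap described above:
print(noindex_unreachable(
    "User-agent: *\nDisallow: /private/",
    "https://www.example.com/private/report",
    '<html><head><meta name="robots" content="noindex"></head></html>',
))  # True
```

Running this across a sitemap during an audit surfaces every page stuck in the "neither crawled nor reliably deindexed" state.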
Configuring robots.txt for search and AI crawlers
Core syntax and directives
Your robots.txt file must live at yoursite.com/robots.txt. According to Google's robots.txt specification, the four primary directives are:
- User-agent: Specifies which crawler the rules apply to. The value is case-insensitive. Use * to target all crawlers, except AdsBot variants, which you must name explicitly.
- Disallow: Paths the specified crawler cannot access. The path value is case-sensitive, so /Admin/ and /admin/ are different paths.
- Allow: Paths explicitly permitted, even within a disallowed directory. Also case-sensitive.
- Sitemap: The fully qualified URL of your sitemap, including protocol and host.
A standard, production-ready configuration looks like this:
```
User-agent: *
Disallow: /admin/
Disallow: /staging/
Allow: /
Sitemap: https://www.example.com/sitemap.xml
```
Two nuances worth knowing. First, Google ignores Crawl-delay entirely, so including it for Googlebot has no effect. If Googlebot is crawling your site too aggressively, Google's reduce crawl rate documentation outlines the process for filing a special request. Second, as Cloudflare's documentation notes, wildcards (*) work in Allow and Disallow rules for path matching, giving you flexibility for pattern-based exclusions like Disallow: /*.pdf$.
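Google's documented wildcard semantics (`*` matches any run of characters, a trailing `$` anchors the match at the end of the path) can be sketched in a few lines of Python. This is an illustration of the matching rules, not an official parser, and note that Python's own urllib.robotparser does not implement these extensions:

```python
import re

def google_path_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt path pattern using Google's wildcard rules:
    '*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal segments, join them with '.*' where the '*' wildcards were.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = "^" + regex + ("$" if anchored else "")
    # re.match anchors at the start only, which mirrors prefix matching.
    return re.match(regex, path) is not None

print(google_path_matches("/*.pdf$", "/docs/whitepaper.pdf"))     # True
print(google_path_matches("/*.pdf$", "/docs/whitepaper.pdf?v=2")) # False
```

The second call is False because the `$` anchor means a PDF URL with a query string slips past `Disallow: /*.pdf$`, a detail worth remembering when writing pattern-based exclusions.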
Managing AI user agents
This is where most technical teams and generalist SEO agencies fall behind, and where we see the biggest competitive gaps. There are two distinct categories of AI crawler, and treating them identically is a strategic error that routinely goes unnoticed in site audits.
Training crawlers collect content to build the knowledge base of large language models. Blocking them stops your content from being absorbed into model training data.
Search crawlers index content for AI-powered search experiences and, critically, send referral traffic back through citations when prospects ask for vendor recommendations.
As Playwire's publisher guide explains, allowing AI search crawlers while blocking training crawlers lets you maintain visibility in AI-powered search results while protecting your content from being absorbed into training datasets.
Here are the current user agents for the major AI platforms, drawn from official documentation and confirmed by momenticmarketing.com's AI crawler reference:
- OpenAI: GPTBot (training), OAI-SearchBot (search and citations), ChatGPT-User (real-time requests)
- Anthropic: anthropic-ai (training), ClaudeBot (training), claude-web (real-time fetching for Claude queries)
- Perplexity: PerplexityBot (indexing), Perplexity-User (real-time requests)
- Google: Google-Extended (Gemini and AI training)
- Common Crawl: CCBot (open dataset that feeds many LLMs)
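For log triage, the crawler list above can be expressed as a lookup table. This is a sketch using naive substring matching against raw User-Agent strings; the training/search split mirrors the categories described above:

```python
# Mapping mirrors the crawler list above: which bots train models vs.
# which power AI search and citations.
AI_CRAWLERS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "search",
    "anthropic-ai": "training",
    "ClaudeBot": "training",
    "claude-web": "search",
    "PerplexityBot": "search",
    "Perplexity-User": "search",
    "Google-Extended": "training",
    "CCBot": "training",
}

def classify_user_agent(ua: str) -> str:
    """Return 'training', 'search', or 'other' for a raw User-Agent string."""
    for token, category in AI_CRAWLERS.items():
        if token.lower() in ua.lower():
            return category
    return "other"

print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))  # training
```

Run this over your access logs and you can see which category of AI bot is actually hitting your site before you decide what to allow or block.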
If you want to appear in AI-generated answers while controlling training data access, this configuration is your starting point:
```
# Allow AI search crawlers (for citations and referrals)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: claude-web
Allow: /

# Block training crawlers (optional - your decision)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Traditional search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```
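If this policy lives in version control, generating the file from two lists makes drift easy to catch in a deploy script. The following is a sketch; the bot lists mirror the configuration above and the sitemap URL is a placeholder:

```python
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
               "Perplexity-User", "claude-web"]        # keep citations flowing
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "anthropic-ai",
                 "Google-Extended", "CCBot"]           # optionally block training

def build_robots_txt(sitemap_url: str) -> str:
    """Render the allow-search / block-training policy as a robots.txt string."""
    lines = ["# Allow AI search crawlers (for citations and referrals)"]
    for bot in SEARCH_BOTS:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    lines.append("# Block training crawlers (optional - your decision)")
    for bot in TRAINING_BOTS:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines)

print(build_robots_txt("https://yoursite.com/sitemap.xml"))
```

A CI step can then diff this output against the file served in production, so a forgotten manual edit never goes unnoticed.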
Research from amicited.com highlights an important competitive gap: while many popular websites block some AI bots in robots.txt, they are not uniform in the bots they block and most only include directives for the most popular AI services. Your competitors may have already opened access to citation crawlers without fully auditing their configuration. Before you assume yours is correct, verify it against the full list above.
This connects directly to what we cover in comparing Google AI Overviews vs. ChatGPT vs. Perplexity for optimization. Each platform uses different crawlers with different citation behaviors, and your robots.txt needs to reflect that platform-level specificity.
When to use noindex and nofollow
According to Mozilla's developer documentation, the four most commonly applied meta robots directives are:
- noindex: Requests the crawler not to index the page. Apply this to staging environments, internal search results pages, PPC landing pages, admin and login pages, and thin or duplicate content that would dilute topical authority.
- nofollow: Tells crawlers not to follow links on that page. Use this for pages with untrusted user-generated content, where passing link equity to unknown destinations is a risk.
- noarchive: Prevents search engines from caching the page content, appropriate for pages with time-sensitive or confidential information.
- nosnippet: Prevents any description or preview from appearing in search results. As Conductor's documentation explains, nosnippet also functions as a noarchive directive and prevents content from feeding into AI Overviews, though it removes meta descriptions and rich snippets as a side effect.
For HTML pages, the tag goes in the <head> section:
```html
<meta name="robots" content="noindex, nofollow">
```
For non-HTML files like PDFs or video files that have no HTML <head>, use an X-Robots-Tag HTTP response header. Google's crawling and indexing documentation provides this format:
```http
HTTP/1.1 200 OK
X-Robots-Tag: noindex
```
You can apply the X-Robots-Tag globally through server-level configuration (via .htaccess or httpd.conf on Apache), which is useful for bulk control across file types without modifying individual assets. Semrush's meta robots guide covers combining multiple directives in a single tag for more precise control, such as content="noindex, noarchive" for pages you want neither indexed nor cached.
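If your assets are served by a Python application rather than Apache, the same bulk rule can be sketched as a small helper that decides the extra response headers by file extension. The extension list here is an assumption; adjust it per site:

```python
# Assumption: these are the non-HTML file types you want kept out of the index.
NOINDEX_EXTENSIONS = (".pdf", ".doc", ".docx")

def extra_robots_headers(path: str) -> dict:
    """Return X-Robots-Tag headers for file types that can't carry a meta tag."""
    if path.lower().endswith(NOINDEX_EXTENSIONS):
        return {"X-Robots-Tag": "noindex"}
    return {}

print(extra_robots_headers("/whitepapers/pricing-guide.PDF"))
# {'X-Robots-Tag': 'noindex'}
```

Merging this dict into every response gives you the same centralized control as an .htaccess rule, without touching individual assets.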
What happens when robots.txt and noindex conflict
When Googlebot encounters a Disallow rule in robots.txt, it stops before reaching the page and never sees the noindex directive. Google may still index the URL if it finds external links pointing to it because, as siteguru.co explains, search engines might deem a blocked page important based on link signals alone.
The resolution depends on your intended outcome:
To index the page: Remove the Disallow rule in robots.txt, verify Googlebot can access the page, then request indexing via Google Search Console's URL Inspection tool.
To deindex the page: Remove the Disallow rule, allow Googlebot to crawl the page, add <meta name="robots" content="noindex"> to the HTML <head>, and submit the URL for recrawl. SEOTesting confirms this workflow: you must allow the crawl for the noindex tag to take effect.
Robots.txt operates on voluntary compliance, not a technical barrier. Meta robots tags carry stronger, more reliable enforcement because the instruction is read after the page is accessed. For pages you want reliably removed from search results, allowing the crawl and applying meta noindex is the correct approach.
Testing and validation workflows
Never deploy robots.txt changes to production without testing them first. A misplaced / can prevent Googlebot from accessing every page on your site, and you won't always get an immediate alert.
Pre-deployment testing
Use a technical crawler like Screaming Frog SEO Spider with a custom robots.txt to simulate how Googlebot will interpret your file before it goes live. Screaming Frog surfaces every blocked URL under the "Response Codes" tab and matches each against the specific disallow rule that triggered it. Yoast's robots.txt guide recommends using Google Search Console's Settings panel for post-deploy parsing error detection, while SearchEngineLand's robots.txt reference also recommends Bing Webmaster Tools for cross-engine validation.
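The same simulation can be scripted before the file ships. Here is a sketch using Python's standard urllib.robotparser; note that it implements the original exclusion standard and ignores Google-style wildcards, so keep a dedicated crawler in the loop for wildcard rules:

```python
import urllib.robotparser

def blocked_pairs(robots_txt: str, urls: list[str], agents: list[str]):
    """Return every (agent, url) pair the proposed robots.txt would block."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [(a, u) for a in agents for u in urls if not rp.can_fetch(a, u)]

# A proposed file: block GPTBot everywhere, block /admin/ for everyone else.
proposed = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin/\n"
print(blocked_pairs(
    proposed,
    ["https://www.example.com/pricing", "https://www.example.com/admin/login"],
    ["Googlebot", "GPTBot", "OAI-SearchBot"],
))
```

Feed it your sitemap URLs and the full AI agent list, and any unexpected (agent, url) pair in the output is a regression you caught before deploy.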
Post-deployment validation
After launching changes, check Google Search Console's Settings page for robots.txt status reporting. The older standalone Robots.txt Tester tool has been deprecated, but the Settings panel surfaces any parsing errors Google encounters after the file goes live.
AI-specific validation
Standard tools don't check whether AI citation crawlers are specifically blocked. This is the gap most generalist agencies miss. At Discovered Labs, the AI Search Visibility Audit checks crawlability status specifically against AI citation crawlers, not just Googlebot. Pair this with ongoing monitoring using the best tools to track your brand in AI answers so you can confirm that configuration fixes translate into measurable citation improvements.
Common configuration mistakes that kill visibility
Problem: These errors are easy to introduce and slow to detect because they don't always trigger immediate alerts in Search Console. Most are discovered only after organic traffic or AI citation rates drop unexpectedly.
Impact: A misplaced Disallow: / on your live site blocks every crawler from accessing every page instantly, and you won't always get an immediate alert. Blocking CSS and JavaScript files prevents proper rendering. As IgniteVisibility explains, blocking these resources "put a blindfold on the Google Algorithms, not allowing Google's bots to see your website as it's meant to be seen by the users," which directly harms SERP and AI visibility. Leaving staging-era noindex tags live after launch is one of the most reported developer errors across technical SEO audits and one of the most avoidable.
Preventive measures: Run a pre-deployment crawl simulation before every robots.txt change. Build a post-launch checklist that verifies noindex tags are removed from production pages. Audit AI bot permissions specifically whenever you publish content you want cited in AI search responses.
Here are the four highest-impact mistakes we see consistently:
- Blocking the entire site. Disallow: / under User-agent: * tells every crawler to stop accessing your site. Searchfacts confirms that this "will tell all robots and web crawlers that they are not allowed to access or crawl your site." This is almost always a development environment directive that nobody removed at launch, and it prevents Googlebot from updating its index with your content.
- Blocking CSS and JavaScript. StanVentures' robots.txt guide notes that blocking CSS and JS "can prevent search engines from accurately rendering and understanding your site, potentially harming your site's visibility in SERPs." Remove any disallow directives for CSS and JS files from your production configuration.
- Using robots.txt as a security tool. Listing sensitive URLs in robots.txt to hide them from crawlers has the opposite effect: it publicizes their existence. Security experts, including NIST, recommend against this practice as a security technique because it relies on "security through obscurity" rather than actual access controls. Use server-level authentication to protect sensitive paths instead.
- Leaving staging directives on live sites. Search Engine Journal's coverage of common robots.txt issues identifies forgotten staging-era disallow rules as a leading cause of unexpected indexation gaps after site launches. Add a mandatory robots.txt review to every deployment checklist.
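The first mistake on this list is easy to lint for in CI before a deploy. The check below is a rough sketch that implements just enough of the user-agent grouping rules to catch a global Disallow under User-agent: *; it is not a full robots.txt parser:

```python
def has_global_disallow(robots_txt: str) -> bool:
    """Detect the classic staging leftover: 'Disallow: /' under 'User-agent: *'."""
    agents: list[str] = []
    collecting = True  # consecutive User-agent lines form one rule group
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not collecting:       # a rule line ended the previous group
                agents = []
                collecting = True
            agents.append(value)
        elif field in ("allow", "disallow"):
            collecting = False
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False

print(has_global_disallow("User-agent: *\nDisallow: /"))  # True
```

Wire this into the deployment checklist and a staging-era Disallow can fail the build instead of silently killing your visibility for weeks.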
For teams evaluating whether their current agency has audited these layers properly, our comparison of AEO-specialist versus generalist SEO delivery goes into the specific gaps this creates in pipeline conversion.
How Discovered Labs handles this
Most audit tools don't surface AI user agents by default, so generalist SEO agencies verify that Googlebot can crawl your pages and move on. They rarely check whether OAI-SearchBot, claude-web, or PerplexityBot can access them.
The difference matters for pipeline. With 48% of B2B buyers already using AI platforms for vendor research, a clean Googlebot crawl status means little if your content is invisible to the crawlers powering the AI answers your prospects read. Our case study on 3x citation rates in 90 days shows that technical fixes, including robots.txt corrections, deliver measurable citation lifts within the first two weeks of an engagement.
Our CITABLE framework starts with "C": Clear entity and structure. Before we produce a single article, we verify your robots.txt file doesn't block AI citation crawlers, we confirm your high-value pages carry no legacy noindex tags, and we ensure your sitemap surfaces your most important content to every relevant bot.
If you're unsure whether your configuration is blocking AI visibility, book an AI Search Visibility Audit and we'll deliver a report showing which AI citation crawlers you're blocking, which pages carry conflicting directives, how your setup compares to your top 3 competitors, and a prioritized fix list you can hand directly to your dev team. Most clients see initial citation rate improvements within 2 weeks of implementing the fixes. If you're evaluating how this approach compares across different service models, our AEO scalability breakdown covers what a technical-first engagement looks like at different company sizes.
FAQs
Does robots.txt reliably block AI scrapers from training on my content?
No, not reliably. Robots.txt is voluntary compliance, not a technical barrier. You must name specific AI training user agents (like GPTBot, anthropic-ai, Google-Extended) to block them, and even then, determined scrapers can ignore the directive. For reliable content protection, combine robots.txt directives with server-level access controls.
How do I apply noindex to a PDF?
PDFs don't have an HTML <head>, so a meta robots tag won't work. If you need to block a PDF from indexation (for example, a gated whitepaper you want to keep behind a form), return an X-Robots-Tag: noindex HTTP response header from your server for the PDF URL. On Apache, configure this in .htaccess using a file extension match like <FilesMatch "\.pdf$">.
Why is Google still indexing a page I disallowed in robots.txt?
Google likely found external links to that URL and deemed it index-worthy based on link signals alone. To reliably remove it: remove the Disallow rule, allow Googlebot to crawl the page, add <meta name="robots" content="noindex"> to the HTML <head>, and submit for recrawl via Google Search Console URL Inspection. Siteguru.co's indexed-blocked guide covers each step in detail.
Should I allow GPTBot or block it?
GPTBot is a training crawler, not a citation crawler. Allowing it contributes your content to OpenAI's training data but doesn't directly drive AI-referred traffic or citations. To capture referrals and pipeline, focus on allowing OAI-SearchBot and ChatGPT-User, which power ChatGPT's real-time search and recommendation functionality. For a full platform comparison, see our guide to which AI platform to optimize for first.
Does Google support Crawl-delay in robots.txt?
No. Google ignores Crawl-delay entirely. If you're experiencing excessive Googlebot crawl activity, Google's reduce crawl rate documentation outlines the process for filing a special request. Bing does support Crawl-delay, so including it remains useful if Bingbot crawl volume is a concern. For context on how AI crawl-to-referral ratios compare to Google's, our Reddit and ChatGPT influence research includes data on how AI platforms retrieve and cite content differently from traditional search engines.
Key terms glossary
User-agent: The identifier string a crawler sends to declare which bot it is. In robots.txt, it specifies which bot a rule applies to. Values are case-insensitive, but path values in Disallow and Allow rules are case-sensitive.
Disallow directive: A robots.txt instruction telling a specified crawler not to access a given URL path. It operates on voluntary compliance, not a technical block.
X-Robots-Tag: An HTTP response header that applies indexation instructions (such as noindex) to any file type, including PDFs and media files that don't support HTML meta tags.
Crawl budget: The number of URLs a search engine crawler processes on your site within a given timeframe. Wasting crawl budget on low-value pages (admin areas, duplicate content, URL parameters) reduces how frequently your high-value pages get crawled and updated in the index.
Noindex: A meta robots directive instructing search engines not to include the page in their index. Requires the page to be accessible to crawlers to take effect. Blocking the page in robots.txt prevents Googlebot from reading the directive.