Robots.txt Best Practices and Costly Mistakes

July 5, 2026

The most expensive robots.txt mistake is a single stray character: Disallow: / blocks your entire site from search engines, while an empty Disallow: allows everything — and the difference is one slash. Robots.txt is small but unforgiving, so this guide focuses on the habits and checks that keep it from quietly wrecking your visibility.
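To make the one-slash difference concrete, here is a minimal Python sketch built on the standard library's `urllib.robotparser`. The helper name `is_allowed` and the example.com URL are illustrative assumptions. Python's parser is also simpler than Googlebot's matcher, so treat this as a sanity check rather than a faithful simulation of any search engine.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"

if __name__ == "__main__":
    page = "https://example.com/pricing"  # hypothetical URL
    print("Disallow: /  ->", is_allowed(BLOCK_ALL, page))
    print("Disallow:    ->", is_allowed(ALLOW_ALL, page))
```

The two inputs differ by a single character, yet the first denies every path and the second permits every path.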

Understand what robots.txt actually controls

The number-one conceptual error is treating robots.txt as a way to hide pages from search results. It controls crawling, not indexing. A URL you disallow can still appear in Google — without a snippet — if other sites link to it, because Google never needed to crawl the page to know it exists. If your goal is to keep a page out of results, do the opposite of blocking it: allow crawling and add a noindex meta tag, or put the page behind authentication. Reserve robots.txt for managing crawl budget and steering bots away from low-value paths, not for privacy.
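A common way this goes wrong is combining both techniques: a page carries noindex, but robots.txt blocks the crawler from ever fetching the page and seeing the tag. The sketch below checks for that conflict. The function and class names, the `/private/` path, and the choice of `Googlebot` as the agent are all assumptions for illustration, and it only inspects the meta tag, not an `X-Robots-Tag` response header.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether the page has <meta name="robots" content="...noindex...">."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots" and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def noindex_is_visible(robots_txt: str, url: str, html: str, agent: str = "Googlebot") -> bool:
    """True only when the page declares noindex AND the crawler is allowed to fetch it."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return finder.noindex and parser.can_fetch(agent, url)
```

If this returns False for a page you want kept out of results, either the tag is missing or a Disallow rule is hiding it from the crawler.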

Avoid the classic syntax and placement traps

| Mistake | Effect | Fix |
| --- | --- | --- |
| `Disallow: /` left in by accident | Blocks crawling of the whole site | Use an empty `Disallow:` to allow everything |
| File in a subfolder | Ignored entirely | Place it at the domain root |
| One file assumed to cover every subdomain | Subdomains are left without rules | Give each subdomain its own file |
| Blocking CSS/JS | Pages render incorrectly for bots | Allow the assets Google needs to render |

The file must live at exactly https://example.com/robots.txt — a copy in a subdirectory is invisible to crawlers. Each subdomain like blog.example.com needs its own file, since one does not cover the others. And blocking your CSS or JavaScript can stop Google from rendering pages correctly, which hurts more than it helps.
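The placement rule can be expressed as a tiny function: for any page URL, the only robots.txt that applies is the one at the root of that exact scheme and host. This is a sketch with a hypothetical helper name, `robots_url_for`, using Python's standard `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt location that governs `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain and port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    for url in (
        "https://example.com/shop/item?id=7",
        "https://blog.example.com/posts/robots-guide",
    ):
        print(url, "->", robots_url_for(url))
```

The two example URLs resolve to different files, which is why rules on example.com say nothing about blog.example.com.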

Set crawl-delay and AI-bot rules with realistic expectations

Crawl-delay is widely misunderstood. Googlebot ignores it outright: Google sets its own crawl rate automatically, and the Search Console crawl-rate limiter was retired in early 2024, so the documented way to slow Googlebot temporarily is to serve 503 or 429 responses. Bing honours Crawl-delay. Yandex historically did, but now points site owners to the crawl-rate setting in Yandex Webmaster, so check each engine's current documentation. Including the directive is harmless as long as you do not expect it to slow Google. For AI crawlers, disallow rules targeting user-agents like GPTBot, CCBot, anthropic-ai, ClaudeBot, Google-Extended, PerplexityBot and Bytespider signal that you do not want your content collected or used for training. Compliance is voluntary, but reputable operators respect it. Remember that Google-Extended governs AI training specifically and does not affect your normal Google Search ranking, so blocking it will not hurt regular SEO.
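As a rough sketch of how such rules fit together, the Python below generates one Disallow group per AI user-agent plus a Crawl-delay group for Bing, then reads the result back with `urllib.robotparser`. The function name `build_rules` and the five-second delay are assumptions for illustration, not recommendations, and the agent list is simply the one named above.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = [
    "GPTBot", "CCBot", "anthropic-ai", "ClaudeBot",
    "Google-Extended", "PerplexityBot", "Bytespider",
]


def build_rules(ai_agents: list[str], bing_delay_seconds: int) -> str:
    """Build robots.txt text: block the AI agents, slow Bing, allow everyone else."""
    lines: list[str] = []
    for agent in ai_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Crawl-delay is only meaningful to crawlers that support it; Googlebot ignores it.
    lines += ["User-agent: bingbot", f"Crawl-delay: {bing_delay_seconds}", "Disallow:", ""]
    lines += ["User-agent: *", "Disallow:"]
    return "\n".join(lines)


if __name__ == "__main__":
    rules = build_rules(AI_AGENTS, bing_delay_seconds=5)  # 5 is an arbitrary example value
    print(rules)
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    print("GPTBot allowed:", parser.can_fetch("GPTBot", "https://example.com/"))
    print("Googlebot allowed:", parser.can_fetch("Googlebot", "https://example.com/"))
    print("bingbot delay:", parser.crawl_delay("bingbot"))
```

Reading the file back this way only confirms what the rules say. Whether a given crawler obeys them is still up to its operator.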

Test before you publish, and always list your sitemap

Because one wrong line can cut an entire site off from crawlers, treat every robots.txt change as production-critical: review the generated file line by line, confirm the paths are what you intend, and validate it in a robots.txt testing tool before uploading. One easy win to include every time is the Sitemap: directive with an absolute URL. It helps search engines that support it discover your sitemap without manual submission, and you can list multiple sitemaps. A minimal, intentional file beats a sprawling set of rules whose reasons nobody remembers.
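One way to make that review repeatable is a small pre-publish check. The sketch below is an assumption about what such a check could look like, not a replacement for a search engine's own validator: `lint_robots` is a hypothetical helper that confirms a list of must-crawl URLs is still allowed and that every Sitemap: line is an absolute URL.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def lint_robots(robots_txt: str, must_allow: list[str], agent: str = "Googlebot") -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems: list[str] = []

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for url in must_allow:
        if not parser.can_fetch(agent, url):
            problems.append(f"blocked for {agent}: {url}")

    # Collect the value of every "Sitemap:" line (directive names are case-insensitive).
    sitemaps = [
        line.split(":", 1)[1].strip()
        for line in robots_txt.splitlines()
        if line.strip().lower().startswith("sitemap:")
    ]
    if not sitemaps:
        problems.append("no Sitemap: directive found")
    for sitemap in sitemaps:
        parts = urlsplit(sitemap)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"Sitemap is not an absolute URL: {sitemap}")

    return problems
```

Running a check like this in a deploy script, with the home page and a few key landing pages in `must_allow`, would catch an accidental `Disallow: /` before it ships.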

Try the Robots.txt Generator — free and 100% in your browser.

FAQ

Will blocking a page in robots.txt remove it from Google?

Not reliably. Robots.txt stops crawling, not indexing, so a blocked URL can still show in results without a snippet if others link to it. Use a noindex tag or authentication to truly keep a page out.

What is the single most dangerous robots.txt line?

Disallow: / under a broad user-agent blocks your entire site from being crawled. Left in by mistake, it can wipe out search visibility. An empty Disallow: is the opposite and allows everything.

Does blocking Google-Extended hurt my search rankings?

No. Google-Extended only governs whether your content is used for AI training; it is separate from Search crawling. Blocking it keeps you out of training data without affecting normal ranking.

Why is my robots.txt being ignored?

The most common cause is placement — it must sit at the domain root, not in a subfolder, and each subdomain needs its own file. Check that the exact path is /robots.txt on the correct host.

Related free tools

XML Sitemap Generator — build the sitemap to reference in robots.txt.
Meta Tag Generator — add noindex and other meta directives.
Canonical Tag Generator — resolve duplicate-URL issues.
Hreflang Tag Generator — signal language and region targeting.

Built by ByteVancer

ByteTools is a free product of ByteVancer, a software and web development studio building web apps, SaaS and custom software. If your site needs technical SEO done right or a product built from scratch, explore how ByteVancer can help.

Recommended reading

How to Write a robots.txt File and Block AI Bots

Robots.txt Use Cases: 7 Real Scenarios and Rules

XOR Cipher Use Cases: CTFs, Learning, and Puzzles

XOR Cipher Tips: Keys, Security, and Common Mistakes