Robots.txt Best Practices and Costly Mistakes
The most expensive robots.txt mistake is a single stray character: Disallow: / blocks your entire site from search engines, while an empty Disallow: allows everything, and the difference is one slash. Robots.txt is small but unforgiving, so this guide focuses on the habits and checks that keep it from quietly wrecking your visibility.
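The one-slash difference can be checked locally with Python's standard-library urllib.robotparser. This is a minimal sketch: the is_allowed helper and the example.com URL are illustrative assumptions, and Python's parser only approximates how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Parse robots.txt text in memory and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The two files differ by a single slash.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

url = "https://example.com/pricing"  # hypothetical page
print(is_allowed(BLOCK_ALL, url))  # False
print(is_allowed(ALLOW_ALL, url))  # True
```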
Understand what robots.txt actually controls
The number-one conceptual error is treating robots.txt as a way to hide pages from search results. It controls crawling, not indexing. A URL you disallow can still appear in Google, without a snippet, if other sites link to it, because Google never needed to crawl the page to know it exists. If your goal is to keep a page out of results, do the opposite of blocking it: allow crawling and add a noindex meta tag (a crawler can only obey a tag it is allowed to fetch), or put the page behind authentication. Reserve robots.txt for managing crawl budget and steering bots away from low-value paths, not for privacy.
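The interaction between the two mechanisms can be sketched in a few lines of standard-library Python: a noindex tag only has a chance to work when the same URL is crawlable. The RobotsMetaFinder class, the noindex_is_effective helper and the sample URL and HTML are all hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def noindex_is_effective(robots_txt: str, url: str, html: str,
                         agent: str = "Googlebot") -> bool:
    """True only if the page carries noindex AND the crawler may fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives and parser.can_fetch(agent, url)


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
url = "https://example.com/private/report"  # hypothetical page

blocked = "User-agent: *\nDisallow: /private/\n"
crawlable = "User-agent: *\nDisallow:\n"

print(noindex_is_effective(blocked, url, page))    # False: the tag is never fetched
print(noindex_is_effective(crawlable, url, page))  # True
```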
Avoid the classic syntax and placement traps
| Mistake | Effect | Fix |
|---|---|---|
| Disallow: / left in by accident | Blocks the whole site | Use empty Disallow: to allow all |
| File in a subfolder | Ignored entirely | Place at the domain root |
| Same file for every subdomain | Subdomains uncovered | Give each subdomain its own file |
| Blocking CSS/JS | Broken rendering for bots | Allow assets Google needs |
The file must live at exactly https://example.com/robots.txt; a copy in a subdirectory is invisible to crawlers. Each subdomain like blog.example.com needs its own file, since one does not cover the others. And blocking your CSS or JavaScript can stop Google from rendering pages correctly, which hurts more than it helps.
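To see which robots.txt actually governs a given page, it helps to derive the file's location from the page URL. The sketch below assumes a hypothetical robots_url_for helper and uses only urllib.parse; it keeps the scheme and host and replaces everything else with /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that applies to `page_url` (same scheme and host)."""
    parts = urlsplit(page_url)
    # Path, query and fragment are discarded: only the root-level file counts.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("https://example.com/shop/item?id=7"))
# https://example.com/robots.txt
print(robots_url_for("https://blog.example.com/2024/post"))
# https://blog.example.com/robots.txt
```

The two results differ because the hosts differ, which is why a file on example.com says nothing about blog.example.com.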
Set crawl-delay and AI-bot rules with realistic expectations
Crawl-delay is widely misunderstood. Googlebot ignores it outright and sets its own crawl rate (the Search Console crawl-rate limiter has been retired), though Bing and Yandex do honour it, so it is safe to include for them without expecting it to slow Google. For AI crawlers, disallow rules targeting user-agents like GPTBot, CCBot, anthropic-ai, ClaudeBot, Google-Extended, PerplexityBot and Bytespider signal that you do not want your content used for training. Compliance is voluntary, but reputable operators respect it. Remember that Google-Extended governs AI training specifically and does not affect your normal Google Search ranking, so blocking it will not hurt regular SEO.
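A file combining both ideas might be generated and sanity-checked as follows. This is a sketch under assumptions: build_rules is a hypothetical helper, the 10-second delay is an arbitrary example value, and urllib.robotparser only reports what the file says, not whether a given bot will obey it.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the text above.
AI_AGENTS = ["GPTBot", "CCBot", "anthropic-ai", "ClaudeBot",
             "Google-Extended", "PerplexityBot", "Bytespider"]


def build_rules(ai_agents, crawl_delay_seconds):
    """Build robots.txt text: block AI agents, delay bingbot, allow everyone else."""
    lines = [f"User-agent: {name}" for name in ai_agents]
    lines += ["Disallow: /", ""]
    lines += ["User-agent: bingbot",
              f"Crawl-delay: {crawl_delay_seconds}",
              "Disallow:", ""]
    lines += ["User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"


robots_txt = build_rules(AI_AGENTS, crawl_delay_seconds=10)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://example.com/articles/"          # hypothetical URL
print(parser.can_fetch("GPTBot", page))          # False
print(parser.can_fetch("Google-Extended", page)) # False
print(parser.can_fetch("Googlebot", page))       # True: Search crawling is unaffected
print(parser.crawl_delay("bingbot"))             # 10
print(parser.crawl_delay("Googlebot"))           # None: no delay is declared for it
```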
Test before you publish, and always list your sitemap
Because one wrong line can deindex a site, treat every robots.txt change as production-critical: review the generated file line by line, confirm the paths are what you intend, and validate it in a robots testing tool before uploading. One easy win to include every time is the Sitemap: directive with an absolute URL. It helps all search engines discover your sitemap without manual submission, and you can list multiple sitemaps. Keeping the file minimal and intentional beats a sprawling set of rules nobody remembers the reason for.
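Part of that review can be automated before the file goes live. The sketch below is one possible pre-publish check, not a replacement for a search engine's own tester: prepublish_problems and the sample URLs are hypothetical, and site_maps() requires Python 3.8 or newer.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def prepublish_problems(robots_txt, must_allow, agent="Googlebot"):
    """Return a list of human-readable problems; an empty list means the draft passed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # 1. Pages that must stay crawlable.
    problems = [f"blocked: {url}" for url in must_allow
                if not parser.can_fetch(agent, url)]

    # 2. At least one Sitemap: line, each with an absolute URL.
    sitemaps = parser.site_maps() or []
    if not sitemaps:
        problems.append("no Sitemap: directive")
    for sitemap in sitemaps:
        parts = urlsplit(sitemap)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"sitemap URL is not absolute: {sitemap}")
    return problems


key_pages = ["https://example.com/", "https://example.com/products/"]

bad_draft = "User-agent: *\nDisallow: /\nSitemap: /sitemap.xml\n"
good_draft = ("User-agent: *\nDisallow: /cart/\n"
              "Sitemap: https://example.com/sitemap.xml\n")

print(prepublish_problems(bad_draft, key_pages))
print(prepublish_problems(good_draft, key_pages))  # []
```

Running a check like this in a deploy script means an accidental Disallow: / fails the build instead of reaching production.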
Try the Robots.txt Generator: free and 100% in your browser.
FAQ
Will blocking a page in robots.txt remove it from Google?
Not reliably. Robots.txt stops crawling, not indexing, so a blocked URL can still show in results without a snippet if others link to it. Use a noindex tag or authentication to truly keep a page out.
What is the single most dangerous robots.txt line?
Disallow: / under a broad user-agent blocks your entire site from being crawled. Left in by mistake, it can wipe out search visibility. An empty Disallow: is the opposite and allows everything.
Does blocking Google-Extended hurt my search rankings?
No. Google-Extended only governs whether your content is used for AI training; it is separate from Search crawling. Blocking it keeps you out of training data without affecting normal ranking.
Why is my robots.txt being ignored?
The most common cause is placement: it must sit at the domain root, not in a subfolder, and each subdomain needs its own file. Check that the exact path is /robots.txt on the correct host.
Related free tools
- XML Sitemap Generator: build the sitemap to reference in robots.txt.
- Meta Tag Generator: add noindex and other meta directives.
- Canonical Tag Generator: resolve duplicate-URL issues.
- Hreflang Tag Generator: signal language and region targeting.
Built by ByteVancer
ByteTools is a free product of ByteVancer, a software and web development studio building web apps, SaaS and custom software. If your site needs technical SEO done right or a product built from scratch, explore how ByteVancer can help.