Understanding robots.txt for SEO
The robots.txt file is a plain text file placed at the root of your website (e.g., https://example.com/robots.txt) that instructs search engine crawlers which pages or sections of your site they should or shouldn't crawl. It's part of the Robots Exclusion Protocol, a voluntary standard that major crawlers including Googlebot, Bingbot, and others respect.
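A minimal robots.txt illustrating the format (the paths and sitemap URL here are only examples):

```
User-agent: *
Disallow: /admin/
Disallow: /search

Sitemap: https://example.com/sitemap.xml
```

Each `User-agent` group applies to the named crawler (`*` matches any), and `Disallow` paths are matched as prefixes against the URL path.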
A critical misunderstanding many site owners have is that Disallow in robots.txt prevents pages from being indexed. It does not; it only prevents crawling. A page that's linked to from other pages can still appear in search results even if it's disallowed in robots.txt. To prevent indexing, you need the noindex meta tag or X-Robots-Tag header, which requires the page to be crawlable in the first place.
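A noindex signal has to be served by the page itself so the crawler can see it. In HTML it looks like this:

```
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is the HTTP response header `X-Robots-Tag: noindex`. Either way, the URL must not be disallowed in robots.txt, or the crawler will never fetch the page and never see the directive.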
Common legitimate uses of robots.txt include blocking admin areas, preventing duplicate content from parameter-based URLs, reducing crawl budget waste on thin pages like login pages and internal search results, and keeping crawlers out of development or staging environments.
Robots.txt Best Practices
- Always include your sitemap: add a Sitemap: directive pointing to your XML sitemap. This helps crawlers discover all your important pages efficiently.
- Don't block CSS/JS: Google needs to render your pages to assess them. Blocking CSS, JavaScript, or image resources prevents Google from seeing your pages as users do, which can harm rankings.
- Crawl-delay is mostly ignored: Google ignores the Crawl-delay directive. Use Google Search Console to manage crawl rate instead.
- Test before deploying: always test your robots.txt changes against important URLs before publishing. Accidentally blocking your entire site (Disallow: /) is a common and devastating mistake.
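One way to sanity-check a ruleset before deploying it is Python's standard-library robots.txt parser. This is a minimal sketch; the rules and URLs below are hypothetical examples, not a real site's file:

```python
# Check candidate URLs against a draft robots.txt before publishing it,
# using the standard-library parser (urllib.robotparser).
from urllib.robotparser import RobotFileParser

draft_rules = """\
User-agent: *
Disallow: /admin/
Disallow: /search
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(draft_rules.splitlines())

# Verify that important pages stay crawlable and private areas do not.
for url in ("https://example.com/", "https://example.com/admin/login"):
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)
```

Running a check like this against your homepage, key landing pages, and the areas you intend to block catches the "Disallow: /" class of mistake before it reaches production. Note that `urllib.robotparser` matches rules in file order, which can differ from Google's longest-match behavior on files that mix overlapping Allow and Disallow rules.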