
What is Robots.txt? SEO Guide for Beginners

Learn what robots.txt means in SEO, why it matters, and how to use it to improve your search rankings.

Robots.txt is a plain text file in your website's root directory that tells search engine crawlers which pages or sections they may crawl and which they should avoid. It lives at yourdomain.com/robots.txt and is the first file crawlers check before accessing your site. Simple as the format is, getting it wrong can have serious consequences for your SEO.

Why Robots.txt Matters for SEO

Your robots.txt file directly controls how search engines interact with your site. It can save crawl budget by keeping bots away from low-value pages like admin panels, internal search results, or staging environments. For large sites this is critical: every request Googlebot spends on a low-value page is one it could have spent on your important content instead.

Robots.txt also protects sensitive areas of your site from showing up in search results. If you have a staging server, development pages, or internal tools that should not be public, blocking them in robots.txt keeps crawlers away. However, it is important to understand that robots.txt is not a security mechanism. It is a polite request, not a locked door. Well-behaved bots respect it, but malicious ones will ignore it.

I once audited a site that had been struggling with indexing for months. Their developer had added a Disallow: / directive during a staging deployment and never removed it when the site went live. That single line told every search engine to ignore the entire website. Traffic dropped to near zero and stayed there until we caught the mistake. Always verify your robots.txt after deployments.

How Robots.txt Works

The file uses a simple syntax with three main directives. User-agent specifies which crawler the rules apply to (use * for all bots). Disallow tells crawlers which paths to avoid. Allow overrides a disallow for specific sub-paths. You can also include a Sitemap directive to point crawlers to your XML sitemap.

Here is a typical robots.txt file:

```
User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /api/
Allow: /api/public/

Sitemap: https://yourdomain.com/sitemap.xml
```

When Googlebot arrives at your site, it fetches the robots.txt file first. It then checks the rules before requesting any page. If a URL matches a Disallow pattern, the crawler skips it. Important caveat: blocking a page in robots.txt does not remove it from Google's index if it is already indexed. Google may keep the URL in results with a "No information available for this page" message. To truly remove a page from the index, use a noindex meta tag instead.
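You can sketch this allow/deny check with Python's standard-library robots.txt parser. This is a minimal illustration, not how Googlebot itself is implemented; the domain and paths are placeholders, and the rule set is a trimmed version of the example above:

```python
import urllib.robotparser

# Parse a small rule set and ask whether specific URLs may be crawled.
# parse() accepts the file's lines; can_fetch() applies the rules.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin/
Disallow: /search/
""".splitlines())

print(rp.can_fetch("*", "https://yourdomain.com/admin/settings"))  # False
print(rp.can_fetch("*", "https://yourdomain.com/blog/my-post"))    # True
```

In production you would point the parser at the live file with `set_url()` and `read()` instead of parsing an inline string.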

How to Improve Robots.txt on Your Site

  1. Audit your current robots.txt - Visit yourdomain.com/robots.txt right now and review every directive. Use the robots.txt report in Google Search Console to verify your rules are working as intended.

  2. Block low-value URL patterns - Identify pages that waste crawl budget, such as internal search result pages, filter/sort parameters, session ID URLs, and duplicate pagination patterns. Block them with specific Disallow rules.

  3. Never block CSS or JavaScript files - Google needs to render your pages to understand them. Blocking CSS or JS files prevents Google from seeing your site as users do, which can hurt your rankings.

  4. Include your sitemap location - Add a Sitemap directive at the bottom of your robots.txt. This helps all search engines discover your sitemap even if they have not found it through other means.

  5. Test after every change - Use the URL Inspection tool in Google Search Console to verify that important pages are still crawlable after you modify your robots.txt. One wrong character can block thousands of pages.
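The testing step above can also be automated as a post-deployment check. Here is a sketch using Python's standard library; `check_crawlable` is a hypothetical helper name, and the rule sets and URLs are placeholders standing in for your live robots.txt and your critical pages:

```python
import urllib.robotparser

def check_crawlable(robots_lines, urls, agent="Googlebot"):
    # Hypothetical helper: return the subset of critical URLs that the
    # given robots.txt rules would block for the given crawler.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return [u for u in urls if not rp.can_fetch(agent, u)]

# A stray "Disallow: /" left over from staging blocks everything:
staging_rules = ["User-agent: *", "Disallow: /"]
print(check_crawlable(staging_rules, ["/", "/pricing/"]))  # ['/', '/pricing/']

# Corrected rules that only block the admin area pass the check:
live_rules = ["User-agent: *", "Disallow: /admin/"]
print(check_crawlable(live_rules, ["/", "/pricing/"]))     # []
```

Wiring a check like this into your deploy pipeline catches the "forgot to remove Disallow: /" mistake described earlier before it reaches search engines.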

Common Mistakes to Avoid

  • Using robots.txt to hide pages from search results: Robots.txt prevents crawling, not indexing. If other sites link to a blocked page, Google may still index the URL. Use noindex meta tags to prevent pages from appearing in results.

  • Blocking entire directories accidentally: A Disallow: /blog will block /blog/, /blog-archive/, and /bloggers/ because it matches any URL starting with /blog. Be precise with your patterns and always include trailing slashes when blocking directories.

  • Forgetting to update after site changes: When you redesign your site, change URL structures, or add new sections, review your robots.txt to make sure old rules still make sense and new sections are properly handled.
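The trailing-slash pitfall is easy to reproduce with the standard-library parser. A small sketch comparing the loose pattern against the precise one (paths are placeholders):

```python
import urllib.robotparser

# "Disallow: /blog" (no trailing slash) matches every path that merely
# starts with "/blog"; "Disallow: /blog/" only matches the directory.
loose = urllib.robotparser.RobotFileParser()
loose.parse(["User-agent: *", "Disallow: /blog"])

strict = urllib.robotparser.RobotFileParser()
strict.parse(["User-agent: *", "Disallow: /blog/"])

for path in ("/blog/", "/blog-archive/", "/bloggers/"):
    print(path, loose.can_fetch("*", path), strict.can_fetch("*", path))
# /blog/ False False
# /blog-archive/ False True
# /bloggers/ False True
```

With the loose rule, all three paths are blocked; with the trailing slash, only the /blog/ directory is.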

Key Takeaways

  • Robots.txt controls which pages search engine crawlers can access on your site. It lives at your domain root.
  • Use it to save crawl budget by blocking low-value pages, not as a security or de-indexing tool.
  • Always include your sitemap URL and never block CSS or JavaScript files.
  • Test your robots.txt thoroughly after every change using Google Search Console.