
Robots.txt File: Set up Disallow, Allow, and Sitemaps Properly

For proper SEO on your website, you need a well-optimized robots.txt file to ensure that search engines can crawl and index your site effectively. The robots.txt file serves as a set of instructions for search engine crawlers, telling them which parts of your site they may access and which parts they should avoid. In this short guide we will cover how to set up and optimize your robots.txt file to improve your website’s SEO performance, including disallow and allow rules, adding XML sitemaps to robots.txt, and understanding how Google interprets robots.txt.

The Basics of Robots.txt

Let’s start with the basics. A robots.txt file is a plain text file that resides in the root directory of your website. It follows the Robots Exclusion Protocol (REP), the standard that governs how search engine crawlers interact with websites. The robots.txt file consists of rules that specify which parts of your site may be crawled and which parts should be avoided by search engine crawlers.

Here is an example of a simple robots.txt file:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml

In this example, the robots.txt file has two rules. The first tells the Googlebot crawler not to crawl any URL that starts with /nogooglebot/. The second allows all other user agents (matched by the * wildcard) to crawl the entire site. The file also includes a Sitemap directive that points to the URL of the website’s XML sitemap.

Creating and Editing Your Robots.txt File

To create or edit your robots.txt file, you can use any text editor, such as Notepad (for Windows) or TextEdit (for Mac). It’s important to save the file with UTF-8 encoding to ensure compatibility with search engine crawlers. Avoid using word processors, as they may add unexpected characters that can cause issues for crawlers.

When it comes to the location of your robots.txt file, it must be placed in the root directory of your website. For example, if your website is https://www.example.com/, the robots.txt file should be accessible at https://www.example.com/robots.txt. Name and place the file correctly, because search engine crawlers look for robots.txt only in the root directory.
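
To illustrate the point, here is a minimal Python sketch (the domain is hypothetical) showing how the robots.txt location is derived from any page URL: crawlers simply combine the scheme and host with /robots.txt, ignoring the page path.

from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    # Crawlers look for robots.txt at the root of the host, regardless of the page path.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://www.example.com/blog/post-1"))
# https://www.example.com/robots.txt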

Using Disallow and Allow Rules

Disallow and allow rules are the heart of the robots.txt file, as they determine which parts of your site are accessible to search engine crawlers. Let’s take a closer look at how these rules work and how you can use them to optimize your website’s crawlability.

Disallow Rules

The disallow rule specifies the URL paths that crawlers should not access. It is used to keep search engine crawlers out of certain parts of your site. Here’s an example:

User-agent: *
Disallow: /account/
Disallow: /admin.asp

In this example, the disallow rule prevents all user agents from crawling the /account/ directory and the /admin.asp page. Keep in mind that robots.txt controls crawling rather than indexing: a blocked URL can still show up in search results without its content if other pages link to it.

You should also know that the paths in disallow rules are case-sensitive: Disallow: /account/ does not block /Account/. If you want to disallow a specific directory, make sure to use the exact capitalization in your rule.
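
If you want to verify this yourself, here is a small sketch using the third-party protego parser (pip install protego), which follows Google’s robots.txt matching rules; the domain and paths are only examples:

from protego import Protego

rules = """
User-agent: *
Disallow: /account/
"""

rp = Protego.parse(rules)

# Path matching is case-sensitive: only the lowercase /account/ is blocked.
print(rp.can_fetch("https://www.example.com/account/settings", "mybot"))  # False
print(rp.can_fetch("https://www.example.com/Account/settings", "mybot"))  # True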

Allow Rules

The allow rule, on the other hand, specifies the URL paths that crawlers are allowed to access. It can be used to override a disallow rule and grant access to specific subdirectories or pages within a disallowed directory. Here’s an example:

User-agent: *
Disallow: /account/
Allow: /account/public/

In this example, the disallow rule blocks the entire /account/ directory, while the allow rule grants access to the /account/public/ subdirectory inside it. Because the allow rule’s path is longer and therefore more specific, it takes precedence for URLs under /account/public/. This lets you expose selected content within an otherwise disallowed directory while keeping the rest of it off-limits to crawlers.
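
As a sanity check, the following sketch (again using the third-party protego parser, with hypothetical URLs) confirms that the more specific allow rule wins for URLs under /account/public/:

from protego import Protego

rules = """
User-agent: *
Disallow: /account/
Allow: /account/public/
"""

rp = Protego.parse(rules)

# The longer, more specific rule wins for each URL.
print(rp.can_fetch("https://www.example.com/account/settings", "mybot"))         # False (disallowed)
print(rp.can_fetch("https://www.example.com/account/public/help.html", "mybot")) # True (allowed)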

Wildcards in Disallow and Allow Rules

Both disallow and allow rules support pattern matching for more flexible URL targeting. The * wildcard matches any sequence of characters, while the $ character anchors a rule to the end of a URL. Here are some examples:

User-agent: *
Disallow: /images/*.jpg

User-agent: Googlebot
Disallow: /blog/$

In the first example, the rule prevents all user agents from crawling any URL under /images/ that contains .jpg. In the second example, the trailing $ means that only the exact URL /blog/ is disallowed for the Googlebot crawler; pages beneath it, such as /blog/post-1, remain crawlable.

It’s important to use wildcards carefully, as they can have unintended consequences if not used correctly. Make sure to thoroughly test your robots.txt file to ensure that the desired URLs are properly disallowed or allowed.
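
To see how these wildcard rules behave in practice, here is a short test sketch with the same third-party protego parser and hypothetical URLs; note that Googlebot follows only the group addressed to it, not the * group:

from protego import Protego

rules = """
User-agent: *
Disallow: /images/*.jpg

User-agent: Googlebot
Disallow: /blog/$
"""

rp = Protego.parse(rules)

# The * group blocks any URL under /images/ containing .jpg for generic crawlers.
print(rp.can_fetch("https://www.example.com/images/photo.jpg", "mybot"))   # False
print(rp.can_fetch("https://www.example.com/images/photo.png", "mybot"))   # True

# The $ anchor blocks only the exact /blog/ URL for Googlebot; deeper pages stay crawlable.
print(rp.can_fetch("https://www.example.com/blog/", "Googlebot"))          # False
print(rp.can_fetch("https://www.example.com/blog/post-1", "Googlebot"))    # True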

Adding XML Sitemaps to Robots.txt

XML sitemaps are an essential tool for helping search engines understand the structure and content of your website. By including a reference to your sitemap in your robots.txt file, you can ensure that search engine crawlers discover and crawl your sitemap regularly.

To add a reference to your XML sitemap in your robots.txt file, use the Sitemap directive followed by the URL of your sitemap. Here’s an example:

Sitemap: https://www.example.com/sitemap.xml

In this example, the robots.txt file includes a Sitemap directive that points to the location of the website’s XML sitemap. This informs search engine crawlers about the existence and location of the sitemap, allowing them to crawl and index the pages more efficiently.

It’s important to note that you can include multiple Sitemap directives in your robots.txt file if you have multiple sitemaps for different sections or languages of your website. Each Sitemap directive should point to the specific URL of the respective sitemap.

#Sitemaps:
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/blog/sitemap_index.xml
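
If you want to double-check which sitemap URLs a parser actually picks up from your file, the third-party protego parser used earlier can list them (a sketch with hypothetical URLs; the sitemaps attribute is assumed from that library):

from protego import Protego

rules = """
# Sitemaps:
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/blog/sitemap_index.xml

User-agent: *
Allow: /
"""

rp = Protego.parse(rules)

# Print every sitemap URL declared in the file.
for sitemap_url in rp.sitemaps:
    print(sitemap_url)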

Google’s Interpretation of Robots.txt

Google’s crawlers interpret robots.txt files to determine the crawlability of websites. It’s important to understand how Google interprets and handles robots.txt files to ensure that your website is being crawled and indexed as intended.

When requesting a robots.txt file, Google’s crawlers pay attention to the HTTP status code of the server’s response. Different status codes signal different outcomes:

  • 2xx (success): Google’s crawlers process the robots.txt file as provided by the server.
  • 3xx (redirection): Google follows up to five redirect hops; if it still hasn’t reached the robots.txt file, it treats the file as a 404. Logical redirects in robots.txt files (for example, JavaScript or meta refresh-type redirects) are not followed.
  • 4xx (client errors): Google’s crawlers treat all 4xx errors (except 429) as if a valid robots.txt file doesn’t exist, which means Google assumes there are no crawl restrictions. Google also recommends against using 401 and 403 status codes to limit crawl rate, as they have no effect on it.
  • 5xx (server errors): Google temporarily interprets 5xx and 429 server errors as if the site is fully disallowed. Google will try crawling the robots.txt file until it obtains a non-server-error HTTP status code. If the robots.txt file is unreachable for more than 30 days, Google will use the last cached copy. If unavailable, Google will assume there are no crawl restrictions.

In short, your robots.txt file should return the correct HTTP status code and be accessible to search engine crawlers. Site owners and webmasters should make a habit of regularly testing their robots.txt file to verify its accessibility and correctness.
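
A quick way to spot problems is to fetch your robots.txt and look at the HTTP status code yourself. Here is a minimal sketch using Python’s standard library; the URL is a placeholder, and the interpretation comments follow the Google behavior described above:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

url = "https://www.example.com/robots.txt"  # replace with your own domain

try:
    with urlopen(url, timeout=10) as response:
        body = response.read()
        print(f"HTTP {response.status}: robots.txt fetched ({len(body)} bytes)")
except HTTPError as err:
    # 4xx is treated as "no robots.txt, no restrictions"; 5xx and 429 as "site fully disallowed".
    print(f"HTTP {err.code}: check how crawlers will interpret this status")
except URLError as err:
    print(f"Request failed: {err.reason}")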

Optimizing Your Robots.txt File

Now that you understand the fundamentals of robots.txt optimization, let’s explore some best practices to ensure that your robots.txt file is optimized for SEO:

1. Review and Update Regularly

Regularly review and update your robots.txt file to reflect any changes in your website’s structure or content. As your site evolves, you may need to add or modify disallow and allow rules to ensure that search engine crawlers can access the relevant pages.

2. Use Descriptive Comments

Include descriptive comments in your robots.txt file to provide additional context and make it easier for other developers or website administrators to understand the purpose of specific rules. Comments start with the # character and are ignored by search engine crawlers.

#Sitemaps:
Sitemap: https://www.example.com/sitemap.xml

User-agent: adidxbot # Allow the Bing ads bot to do its thing
Allow: /

3. Test and Validate Your Robots.txt File

Thoroughly test your robots.txt file to ensure that it is working as intended. Use tools such as the robots.txt Tester in Google Search Console or Google’s open-source robots.txt library to validate your file and check for any errors or misconfigurations.
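
Alongside those tools, you can run a quick local check. The sketch below fetches a live robots.txt and verifies that a handful of important URLs are not accidentally blocked; it assumes the third-party protego parser used earlier, and all URLs are placeholders:

from urllib.request import urlopen
from protego import Protego

# Placeholder domain; point this at your own site.
robots_txt = urlopen("https://www.example.com/robots.txt").read().decode("utf-8")
rp = Protego.parse(robots_txt)

important_urls = [
    "https://www.example.com/",
    "https://www.example.com/products/",
    "https://www.example.com/blog/",
]

for url in important_urls:
    verdict = "OK" if rp.can_fetch(url, "Googlebot") else "BLOCKED"
    print(f"{verdict}: {url}")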

4. Leverage Wildcards Carefully

While wildcards can be powerful tools for URL matching, use them with caution. Incorrect use of wildcards can lead to unintended blocking or allowing of URLs. Always double-check and test your rules before deploying them on your live website.

5. Monitor Crawl Errors

Regularly monitor your website’s crawl errors in Google Search Console or other SEO tools. Crawl errors can indicate issues with your robots.txt file, such as blocking important pages or directories unintentionally. Address any crawl errors promptly to ensure optimal crawlability.

6. Keep Sitemaps Up to Date

If you make changes to your website’s structure or content, update your XML sitemaps accordingly and ensure that the sitemap URLs in your robots.txt file are up to date. This helps search engine crawlers discover and index your new or updated pages more efficiently.

Final Words

Optimizing your robots.txt file is an essential part of your overall SEO strategy. By properly configuring your disallow and allow rules and including a reference to your XML sitemaps, you can improve your website’s crawlability and ensure that search engine crawlers can access and index your website effectively. Follow the best practices above to achieve better crawlability for your site. If you need any other SEO support, including a review of your current robots.txt, get in touch with our team!
