Originally published on Prerender.io: Robots.Txt Files & SEO – Best Practices, and Fixes for Common Issues
Technical SEO is the practice of optimizing the various on-page and off-page ranking signals that help your website rank higher in SERPs. Each SEO tactic plays into the larger goal of boosting your page rank by ensuring web crawlers can easily crawl, index, and rank your website.
From page speed to proper title tags, there are many ranking signals that technical SEO can help with. But did you know that one of the most important files for your website’s SEO is also found on your server?
The robots.txt file is a plain text file that tells web crawlers which pages on your website they can and cannot crawl. This might not seem like a big deal, but if your robots.txt file is not configured correctly, it can have a serious negative effect on your website's SEO.
In this blog post, we'll cover everything you need to know about robots.txt: what a robots.txt file is in SEO, best practices for using one, and how to fix common issues.
What Is a robots.txt File & Why Is It Important in SEO?
A robots.txt file is a text file located on your server that tells web crawlers which pages they can and cannot access. If a web crawler tries to crawl a page that is blocked in the robots.txt file, Google may treat it as a soft 404 error.
Although a soft 404 error will not directly hurt your website's ranking, it is still an error. Too many errors on your website can slow your crawl rate, and decreased crawling can eventually hurt your rankings.
If your website has a lot of pages that are blocked by the robots.txt file, it can also lead to a wasted crawl budget. The crawl budget is the number of pages Google will crawl on your website during each visit.
Another reason why robots.txt files are important in SEO is that they give you more control over the way Googlebot crawls and indexes your website. If you have a website with a lot of pages, you might want to block certain pages from being indexed so they don’t overwhelm search engine web crawlers and hurt your rankings.
If you have a blog with hundreds of posts, you might want to only allow Google to index your most recent articles. If you have an eCommerce website with a lot of product pages, you might want to only allow Google to index your main category pages.
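As a sketch of what this looks like in practice (the paths here are hypothetical), an eCommerce site might keep crawlers focused on category pages like this:

```
User-agent: Googlebot
# Block individual product pages...
Disallow: /products/
# ...but keep category pages crawlable. Google applies the most
# specific (longest) matching rule, so this Allow wins for
# URLs under /products/categories/.
Allow: /products/categories/
```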
Configuring your robots.txt file correctly can help you control the way Googlebot crawls and indexes your website, which can eventually help improve your ranking.
What Google Says About robots.txt File Best Practices
Now that we’ve gone over why robots.txt files are important in SEO, let’s discuss some best practices recommended by Google.
Create a File Named robots.txt
The first step is to create a file named robots.txt. This file needs to be placed in the root directory of your website – the highest-level directory that contains all other files and directories on your website.
Here's an example of proper placement: on apple.com, the root directory is apple.com/, so the file would live at apple.com/robots.txt.
You can create a robots.txt file with any text editor, but many CMSs, such as WordPress, will create one for you automatically.
Add Rules to the robots.txt File
Once you’ve created the robots.txt file, the next step is to add rules. These rules will tell web crawlers which pages they can and cannot access.
There are two main robots.txt directives you can add: Allow and Disallow.
Allow rules will tell web crawlers that they are allowed to crawl a certain page.
Disallow rules will tell web crawlers that they are not allowed to crawl a certain page.
For example, if you want to allow web crawlers to crawl your homepage, you would add the following rule:
Allow: /
If you want to disallow web crawlers from crawling a certain subfolder, such as your blog, you would use:
Disallow: /blog/
(Be careful: a bare Disallow: / blocks your entire site.)
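Putting the two directives together, a minimal robots.txt file (with a hypothetical /private/ folder) might look like this:

```
# Rules for all crawlers
User-agent: *
Allow: /
Disallow: /private/
```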
Upload the robots.txt File to Your Site
After you have added the rules to your robots.txt file, the next step is to upload it to your website. You can do this using an FTP client or your hosting control panel.
If you’re not sure how to upload the file, contact your web host and they should be able to help you.
Test Your robots.txt File
After you have uploaded the robots.txt file to your website, the next step is to test that it's working correctly. Google provides a free tool called the robots.txt Tester in Google Search Console that you can use to test your file. It only works for robots.txt files located in the root directory of your website.
To use the robots.txt tester, enter the URL of your website into the robots.txt Tester tool and then test it. Google will then show you the contents of your robots.txt file as well as any errors it found.
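If you'd rather sanity-check rules locally before uploading, Python's standard library ships a basic robots.txt parser. A minimal sketch with hypothetical rules (note that urllib.robotparser does not implement Google's * and $ wildcard extensions, so treat it as a rough local check only):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; in practice you could call
# parser.set_url("https://example.com/robots.txt") and parser.read()
# to fetch your live file instead.
# Note: this parser applies rules in file order (not Google's
# longest-match rule), so the narrower Allow line comes first.
RULES = """\
User-agent: *
Allow: /blog/latest/
Disallow: /blog/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/blog/old-post"))        # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/latest/new-post")) # True
```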
Use Google’s Open-Source Robots Library
If you are a more experienced developer, Google also has an open-source robots library that you can use to manage your robots.txt file locally on your computer.
What Can Happen to Your Website’s SEO if a robots.txt File Is Broken or Missing?
If your robots.txt file is broken or missing, it can cause search engine crawlers to index pages that you don’t want them to. This can eventually lead to those pages being ranked in Google, which is not ideal. It may also result in site overload as crawlers try to index everything on your website.
A broken or missing robots.txt file can also cause search engine crawlers to miss important pages on your website. If you have a page that you want to be indexed, but it’s being blocked by a broken or missing robots.txt file, it may never get indexed.
In short, it’s important to make sure your robots.txt file is working correctly and that it’s located in the root directory of your website. Rectify this problem by creating new rules or uploading the file to your root directory if it’s missing.
Best Practices for Robots.txt Files
Now that you know the basics of robots.txt files, let’s go over some best practices. These are things you should do to make sure your file is effective and working properly.
Use a New Line for Each Directive
When you’re adding rules to your robots.txt file, it’s important to use a new line for each directive to avoid confusing search engine crawlers. This includes both Allow and Disallow rules.
For example, if you want to disallow web crawlers from crawling your blog and your contact page, you would add the following rules:
Disallow: /blog/
Disallow: /contact/
Use Wildcards To Simplify Instructions
If you have a lot of pages that you want to block, it can be time-consuming to add a rule for each one. Fortunately, you can use wildcards to simplify your instructions.
A wildcard is a character that can represent one or more characters. The most common wildcard is the asterisk (*).
For example, if you want to block all files that end in .jpg, you would add the following rule:
Disallow: /*.jpg
Use “$” To Specify the End of a URL
The dollar sign ($) is another wildcard that you can use to mark the end of a URL. This is helpful if you want to block an exact path but not other URLs that begin with the same characters.
For example, if you want to block the contact page but not the contact-success page, you would add the following rule:
Disallow: /contact$
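The two wildcards can also be combined. For example (a hypothetical case), to block only URLs that end in .pdf rather than every URL containing .pdf:

```
User-agent: *
# Without the $, this would also match URLs like
# /files/report.pdf?download=1
Disallow: /*.pdf$
```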
Use Each User Agent Only Once
When you're adding rules to your robots.txt file, Google doesn't mind if you list the same user agent multiple times; it simply merges the rules. However, it's considered best practice to declare each user agent only once, with all of its rules grouped together.
Use Specificity To Avoid Unintentional Errors
When it comes to robots.txt files, specificity is key. The more specific you are with your rules, the less likely you are to make an error that could hurt your website’s SEO.
Use Comments To Explain Your robots.txt File to Humans
Although robots.txt files are written for bots, humans still need to be able to understand, maintain, and manage them. This is especially true if you have multiple people working on your website.
You can add comments to your robots.txt file to explain what certain rules do. Comments start with a # symbol; crawlers ignore everything from the # to the end of the line.
For example, if you want to block all files that end in .jpg, you could add the following comment:
Disallow: /*.jpg # Block all files that end in .jpg
This would help anyone who needs to manage your robots.txt file understand what the rule is for and why it’s there.
Use a Separate robots.txt File for Each Subdomain
If you have a website with multiple subdomains, it’s best to create a separate robots.txt file for each one. This helps to keep things organized and makes it easier for search engine crawlers to understand your rules.
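For example, each subdomain serves its own file from its own root (the domains and paths below are illustrative):

```
# Served at https://www.example.com/robots.txt
User-agent: *
Disallow: /private/

# Served at https://blog.example.com/robots.txt
User-agent: *
Disallow: /drafts/
```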
Common Robots.txt File Mistakes and How To Fix Them
Understanding the most common mistakes people make with their robots.txt files can help you avoid making them yourself. Here are some of the most common mistakes and how to fix these technical SEO issues.
Missing robots.txt File
The most common robots.txt file mistake is not having one at all. If you don’t have a robots.txt file, search engine crawlers will assume that they are allowed to crawl your entire website.
To fix this, you’ll need to create a robots.txt file and add it to your website’s root directory.
Robots.txt File Not in the Directory
If you don’t have a robots.txt file in your website’s root directory, search engine crawlers won’t be able to find it. As a result, they will assume that they are allowed to crawl your entire website.
The file must be a single text file named robots.txt, placed in the root directory rather than in a subfolder.
No Sitemap URL
Your robots.txt file should always include a link to your website’s sitemap. This helps search engine crawlers find and index your pages.
Omitting the sitemap URL from your robots.txt file is a common mistake. While leaving it out won't directly hurt your website's SEO, adding it helps crawlers discover and index your pages.
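The sitemap is referenced with a Sitemap directive, which can appear anywhere in the file and must use a full URL (the one below is a placeholder):

```
Sitemap: https://www.example.com/sitemap.xml

User-agent: *
Disallow: /admin/
```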
Blocking CSS and JS
According to John Mueller, you must avoid blocking CSS and JS files as Google search crawlers require them to render the page correctly.
Naturally, if the bots can’t render your pages, they won’t be indexed.
Using NoIndex in robots.txt
Since 2019, Google has no longer supported the unofficial noindex directive in robots.txt files. As a result, you should avoid using it there.
If your robots.txt file still contains noindex rules, remove them as soon as possible and keep pages out of the index another way.
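The supported alternative is the robots meta tag in the page's HTML head:

```html
<meta name="robots" content="noindex">
```

The same directive can also be sent as an X-Robots-Tag: noindex HTTP response header, which is useful for non-HTML resources such as PDFs.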
Improper Use Of Wildcards
Used incorrectly, wildcards can end up blocking files and directories you never intended to.
When using wildcards, be as specific as possible. This will help you avoid mistakes that could hurt your website's SEO. Also, stick to the two supported wildcards: the asterisk (*) and the dollar sign ($).
Wrong File Type Extension
As the name implies, a robots.txt file must be a plain text file that ends in .txt and is encoded in UTF-8. It cannot be an HTML file, an image, or any other file type. Useful introductory resources are Google's robots.txt guide and Google's robots.txt FAQ.
Use Robots.txt Files Like a Pro
A robots.txt file is a powerful tool that can be used to improve your website’s SEO. However, it’s important to use it correctly.
When used properly, a robots.txt file can help you control which pages are indexed by search engines and improve your website’s crawlability. It can also help you avoid duplicate content issues.
On the other hand, if used incorrectly, a robots.txt file can do more harm than good. Avoiding the common mistakes and following the best practices above will help you use your robots.txt file to its full potential and improve your website's SEO.
In addition to helping you expertly manage robots.txt files, dynamic rendering with Prerender offers the opportunity to produce static HTML for complex JavaScript websites, enabling faster indexation, faster response times, and an overall better user experience.