MylifeforAiur

Gatsby SEO: Manage robots.txt in different environments


Background

When we talk about SEO, the very first thing to handle is robots.txt, the file that tells search engine crawlers which URLs they can access on your site. In the dev environment you want none of your work-in-progress pages to show up in Google search results, and in prod you want the opposite. robots.txt is the gatekeeper for that. A robots.txt file has four directives (a short example follows the list):

  • User-agent - specifies which bots the rules apply to, such as Googlebot, AdsBot-Google, bingbot, or Slurp.
  • Disallow - specifies the files and directories that bots must not crawl.
  • Allow - the default; everything is allowed unless disallowed. Use it to whitelist URLs, e.g. disallow a directory but allow a few files inside it.
  • Sitemap - references the location of your sitemaps. Let's leave sitemaps for another post.
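
A minimal sketch of a robots.txt that uses all four directives (the paths and sitemap URL below are made up for illustration):

```text
# Hypothetical example: block a drafts directory but whitelist one page in it
User-agent: *
Disallow: /drafts/
Allow: /drafts/published-preview/
Sitemap: https://www.example.com/sitemap.xml
```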

Problem

The site starter/template doesn't ship a robots.txt, or the file doesn't change between development and production environments. Worse, a Disallow setting meant for dev can accidentally be left in place in the production env.

[Screenshot: search result for a site with a mis-configured robots.txt, badly blocked by Disallow]

Solution

The example below is for a Gatsby starter site, but the principle applies to all starters:

  • Install a plugin/node module to manage the robots.txt file. This is overkill in my opinion: updating static files for different build stages is simple enough, and maintaining an extra dependency is always overhead.

  • Save two versions of the text file and switch between them per build stage. I vote for this solution for its sheer simplicity. The steps are listed here:

Step 1. Change the default file used by the site to disallow everything. The sample is a Gatsby site, so the default robots.txt lives in the static folder.

[Screenshot: dev environment robots.txt set to disallow]
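
In dev, the default static/robots.txt can simply keep every crawler out of the whole site. A minimal sketch:

```text
# static/robots.txt (dev default): block all crawlers from all pages
User-agent: *
Disallow: /
```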

Step 2. Add robot-prod.txt to a directory like SEO, and list only the URLs to disallow there. URLs are relative, and wildcard patterns are supported.

[Screenshot: sample robots.txt for production]
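
A sketch of what robot-prod.txt could contain, assuming you only want to hide a couple of internal paths (these paths are made up; everything else stays crawlable):

```text
# SEO/robot-prod.txt (sketch): allow everything except a few internal paths
User-agent: *
Disallow: /admin/
Disallow: /*/draft-*
```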

Step 3. Update your build script to copy it to the root folder of the built site.

[Screenshot: build command that copies the production robots.txt into place]
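
One way to wire this up, assuming an npm/Gatsby build script (the paths here are assumptions; see the linked commit below for the exact change made in the starter):

```sh
# Sketch: build the site, then overwrite robots.txt in the output root
# with the production version before deploying
gatsby build && cp SEO/robot-prod.txt public/robots.txt
```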

Step 4. Verify that the search results change. Note it only takes effect after the search engine refreshes its cached copy of your robots.txt; for Google that can take around 24 hours.
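
Before waiting on the crawler, you can at least confirm the right file got deployed (your-site.example is a placeholder):

```sh
# Quick sanity check: the live robots.txt should be the production version
curl https://your-site.example/robots.txt
```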

For reference, here is the sample commit that makes this work on my Gatsby blog starter:

https://github.com/gatsbyjs/gatsby-starter-blog/commit/075e61748c8e90eb09621ae6b812225d7607da07

Call out

  1. robots.txt must be placed in the root folder of the site.
  2. Rules apply to relative URLs only.
  3. Disallow does not guarantee your pages will fly under the radar: they can still get indexed if other pages link to them. Use the noindex robots meta tag or the X-Robots-Tag HTTP header to completely block bots (see the example after this list).
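
A sketch of the noindex robots meta tag, placed in a page's <head>; for non-HTML resources the equivalent is an X-Robots-Tag: noindex response header:

```html
<!-- Tell crawlers not to index this page, even if other pages link to it -->
<meta name="robots" content="noindex">
```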

Next

Voilà, one commit and three steps (plus a verification) to bring your robots.txt completely under control. 🐢 The fruit can't hang any lower, so chuck the code into your repo now. In the next post I will talk about the crawl bots' site menu: the sitemap.

Other articles of this series

SEO action list

Ref links

  1. The ultimate guide to robots.txt
  2. robots.txt FAQ
  3. Completely block search bots
  4. Gatsby starter with SEO action list
