Background
When we talk about SEO, the very first thing to handle is robots.txt, the file that tells search engine crawlers which URLs they may access on your site. In a development environment you want none of your work-in-progress pages showing up in Google search results; in production you want the opposite. robots.txt is the gatekeeper for that. A robots.txt file has four directives (a sample file follows the list):
- User-agent - specifies which bots the rules apply to, e.g. Googlebot, AdsBot-Google, bingbot, Slurp
- Disallow - lists the files and directories that bots are blocked from crawling
- Allow - everything is allowed by default; this directive whitelists URLs, e.g. you can disallow a directory yet still allow a few files inside it
- Sitemap - references the location of your sitemaps. Let's leave sitemaps for another post.
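To make the directives concrete, here is a minimal sample robots.txt; the bot name, paths and sitemap URL are placeholders, not from a real site:

```
# Googlebot may not crawl drafts, except one whitelisted page
User-agent: Googlebot
Disallow: /drafts/
Allow: /drafts/public-preview.html

# Every other bot may crawl everything
User-agent: *
Disallow:

# Where the sitemap lives
Sitemap: https://www.example.com/sitemap.xml
```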
Problem
Many site starters/templates either ship without a robots.txt or use a single file that never changes between development and production environments. It is easy to accidentally leave a development Disallow rule in place for the production build.
A site with a misconfigured Disallow can be badly blocked from search engines.
Solution
Here is an example for a Gatsby starter site, but the principle applies to all starters:
Option 1: install a plugin/node module to manage the robots.txt file. This is overkill in my opinion: updating a static file per build stage is simple enough, and maintaining an extra dependency is always overhead.
Option 2: save two versions of the file and switch between them per build stage. I vote for this solution for its sheer simplicity. The steps:
Step 1. Change the default file the site ships with to disallow everything. The sample is a Gatsby site, so the default robots.txt lives in the static folder.
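A development default that blocks every crawler is only two lines. Assuming the Gatsby convention, it sits at static/robots.txt:

```
# static/robots.txt - development default: block all bots from all URLs
User-agent: *
Disallow: /
```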
Step 2. Add robot-prod.txt to a directory such as SEO, and list only the URLs to disallow here. URLs are relative, and the major crawlers support the * and $ wildcards in them.
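The production version allows everything except the paths you want to keep out of search. The paths below are placeholders; only the structure matters:

```
# SEO/robot-prod.txt - production rules: allow everything except a few paths
User-agent: *
Disallow: /admin/
Disallow: /*.pdf$
```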
Step 3. Update your build script to copy the production file to the root folder.
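Here is a minimal sketch of that copy step as an npm script; the script name and paths are assumptions, so adjust them to your repo. Gatsby copies everything in static/ to the root of the built site, so overwriting static/robots.txt before the build is enough:

```json
{
  "scripts": {
    "build": "gatsby build",
    "build:prod": "cp SEO/robot-prod.txt static/robots.txt && gatsby build"
  }
}
```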
Step 4. Verify that search results change. Note it only takes effect after the search engine refreshes its cached copy of robots.txt; for Google that generally takes up to 24 hours.
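While waiting for the cache refresh, you can at least confirm the deployed file right away (swap in your own domain):

```
curl https://www.example.com/robots.txt
```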
For reference, here is the sample commit that makes it work on my Gatsby blog starter:
https://github.com/gatsbyjs/gatsby-starter-blog/commit/075e61748c8e90eb09621ae6b812225d7607da07
Call out
- robots.txt must be placed in the root folder of the site.
- Rules apply to relative URLs only.
- Disallow does not guarantee your pages will fly under the radar; they can still get indexed if other pages link to them. Use the noindex robots meta tag or the X-Robots-Tag HTTP header to completely block bots (see the snippet after this list).
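For completeness, the meta tag version looks like this; it goes into the head of each page you want kept out of the index:

```html
<!-- Tell all bots not to index this page -->
<meta name="robots" content="noindex">
```

The HTTP header equivalent is X-Robots-Tag: noindex, set in your server or hosting configuration.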
Next
Voilà: one commit and a few short steps to bring your robots.txt completely under control. 🐢 The fruit can't hang any lower, so chuck the code into your repo now. In the next post I will talk about the site menu for crawl bots - the sitemap.