Background
When we talk about SEO, the very first thing to handle is robots.txt, the file that tells search engine crawlers which URLs they may access on your site. In a development environment you want none of your work-in-progress pages showing up in Google search results; in production you want the opposite. robots.txt is the gatekeeper for that. A robots.txt file has four directives (a sample file follows the list):
- User-agent - specifies which bots the rules apply to, e.g. Googlebot, AdsBot-Google, bingbot, Slurp
- Disallow - lists the files and directories that bots are blocked from crawling
- Allow - everything is allowed by default; this directive whitelists URLs, e.g. you can disallow a directory yet still allow a few files inside it
- Sitemap - references the location of your sitemaps. Let's leave sitemaps for another post.
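To make the directives concrete, here is a minimal sample robots.txt; the bot name, paths and sitemap URL are placeholders, not from a real site:

```
# Googlebot may not crawl drafts, except one whitelisted page
User-agent: Googlebot
Disallow: /drafts/
Allow: /drafts/public-preview.html

# Every other bot may crawl everything
User-agent: *
Disallow:

# Where the sitemap lives
Sitemap: https://www.example.com/sitemap.xml
```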
Problem
Many site starters/templates either ship without a robots.txt or use a single file that never changes between development and production environments. It is easy to accidentally leave a development Disallow rule in place for the production build.
A site with a misconfigured Disallow can be badly blocked from search engines.
Solution
Here is an example for a Gatsby starter site, but the principle applies to all starters:
Option 1: install a plugin/node module to manage the robots.txt file. This is overkill in my opinion: updating a static file per build stage is simple enough, and maintaining an extra dependency is always overhead.
Option 2: save two versions of the file and switch between them per build stage. I vote for this solution for its sheer simplicity. The steps:
Step 1. Change the default file the site ships with to disallow everything. The sample is a Gatsby site, so the default robots.txt lives in the static folder.
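A development default that blocks every crawler is only two lines. Assuming the Gatsby convention, it sits at static/robots.txt:

```
# static/robots.txt - development default: block all bots from all URLs
User-agent: *
Disallow: /
```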
Step 2. Add robot-prod.txt to a directory such as SEO, and list only the URLs to disallow here. URLs are relative, and the major crawlers support the * and $ wildcards in them.
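The production version allows everything except the paths you want to keep out of search. The paths below are placeholders; only the structure matters:

```
# SEO/robot-prod.txt - production rules: allow everything except a few paths
User-agent: *
Disallow: /admin/
Disallow: /*.pdf$
```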
Step 3. Update your build script to copy the production file to the root folder.
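Here is a minimal sketch of that copy step as an npm script; the script name and paths are assumptions, so adjust them to your repo. Gatsby copies everything in static/ to the root of the built site, so overwriting static/robots.txt before the build is enough:

```json
{
  "scripts": {
    "build": "gatsby build",
    "build:prod": "cp SEO/robot-prod.txt static/robots.txt && gatsby build"
  }
}
```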
Step 4. Verify that search results change. Note it only takes effect after the search engine refreshes its cached copy of robots.txt; for Google that generally takes up to 24 hours.
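While waiting for the cache refresh, you can at least confirm the deployed file right away (swap in your own domain):

```
curl https://www.example.com/robots.txt
```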
For reference, here is the sample commit that makes it work on my Gatsby blog starter:
https://github.com/gatsbyjs/gatsby-starter-blog/commit/075e61748c8e90eb09621ae6b812225d7607da07
Call out
- robots.txt must be placed in the root folder of the site.
- Rules apply to relative URLs only.
- Disallow does not guarantee your pages will fly under the radar; they can still get indexed if other pages link to them. Use the noindex robots meta tag or the X-Robots-Tag HTTP header to completely block bots (see the snippet after this list).
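For completeness, the meta tag version looks like this; it goes into the head of each page you want kept out of the index:

```html
<!-- Tell all bots not to index this page -->
<meta name="robots" content="noindex">
```

The HTTP header equivalent is X-Robots-Tag: noindex, set in your server or hosting configuration.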
Next
Voilà: one commit and a few short steps to bring your robots.txt completely under control. 🐢 The fruit can't hang any lower, so chuck the code into your repo now. In the next post I will talk about the site menu for crawl bots - the sitemap.