DEV Community

Mike Fallows

Posted on • Originally published at mikefallows.com on

Adding a `robots.txt` file to an Eleventy site

Having set up a fair few sites over the years, I periodically get a chunk of emails from Google's Search Console notifying me of indexing errors. Usually, these are pretty small potatoes caused by things like old products being unpublished and will resolve themselves on the next crawl.

I realised I hadn't yet set up a robots.txt file for this site when a couple of errors popped up in Search Console that I wouldn't have expected. The errors I had were for:

https://mikefallows.com/cdn-cgi/l/email-protection
https://mikefallows.com/admin/

I had already set up a sitemap.xml file but somehow overlooked creating a robots.txt file at the time. The /admin/ URL was easy for me to identify: it belongs to my Forestry integration, which I use as a CMS. That page has a noindex meta tag in the head:

<meta name="robots" content="noindex" />

I was mistaken

I was under the impression that this would be enough to tell Search Console to ignore the page, but it turns out the page was also being included in my sitemap.xml, so it was (quite rightly) marked as invalid.

The other URL, which started with /cdn-cgi/l/email-protection, was more of a mystery. I hadn't added anything in a /cdn-cgi/ folder! The fact that it contained a reference to a CDN was a clue, so I wondered if it was related to Netlify, but I couldn't think of an obvious reason why it would have any reference to emails. After a bit of quick research, I realised this was related to Cloudflare, which I'd recently set up for the site. As I had activated their proxy in front of the site, that explained the unknown folder, and the email-protection path appears to be part of their email address obfuscation, which hides addresses from scraping bots.

So to fix these validation errors in Search Console I needed to:

  • add a robots.txt file that disallows /admin/ and /cdn-cgi/
  • exclude /admin/ from my sitemap.xml

Adding a robots.txt file

This is super-easy in Eleventy. I created a file at src/robots.txt and added the following to my .eleventy.js config:

// Put robots.txt in root
eleventyConfig.addPassthroughCopy({ 'src/robots.txt': '/robots.txt' });

The addPassthroughCopy method will just copy the file "as is" into the generated _site folder. Great.
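For context, here is a minimal sketch of where that line might sit in a complete config file. The function name `configureEleventy` and the `src` input directory are assumptions for illustration, not a copy of my real config; `_site` is Eleventy's default output folder.

```javascript
// .eleventy.js (sketch; `configureEleventy` and the `src` input dir are assumptions)
function configureEleventy(eleventyConfig) {
  // Copy src/robots.txt to the root of the output folder, unchanged
  eleventyConfig.addPassthroughCopy({ 'src/robots.txt': '/robots.txt' });

  // Tell Eleventy where to read from and where to write the built site
  return {
    dir: { input: 'src', output: '_site' },
  };
}

module.exports = configureEleventy;
```

With a setup like this, the built site should contain `_site/robots.txt` alongside the generated pages.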

My robots.txt file looked like this:

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /cdn-cgi/

Host: https://mikefallows.com/

Sitemap: https://mikefallows.com/sitemap.xml

The important parts were the two Disallow rules, which tell bots not to crawl those paths. Strictly speaking, Disallow controls crawling rather than indexing; keeping a page out of search results is the job of the noindex tag.
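To make the effect of those rules concrete, here is a rough sketch of how a well-behaved crawler might evaluate them. It is a simplification: it only does plain prefix matching, with the longest matching rule winning and Allow winning ties, and it ignores wildcards and per-bot groups. The `rules` array and `isAllowed` function are hypothetical names for illustration, not part of any library.

```javascript
// The Allow/Disallow lines from the robots.txt above, as data
const rules = [
  { type: 'allow', path: '/' },
  { type: 'disallow', path: '/admin/' },
  { type: 'disallow', path: '/cdn-cgi/' },
];

// Returns true when a crawler following `ruleList` may fetch `urlPath`.
// Simplified: prefix matching only, longest rule wins, allow wins ties.
function isAllowed(urlPath, ruleList = rules) {
  let best = null;
  for (const rule of ruleList) {
    if (!urlPath.startsWith(rule.path)) continue;
    const isLonger = best === null || rule.path.length > best.path.length;
    const isAllowTie =
      best !== null &&
      rule.path.length === best.path.length &&
      rule.type === 'allow';
    if (isLonger || isAllowTie) best = rule;
  }
  // No matching rule at all means the path is allowed
  return best === null || best.type === 'allow';
}

console.log(isAllowed('/admin/')); // false
console.log(isAllowed('/cdn-cgi/l/email-protection')); // false
console.log(isAllowed('/blog/')); // true
```

Because the more specific /admin/ and /cdn-cgi/ rules are longer than the catch-all Allow: /, they take precedence for those paths and everything else stays crawlable.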

You can also view whatever the current version is.

Excluding pages from sitemap.xml

My sitemap is generated by a single sitemap.xml.njk file:

---
permalink: /sitemap.xml
eleventyExcludeFromCollections: true
---
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{%- for page in collections.all %}
  {%- set adminUrl = r/^\/admin\//i.test(page.url) %}
  {%- set draft = page.data.draft %}
  {%- if not adminUrl and not draft %}
    {%- set absoluteUrl %}{{ page.url | url | absoluteUrl(metadata.url) }}{% endset %}
    <url>
      <loc>{{ absoluteUrl }}</loc>
      <lastmod>{{ page.date | htmlDateString }}</lastmod>
      <changefreq>{{ page.data.changeFreq if page.data.changeFreq else "monthly" }}</changefreq>
    </url>
  {%- endif %}
{%- endfor %}
</urlset>

This generates an XML file for all pages, excluding any whose URL begins with /admin/ or that are marked as drafts. I usually default my posts to draft until I'm ready to publish them, and drafts are excluded by checking whether the front matter has its draft value set to true.
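The same filtering could also live in JavaScript rather than in the template. The following is only a sketch of that alternative, not what this site does: `isSitemapPage`, `registerSitemapCollection` and the `sitemapPages` collection name are hypothetical, while `addCollection` and `getAll` are part of Eleventy's collection API.

```javascript
// Mirrors the template's condition: skip /admin/ URLs and drafts
function isSitemapPage(page) {
  const isAdminUrl = /^\/admin\//i.test(page.url);
  const isDraft = Boolean(page.data && page.data.draft);
  return !isAdminUrl && !isDraft;
}

// Registers a collection containing only the pages that belong in the sitemap.
// Call this from inside the function exported by .eleventy.js.
function registerSitemapCollection(eleventyConfig) {
  eleventyConfig.addCollection('sitemapPages', (collectionApi) =>
    collectionApi.getAll().filter(isSitemapPage)
  );
}
```

The template would then loop over `collections.sitemapPages` and could drop the two `set` lines and the `if` check. Keeping the logic in the template, as above, has the advantage that everything about the sitemap stays in one file.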

The key bit was writing the regex to test that a URL begins with /admin/:

set adminUrl = r/^\/admin\//i.test(page.url)

Just to break that down:

  • r/ opens the pattern: regular expressions in Nunjucks need to be prefixed with r
  • ^ indicates we're only matching the start of the string
  • \/admin\/ literally matches /admin/ (the slashes are escaped)
  • /i closes the pattern and makes the test case-insensitive (y'know, just in case)
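The pattern behaves the same way in plain JavaScript (minus the r prefix), which makes it easy to try out in a console. This is a small sketch, and the sample URLs and the `isAdminUrl` helper are made up for illustration:

```javascript
// Same pattern as the Nunjucks template, written as a JavaScript regex literal
const adminUrlPattern = /^\/admin\//i;

function isAdminUrl(url) {
  return adminUrlPattern.test(url);
}

console.log(isAdminUrl('/admin/')); // true
console.log(isAdminUrl('/Admin/settings/')); // true (case-insensitive)
console.log(isAdminUrl('/blog/admin/')); // false (not at the start)
console.log(isAdminUrl('/administrator/')); // false (no slash after "admin")
```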

What I find most enjoyable about having my own site is having the time to tinker and dig through these types of issues. When they're low pressure like this one, it's great to spend a little time polishing and learning in a way that I often miss in client work. For such a small task, I solidified my knowledge of Eleventy, sitemaps, regex, and robots files just a little bit more, and that's mostly due to taking the time to write it up.
