Advanced customizations

When reading the Google sitemap recommendations, I found this bit of information:

List only canonical URLs in your sitemaps. If you have two versions of a page, list only the (Google-selected) canonical in the sitemap.

A "canonical" URL is the "true home" for a specific entity. If you have multiple URLs that contain the same content, you need to mark one as "canonical" for search engines to use.

If you don't do this, Google will pick a canonical URL on its own, and the duplicate URLs can dilute your ranking signals and hurt your search result rankings 😬

On my blog, post URLs are in the following format: /:category/:slug. This presents a problem, since posts can belong to multiple categories. For example, the post that you're reading right now can be reached through both of these URLs:

/gatsby/seo-friendly-sitemap/
/seo/seo-friendly-sitemap/

The posts on my blog are all written using MDX. In the frontmatter for the posts, I have data that looks like this:

---
title: "Generate an SEO-Friendly Sitemap for your Gatsby Site"
type: tutorial
publishedOn: 2020-03-09T09:30:00-0400
categories: ['gatsby', 'seo']
---

Categories are listed in priority order, so the first category should always form the canonical URL.
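As a minimal sketch of that rule, the canonical path can be derived from the first entry in categories. The helper name getCanonicalPath is hypothetical, and the slug is assumed to be available separately (it is not part of the frontmatter shown above):

```javascript
// Hypothetical helper: the first entry in `categories` is the
// highest-priority one, so it forms the canonical path.
function getCanonicalPath(slug, categories) {
  const [canonicalCategory] = categories;
  return `/${canonicalCategory}/${slug}/`;
}

const canonicalPath = getCanonicalPath('seo-friendly-sitemap', ['gatsby', 'seo']);
console.log(canonicalPath); // "/gatsby/seo-friendly-sitemap/"
```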

The challenge is clear: I need to fetch the categories from my MDX frontmatter and use them to filter the pages included in the sitemap. Delightfully, this is an option with the plugin!

Querying data with GraphQL

Inside our gatsby-config.js, we can write a GraphQL query to pull whatever data we need:

module.exports = {
  siteMetadata: {
    // ✂️
  },
  plugins: [
    {
      resolve: 'gatsby-plugin-sitemap',
      options: {
        exclude: ['/admin', '/confirmed'],
        query: `
          {
            site {
              siteMetadata {
                siteUrl
              }
            }
            allSitePage {
              edges {
                node {
                  path
                }
              }
            }
          }
        `,
      },
    },
  ],
};

By default, the plugin uses a query like this, but we can override it. Here it fetches the siteUrl and the path of every page on the site.

In order to filter out non-canonical results, we first need to expose the right data to GraphQL!

allSitePage is an index of every page created, either by putting a React component in src/pages, or using the createPage API. In my case, I'm generating all articles/tutorials programmatically with createPage.

Here's what a typical createPage call looks like, inside gatsby-node.js:

createPage({
  path: pathname,
  component: path.resolve(...),
  context: {
    /* component props */
  },
});

If you're building a blog with Markdown or MDX, you're probably already using this to generate your pages. You provide a path for the page to live at, a component to mount, and some contextual data that the component might need. Anything passed to context becomes available to the component via props.

Happily, it turns out that context also gets exposed to GraphQL!

I added a new piece of data to context:

createPage({
  path: pathname,
  component: path.resolve(...),
  context: {
    isCanonical: currentCategory === canonicalCategory
  },
});

The currentCategory and canonicalCategory variables were already available to me, since I was iterating through all my data and using it to create these pages.
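For illustration, here is a rough sketch of what that iteration might look like. The posts array and its field names are made up for this example, and the function only collects the arguments rather than calling createPage, so it runs outside of Gatsby. In a real gatsby-node.js, each object would also need a component.

```javascript
// Hypothetical stand-in for the post data gathered in gatsby-node.js.
// The field names are assumptions, not the actual schema.
const posts = [
  { slug: 'seo-friendly-sitemap', categories: ['gatsby', 'seo'] },
  { slug: 'hello-world', categories: ['general'] },
];

// Build one set of createPage arguments per (post, category) pair.
function buildPageArgs(allPosts) {
  const pages = [];

  allPosts.forEach((post) => {
    // Categories are in priority order, so the first one is canonical.
    const canonicalCategory = post.categories[0];

    post.categories.forEach((currentCategory) => {
      pages.push({
        path: `/${currentCategory}/${post.slug}/`,
        context: {
          isCanonical: currentCategory === canonicalCategory,
        },
      });
    });
  });

  return pages;
}

const pageArgs = buildPageArgs(posts);
console.log(pageArgs);
```

Every post produces exactly one page flagged as canonical, no matter how many categories it belongs to.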

With this data added, I could update the GraphQL query passed to the query option in my gatsby-config.js:

query: `
  {
    site {
      siteMetadata {
        siteUrl
      }
    }
    allSitePage {
      edges {
        node {
          path
          context {
            isCanonical
          }
        }
      }
    }
  }
`,

Filtering pages

We've now exposed each page's "canonical status" to GraphQL, and written it into the query that gatsby-plugin-sitemap will use. The final piece of this puzzle: overriding the default "serializer" to specify what should be done with this queried data.

Here's what that looks like:

{
  resolve: `gatsby-plugin-sitemap`,
  options: {
    exclude: ['/admin', '/confirmed'],
    query: /* ✂️ */,
    serialize: ({ site, allSitePage }) => {
      return allSitePage.edges
        .filter(({ node }) => (
          node.context.isCanonical !== false
        ))
        .map(({ node }) => {
          return {
            url: site.siteMetadata.siteUrl + node.path,
            changefreq: 'daily',
            priority: 0.7,
          };
        });
    },
  },
}

serialize is a function that transforms the data from the query into an array of "sitemappy" objects. The items we return will be used as the raw data to generate the sitemap.

Now that we've specified it in GraphQL, we can access node.context.isCanonical to filter out duplicate pages.
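To see the filter in action, the same logic can be run against a hand-written query result. The sample data below is made up and uses a placeholder domain. It also assumes that a page which never sets isCanonical (for example, one created from src/pages) reports the field as null.

```javascript
// Same filter-and-map logic as the serializer, pulled into a plain function.
const serialize = ({ site, allSitePage }) =>
  allSitePage.edges
    .filter(({ node }) => node.context.isCanonical !== false)
    .map(({ node }) => ({
      url: site.siteMetadata.siteUrl + node.path,
      changefreq: 'daily',
      priority: 0.7,
    }));

// Made-up query result, shaped like the GraphQL query.
const sampleData = {
  site: { siteMetadata: { siteUrl: 'https://example.com' } },
  allSitePage: {
    edges: [
      { node: { path: '/gatsby/seo-friendly-sitemap/', context: { isCanonical: true } } },
      { node: { path: '/seo/seo-friendly-sitemap/', context: { isCanonical: false } } },
      // Assumption: a page without the flag comes back as null.
      { node: { path: '/about/', context: { isCanonical: null } } },
    ],
  },
};

const entries = serialize(sampleData);
console.log(entries.map((entry) => entry.url));
```

Comparing against false, rather than checking for a truthy value, means pages that never set the flag stay in the sitemap. Only pages explicitly marked as non-canonical are dropped.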

By using the query and serialize escape hatches built into gatsby-plugin-sitemap, we are given far greater control over the generated sitemap. It also allows us to fine-tune some page-specific options!
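As one possible sketch of page-specific options, the map step could look up changefreq and priority per path instead of hard-coding them. The helper names and the values here are purely illustrative, not recommendations:

```javascript
// Hypothetical rules: give the homepage a higher priority than other pages.
function getSitemapOptions(path) {
  if (path === '/') {
    return { changefreq: 'daily', priority: 1.0 };
  }
  return { changefreq: 'monthly', priority: 0.7 };
}

// Build a single sitemap entry; this could replace the body of the
// .map() callback inside serialize.
function toSitemapEntry(siteUrl, node) {
  return {
    url: siteUrl + node.path,
    ...getSitemapOptions(node.path),
  };
}

console.log(toSitemapEntry('https://example.com', { path: '/' }));
console.log(toSitemapEntry('https://example.com', { path: '/gatsby/seo-friendly-sitemap/' }));
```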
