This post demonstrates enriching an XML sitemap with lastmod timestamps based on git commits.
Reading git log in Node.js
In the last post I showed how to manipulate XML in Node.js, and filter our sitemap. In this post we'll build upon what we did last time, read the git log in Node.js and use that to power a lastmod property.
To read the git log in Node.js we'll use the simple-git package. It's a great package that makes it easy to read the git log. Other stuff too - but that's what we care about today.
yarn add simple-git
To work with simple-git we need to create a Git instance. We can do that like so:
import * as path from 'path';
import { simpleGit, SimpleGit, SimpleGitOptions } from 'simple-git';

function getSimpleGit(): SimpleGit {
  // the repository root is one directory above the current working directory
  const baseDir = path.resolve(process.cwd(), '..');
  const options: Partial<SimpleGitOptions> = {
    baseDir,
    binary: 'git',
    maxConcurrentProcesses: 6,
    trimmed: false,
  };
  const git = simpleGit(options);
  return git;
}
From sitemap to git log
It's worth pausing to consider what our sitemap looks like:
<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>https://blog.johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants</loc>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>
<url>
<loc>https://blog.johnnyreilly.com/2022/09/20/react-usesearchparamsstate</loc>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
</url>
<!-- ... -->
</urlset>
If you look at the URL (loc) you can see that it's fairly easy to determine the path to the original markdown file. If we take https://blog.johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants, we can see that the path to the markdown file is blog-website/blog/2012-01-07-standing-on-shoulders-of-giants/index.md.
As long as we don't have a custom slug in play (and I rarely do), we have a reliable way to get from blog post URL (loc) to markdown file. With that we can use simple-git to get the git log for that file. We can then use that to populate the lastmod property.
const dateBlogUrlRegEx = /(\d\d\d\d\/\d\d\/\d\d)\/(.+)/;

async function enrichUrlsWithLastmod(
  filteredUrls: SitemapUrl[]
): Promise<SitemapUrl[]> {
  const git = getSimpleGit();
  const urls: SitemapUrl[] = [];
  for (const url of filteredUrls) {
    if (urls.includes(url)) {
      continue;
    }
    try {
      // example url.loc: https://blog.johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants
      const pathWithoutRootUrl = url.loc.replace(rootUrl + '/', ''); // eg 2012/01/07/standing-on-shoulders-of-giants
      const match = pathWithoutRootUrl.match(dateBlogUrlRegEx);
      if (!match || !match[1] || !match[2]) {
        urls.push(url);
        continue;
      }
      const date = match[1].replaceAll('/', '-'); // eg 2012-01-07
      const slug = match[2]; // eg standing-on-shoulders-of-giants
      const file = `blog-website/blog/${date}-${slug}/index.md`;
      const log = await git.log({
        file,
      });
      const lastmod = log.latest?.date.substring(0, 10);
      urls.push(lastmod ? { ...url, lastmod } : url);
      console.log(url.loc, lastmod);
    } catch (e) {
      console.log('file date not looked up', url.loc, e);
      urls.push(url);
    }
  }
  return urls;
}
Above we're using a regular expression to extract the date and slug from the URL. We use those to construct the path to the markdown file, and then use simple-git to get the git log for that file. We then take the latest commit date, use it to populate the lastmod property, and push the result onto the urls array. URLs that don't match the pattern, or whose lookup fails, are pushed unchanged.
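The URL-to-file step can also be pulled out into a small pure function, which makes it easy to check without touching git. This is only a sketch: the name locToMarkdownPath is hypothetical, the regular expression is an anchored variant of the one above, and it assumes the blog-website/blog layout described earlier with no custom slugs.

```typescript
// Hypothetical helper: maps a blog post URL (loc) to its markdown file path.
// Assumes the blog-website/blog/<date>-<slug>/index.md layout and no custom slug.
const datedPostRegEx = /^(\d{4})\/(\d{2})\/(\d{2})\/(.+)$/;

function locToMarkdownPath(loc: string, rootUrl: string): string | undefined {
  if (!loc.startsWith(rootUrl + '/')) {
    return undefined; // not a URL under our site root
  }
  const pathWithoutRootUrl = loc.slice(rootUrl.length + 1);
  const match = datedPostRegEx.exec(pathWithoutRootUrl);
  if (!match) {
    return undefined; // not a dated blog post URL
  }
  const [, year, month, day, slug] = match;
  return `blog-website/blog/${year}-${month}-${day}-${slug}/index.md`;
}

console.log(
  locToMarkdownPath(
    'https://blog.johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants',
    'https://blog.johnnyreilly.com'
  )
);
// prints "blog-website/blog/2012-01-07-standing-on-shoulders-of-giants/index.md"
```

Returning undefined for anything that isn't a dated post mirrors the behaviour above, where such URLs are left without a lastmod.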
Finally we return the urls array and assign it to the sitemap before we write it out:
sitemap.urlset.url = await enrichUrlsWithLastmod(filteredUrls);
Our new sitemap looks like this:
<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>https://blog.johnnyreilly.com/2012/01/07/standing-on-shoulders-of-giants</loc>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
<lastmod>2021-12-19</lastmod>
</url>
<url>
<loc>https://blog.johnnyreilly.com/2012/01/14/jqgrid-its-just-far-better-grid</loc>
<changefreq>weekly</changefreq>
<priority>0.5</priority>
<lastmod>2022-11-03</lastmod>
</url>
<!-- ... -->
</urlset>
You can see the lastmod property has been populated for each URL, based upon the most recent commit for its file. Yay!
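One detail worth knowing about the substring(0, 10) approach: it keeps the calendar date as written in the commit's own timezone offset. The sketch below contrasts that with normalising to UTC first. It assumes the date string is ISO 8601 style and starts with YYYY-MM-DD (which is what the substring call relies on); both function names and the sample date are made up for illustration.

```typescript
// Hypothetical helpers, not part of simple-git.

// Keeps the calendar date exactly as written in the commit's own UTC offset.
function lastmodFromCommitDate(isoDate: string): string {
  return isoDate.substring(0, 10);
}

// Alternative: normalise to UTC before taking the date part.
function lastmodInUtc(isoDate: string): string {
  return new Date(isoDate).toISOString().substring(0, 10);
}

const commitDate = '2022-11-03T00:30:00+01:00'; // made-up example value
console.log(lastmodFromCommitDate(commitDate)); // prints "2022-11-03"
console.log(lastmodInUtc(commitDate)); // prints "2022-11-02"
```

For a sitemap either choice is probably fine; the point is only that the two can differ by a day for commits made close to midnight.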
GitHub Actions - fetch-depth
You might think we were done (I thought we were done), but we're not. We're not done because we're using GitHub Actions to build the site.
When I tested this locally, it worked fine. However, when I pushed it to GitHub Actions, it surfaced a latest.date which wasn't populated with the value you'd hope for. The reason was that the fetch-depth of the checkout was set to 1 (the default). That gives a shallow checkout containing only the most recent commit, so the git log couldn't provide the per-file history we need. Changing the fetch-depth to 0 fetches the full history and resolves the situation.
- uses: actions/checkout@v3
  with:
    # Number of commits to fetch. 0 indicates all history for all branches and tags.
    # Default: 1
    fetch-depth: 0
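If you'd like the build to notice when this goes wrong again, a rough sanity check is possible. The sketch below assumes that a shallow checkout tends to report the same single commit for every file, so every lastmod comes out identical. The function name lastmodLooksShallow and the UrlWithLastmod shape are hypothetical, and this is a heuristic rather than a definitive test.

```typescript
// Minimal shape for this sketch; the real SitemapUrl type may have more fields.
interface UrlWithLastmod {
  loc: string;
  lastmod?: string;
}

// Hypothetical heuristic: true when at least two URLs have a lastmod
// and every one of those lastmod values is identical.
function lastmodLooksShallow(urls: UrlWithLastmod[]): boolean {
  const dates = urls
    .map((url) => url.lastmod)
    .filter((lastmod): lastmod is string => lastmod !== undefined);
  if (dates.length < 2) {
    return false; // not enough data to judge
  }
  return new Set(dates).size === 1;
}

// Made-up example data.
const suspicious: UrlWithLastmod[] = [
  { loc: 'https://example.com/2012/01/07/a', lastmod: '2022-11-03' },
  { loc: 'https://example.com/2012/01/14/b', lastmod: '2022-11-03' },
];
if (lastmodLooksShallow(suspicious)) {
  console.warn('Every lastmod is identical; is fetch-depth set to 0?');
}
```

A blog where every post genuinely changed on the same day would trip this too, so it's better suited to a warning than a build failure.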