I use GitHub Pages for my personal website, as well as for several project sites. Although some static site generators include support for sitemap generation (e.g., Jekyll has a plugin for sitemaps), my personal website is generated by a custom static site generator that I built for a few specialized reasons, and most of my project sites for Java libraries consist of a single hand-written HTML page combined with javadoc-generated documentation. So a while back I implemented a GitHub Action, generate-sitemap, that generates an XML sitemap by crawling a GitHub repository containing the HTML of the site. It uses the last commit date of each file to produce the <lastmod> tags. By default, it includes URLs for HTML and PDF files in the sitemap and skips other file extensions in the repository, but it can be configured to include URLs for whatever file extensions you want. It checks the head of each HTML page for noindex meta tags and excludes such pages from the sitemap, and it likewise excludes files that match a Disallow rule in your robots.txt. The generate-sitemap action can be configured in a few other ways as well (see the documentation in the GitHub repository for all details). It is implemented in Python as a container action.
Prerequisite Workflow Step
For the <lastmod> dates to be determined correctly, the step that checks out your repository must set the optional fetch-depth: 0 input of actions/checkout to fetch the full git history. By default, actions/checkout fetches only a single commit, which does not carry the per-file commit history that the action relies on. Use a step like the following:
    steps:
    - name: Checkout the repo
      uses: actions/checkout@v3
      with:
        fetch-depth: 0
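As an optional sanity check, a step like the one below can confirm that the full history was fetched before the sitemap is generated. This is only a sketch and is not part of the generate-sitemap action: the step name is made up for illustration, and it relies on git rev-parse --is-shallow-repository, which requires Git 2.15 or later (assumed to be available on the runner).

```yaml
- name: Verify full history (hypothetical sanity check)
  run: |
    # git prints "true" for a shallow clone and "false" when the full history is present.
    if [ "$(git rev-parse --is-shallow-repository)" = "true" ]; then
      echo "Shallow clone detected: set fetch-depth: 0 on actions/checkout" >&2
      exit 1
    fi
```

Placed right after the checkout step, this fails the job early instead of letting a sitemap with misleading dates be committed.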
Example Workflow
Here is an example workflow. It runs on pushes to the main branch and starts with the checkout step described above. By default, the generate-sitemap action assumes that the entire repository is the website (you can change that behavior with the input path-to-root). The most important input is probably base-url-path, which specifies the URL to the root of your site. This example workflow includes html and pdf files in the sitemap, which is the default. There are optional inputs that can be used to exclude either of these, and an optional input, additional-extensions, that can be used to include files of any other types you want in the sitemap.
    name: Generate xml sitemap
    on:
      push:
        branches: [ main ]
    jobs:
      sitemap_job:
        runs-on: ubuntu-latest
        name: Generate a sitemap
        steps:
        - name: Checkout the repo
          uses: actions/checkout@v3
          with:
            fetch-depth: 0
        - name: Generate the sitemap
          uses: cicirello/generate-sitemap@v1
          with:
            base-url-path: https://www.example.com/
        - name: Commit and push
          run: |
            if [[ `git status --porcelain sitemap.xml` ]]; then
              git config --global user.name 'github-actions'
              git config --global user.email '41898282+github-actions[bot]@users.noreply.github.com'
              git add sitemap.xml
              git commit -m "Automated sitemap update" sitemap.xml
              git push
            fi
The generate-sitemap action doesn't commit and push, so you need a step in your workflow to do that. In the above example workflow, the last step uses a simple shell script to commit and push. This example does the commit as the github-actions bot. If you'd rather be the committer, then adjust that step as necessary. There are also actions in the GitHub Marketplace that can be used for the commit and push step if you prefer.
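The optional inputs mentioned earlier slot into the same generate step. The following is a sketch rather than a tested configuration: it assumes the site lives in a docs directory (a hypothetical layout), and it assumes that additional-extensions takes a space-separated list of extensions, so check the action's documentation for the exact value format.

```yaml
- name: Generate the sitemap
  uses: cicirello/generate-sitemap@v1
  with:
    base-url-path: https://www.example.com/
    # Assumption: the website is in the docs directory rather than the repository root.
    path-to-root: docs
    # Assumption: space-separated list of extra file extensions to include.
    additional-extensions: doc docx
```

With a layout like this, the commit and push step would presumably need to reference the sitemap at its location under docs rather than sitemap.xml in the repository root.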
Learn More
You can find more information about this GitHub Action in its GitHub repository:
cicirello / generate-sitemap
Generate an XML sitemap for a GitHub Pages site using GitHub Actions
generate-sitemap
Check out all of our GitHub Actions: https://actions.cicirello.org/
About
The generate-sitemap GitHub action generates a sitemap for a website hosted on GitHub Pages, and has the following features:
- Support for both xml and txt sitemaps (you choose using one of the action's inputs).
- When generating an xml sitemap, it uses the last commit date of each file to generate the <lastmod> tag in the sitemap entry. If the file was created during that workflow run, but not yet committed, then it instead uses the current date (however, we recommend if possible committing newly created files first).
- Supports URLs for html and pdf files in the sitemap, and has inputs to control the included file types (defaults include both html and pdf files in the sitemap).
- Now also supports including URLs for a user specified list of additional file extensions in the sitemap.
- …
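The first feature in the list above, choosing between xml and txt sitemaps, is controlled by one of the action's inputs. The excerpt does not name that input; the sketch below assumes it is called sitemap-format and accepts the value txt, so confirm the name and allowed values in the repository's README before relying on it.

```yaml
- name: Generate a plain-text sitemap
  uses: cicirello/generate-sitemap@v1
  with:
    base-url-path: https://www.example.com/
    # Assumption: the input name and value are not confirmed by this post.
    sitemap-format: txt
```

A plain-text sitemap lists one URL per line and has no place for <lastmod> dates, which is why the commit-date handling applies only to xml sitemaps. A commit step that follows would need to reference the txt file name instead of sitemap.xml.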
You can also find information about this GitHub Action, as well as others I've implemented and maintain, at the following site (which, by the way, is served via GitHub Pages and uses this action to generate its sitemap): https://actions.cicirello.org/
Where You Can Find Me
Follow me here on DEV:
Follow me on GitHub:
Vincent A Cicirello
View My Detailed GitHub Activity
If you want to generate the equivalent to the above for your own GitHub profile, check out the cicirello/user-statistician GitHub Action.
Or visit my website:
Top comments (1)
If you'd like to see samples of the sitemaps that this action generates, here are links to a couple.
This first example sitemap is a very small site with 6 pages: actions.cicirello.org/sitemap.xml
This next sample is from my personal website, whose sitemap has a couple hundred URLs: cicirello.org/sitemap.xml