Vincent A. Cicirello

Generate an XML Sitemap for a Static Website in GitHub Actions

I use GitHub Pages for my personal website, as well as for several project sites. Some static site generators support sitemap generation (e.g., Jekyll has a plugin for sitemaps), but my personal website is generated by a custom static site generator that I built for a few specialized reasons, and most of my project sites for Java libraries consist of a single hand-written HTML page combined with javadoc-generated documentation. So a while back I implemented a GitHub Action, generate-sitemap, that generates an XML sitemap by crawling a GitHub repository containing the HTML of the site.

The action uses the last commit date of each file to produce the <lastmod> tags. By default, it includes URLs for HTML and PDF files in the sitemap and skips all other file extensions in the repository, but you can configure it to include URLs for whatever file extensions you want. It checks the head of each HTML page for noindex meta tags and excludes such pages from the sitemap. It likewise excludes any file that matches a Disallow rule in your robots.txt. The generate-sitemap action can be configured in a few other ways as well (see the documentation in the GitHub repository for all details). It is implemented in Python as a container action.
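To make the two exclusion rules concrete, here is a minimal sketch of how a noindex check and a robots.txt Disallow check could be written with Python's standard library. This is an illustration only, not the action's actual source code; the names RobotsMetaParser, has_noindex, and is_disallowed are hypothetical, and the sketch assumes the rules of interest are those for the wildcard user agent.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical helper: notes a noindex robots meta tag found before </head>."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.past_head = False

    def handle_starttag(self, tag, attrs):
        # Only meta tags inside the head are of interest.
        if tag != "meta" or self.past_head:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            if "noindex" in attributes.get("content", "").lower():
                self.noindex = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.past_head = True


def has_noindex(html_text):
    """Return True if the page's head carries a robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.noindex


def is_disallowed(robots_txt, url_path):
    """Return True if url_path matches a Disallow rule for user agent '*'."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return not rules.can_fetch("*", url_path)
```

A page would be left out of the sitemap when either has_noindex or is_disallowed returns True for it.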

Table of Contents: This post is organized as follows:

  • Prerequisite Workflow Step
  • Example Workflow
  • Learn More
  • Where You Can Find Me

Prerequisite Workflow Step

For the <lastmod> dates to be determined correctly, the step that checks out your repository must set the optional fetch-depth: 0 input of actions/checkout so that the full git history is available, as in the following step:

    steps:
    - name: Checkout the repo
      uses: actions/checkout@v3
      with:
        fetch-depth: 0 
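To see why the full history matters, consider a rough sketch of how a last commit date can be looked up with git. This is not taken from the action's source; the function name last_commit_date and its run parameter are hypothetical, and the fallback to the current date simply mirrors the behavior the action documents for files that have not been committed yet.

```python
import subprocess
from datetime import datetime, timezone


def last_commit_date(path, run=subprocess.run):
    """Return the ISO 8601 committer date of the last commit that touched path.

    Falls back to today's date (UTC) when git reports no commit for the file,
    for example because it was created during the current workflow run.
    """
    result = run(
        ["git", "log", "-1", "--format=%cI", "--", path],
        capture_output=True,
        text=True,
        check=False,
    )
    stamp = result.stdout.strip()
    if stamp:
        return stamp
    return datetime.now(timezone.utc).date().isoformat()
```

In a shallow clone, git log only sees the most recent commit, so a lookup like this would report that one commit's date for every file. Fetching the full history with fetch-depth: 0 avoids that.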

Example Workflow

Here is an example workflow. It runs on pushes to the main branch and starts with the checkout step described above. By default, the generate-sitemap action assumes that the entire repository is the website (you can change that behavior with the input path-to-root). The most important input is probably base-url-path, which specifies the URL of the root of your site. This example workflow includes html and pdf files in the sitemap, which is the default. There are optional inputs for excluding either of these, and the optional input additional-extensions lets you include files of any other types you want in the sitemap.

name: Generate xml sitemap

on:
  push:
    branches: [ main ]

jobs:
  sitemap_job:
    runs-on: ubuntu-latest
    name: Generate a sitemap

    steps:
    - name: Checkout the repo
      uses: actions/checkout@v3
      with:
        fetch-depth: 0 

    - name: Generate the sitemap
      uses: cicirello/generate-sitemap@v1
      with:
        base-url-path: https://www.example.com/

    - name: Commit and push
      run: |
        # Commit only when sitemap.xml actually changed
        if [[ -n "$(git status --porcelain sitemap.xml)" ]]; then
          # Commit as the github-actions bot
          git config --global user.name 'github-actions'
          git config --global user.email '41898282+github-actions[bot]@users.noreply.github.com'
          git add sitemap.xml
          git commit -m "Automated sitemap update" sitemap.xml
          git push
        fi
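The file type filtering and the role of base-url-path can be pictured with a small sketch. It is only an approximation under stated assumptions: the function names should_include and to_url are hypothetical, and the real action may treat some paths differently than this simple mapping does.

```python
from pathlib import PurePosixPath

# Defaults described in this post: html and pdf files are included.
DEFAULT_EXTENSIONS = {"html", "pdf"}


def should_include(path, additional_extensions=()):
    """Return True if the file's extension is a default or an additional one."""
    extra = {ext.lower().lstrip(".") for ext in additional_extensions}
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return suffix in (DEFAULT_EXTENSIONS | extra)


def to_url(path, base_url_path):
    """Join a repository-relative file path onto the site's base URL."""
    relative = path[2:] if path.startswith("./") else path
    return base_url_path.rstrip("/") + "/" + relative.lstrip("/")
```

For example, with the base URL from the workflow above, to_url("./docs/api.html", "https://www.example.com/") gives https://www.example.com/docs/api.html.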

The generate-sitemap action doesn't commit and push, so you need a step in your workflow to do that. In the above example workflow, the last step uses a simple shell script to commit and push. This example does the commit as the github-actions bot. If you'd rather be the committer, then adjust that step as necessary. There are also actions in the GitHub Marketplace that can be used for the commit and push step if you prefer.

Learn More

You can find more information about this GitHub Action in its GitHub repository:

cicirello / generate-sitemap

Generate an XML sitemap for a GitHub Pages site using GitHub Actions

Check out all of our GitHub Actions: https://actions.cicirello.org/

The generate-sitemap GitHub action generates a sitemap for a website hosted on GitHub Pages, and has the following features:

  • Support for both xml and txt sitemaps (you choose using one of the action's inputs).
  • When generating an xml sitemap, it uses the last commit date of each file to generate the <lastmod> tag in the sitemap entry. If the file was created during that workflow run, but not yet committed, then it instead uses the current date (however, we recommend if possible committing newly created files first).
  • Supports URLs for html and pdf files in the sitemap, and has inputs to control the included file types (defaults include both html and pdf files in the sitemap).
  • Now also supports including URLs for a user-specified list of additional file extensions in the sitemap.
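As a rough picture of the two output formats in the feature list, the sketch below writes the same entries as an xml sitemap and as a txt sitemap. It follows the public sitemaps.org format rather than the action's own code, and the function names xml_sitemap and txt_sitemap are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def xml_sitemap(entries):
    """Build an xml sitemap from (url, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def txt_sitemap(entries):
    """Build a txt sitemap: one URL per line, with no lastmod information."""
    return "\n".join(loc for loc, _ in entries) + "\n"
```

The txt format has no place for dates, so the commit-date lookup only matters when generating the xml variant.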

You can also find information about this GitHub Action, as well as others I've implemented and maintain, at the following site (which, by the way, is served via GitHub Pages and uses this action to generate its sitemap):

Vincent Cicirello - Open source GitHub Actions for workflow automation

Features information on several open source GitHub Actions for workflow automation that we have developed to automate parts of the CI/CD pipeline, and other repetitive tasks. The GitHub Actions featured include jacoco-badge-generator, generate-sitemap, user-statistician, and javadoc-cleanup.

actions.cicirello.org

Where You Can Find Me

Follow me on GitHub:

cicirello / cicirello

My GitHub Profile

Vincent A. Cicirello

Sites where you can find me or my work:

  • Web and social media: Personal Website, LinkedIn, DEV Profile
  • Software development: GitHub, Maven Central, PyPI, Docker Hub
  • Publications: Google Scholar, ORCID, DBLP, ACM Digital Library, IEEE Xplore, ResearchGate, arXiv

My bibliometrics

My GitHub Activity

If you want to generate the equivalent to the above for your own GitHub profile, check out the cicirello/user-statistician GitHub Action.

Or visit my website:

Vincent A. Cicirello - Professor of Computer Science

Vincent A. Cicirello - Professor of Computer Science at Stockton University - is a researcher in artificial intelligence, evolutionary computation, swarm intelligence, and computational intelligence, with a Ph.D. in Robotics from Carnegie Mellon University. He is an ACM Senior Member, IEEE Senior Member, AAAI Life Member, EAI Distinguished Member, and SIAM Member.

cicirello.org

Top comments (1)

Vincent A. Cicirello

If you'd like to see samples of the sitemaps that this action generates, here are links to a couple.

This first example sitemap is a very small site with 6 pages: actions.cicirello.org/sitemap.xml

This next sample is from my personal website with a sitemap with a couple hundred URLs: cicirello.org/sitemap.xml