Henry Williams

Posted on May 6 • Edited on Oct 23

Caching PNPM Modules in Docker Builds in GitHub Actions

#webdev #ci #githubactions #docker

TL;DR

You can use reproducible-containers/buildkit-cache-dance to reuse cached files in your Docker build, together with an action like tespkg/actions-cache or actions/cache to persist the cache externally. See the GitHub repo for an example setup.

Jump to the Implementation section to see how everything is set up.

Introduction

Problem Statement

We all know how massive the dependency tree for NPM modules can get. And while PNPM provides significant performance improvements for local development, the same can't always be said for CI pipelines. For instance, in GitHub Actions, it's not uncommon for each build to install the same NPM modules every time since the jobs start with a clean state.

This problem is compounded by native modules that have to be compiled from source (typically with node-gyp), which only increases the time it takes to install the modules.

It's fairly straightforward to cache files in a standard CI pipeline, but it gets more complicated when the modules are installed as part of a Docker build.

Background

In my scenario, the primary problem with downloading and compiling the NPM modules every time wasn't the time wasted. The problem was that some native (compiled) dependencies didn't play well with the ARM-based CPUs we used, which are cheaper to run on AWS than their x64 counterparts. The result was flaky builds that failed around 30% of the time, and the only workaround was to manually re-run the build until it succeeded.

Caching the modules nearly eliminated the build failures. The approach worked well because the project's NPM modules only changed once or twice a week.

Other attempted approaches

Docker layer caching - required too much disk space and was very volatile
Base image with NPM modules - complex to implement since a new Docker image would have to be created every time the NPM modules change

Limitations

There are likely better ways to solve this problem, but given the size of the team and the urgency of the problem, we needed a solution that was relatively simple to implement, required minimal maintenance, and could be implemented sooner rather than later.

Deep Dive

As stated in the introduction, caching files in a standard CI pipeline is fairly straightforward. However, it's not as straightforward to do within Docker due to its limitations.

Docker limitations

As of writing (May 2024), Docker only supports externally caching layers, but not cache mounts; the cache mounts are only temporarily available during the build. So if, say, we install NPM modules, there's no native way to access the generated files from outside of Docker.

For instance, RUN --mount=type=cache,target=/pnpm_cache,rw will correctly cache the files in /pnpm_cache and reuse them between builds on the same machine. However, any state or files generated on a worker are cleared between runs in GitHub Actions, rendering the cache useless for this scenario.
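
To see this behaviour locally, here is a minimal shell sketch (an illustration only, not part of the original setup). It writes a throwaway Dockerfile that appends a line to a file inside a cache mount and prints the file, then builds it twice. The busybox image, the RUN_ID build argument, and both function names are assumptions chosen for the demo. On a machine that keeps its BuildKit state between builds, the second build should print both lines; on a fresh GitHub Actions runner the file would start out empty every time.

```shell
# Write a throwaway Dockerfile that uses a cache mount at /pnpm_cache.
# RUN_ID is a hypothetical build arg: changing it forces the RUN step to
# execute again instead of being served from the layer cache.
write_demo_dockerfile() {
  local dir="$1"
  cat > "$dir/Dockerfile" <<'EOF'
FROM busybox
ARG RUN_ID
RUN --mount=type=cache,target=/pnpm_cache \
    echo "written by build ${RUN_ID}" >> /pnpm_cache/history.txt && \
    cat /pnpm_cache/history.txt
EOF
}

# Build the demo twice against the same builder.
# Requires a Docker installation with BuildKit enabled.
build_twice() {
  local dir="$1"
  docker build --progress=plain --build-arg RUN_ID=1 "$dir"
  docker build --progress=plain --build-arg RUN_ID=2 "$dir"
}
```

Calling write_demo_dockerfile on a scratch directory and then build_twice on the same directory runs the experiment.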

The currently proposed solution is to allow Docker to bind the cache directory in the build to a directory on the host. This way the cache could be persisted externally. However, this issue has been open for almost 4 years (since May 27, 2020) with no clear answer as to whether it'll be implemented any time soon.

This is where the reproducible-containers/buildkit-cache-dance GitHub Action comes to the rescue! This Action extracts the files from the Docker build so they can be persisted in external storage such as S3, and it is the approach recommended in the official Docker documentation.

Solution

The solution is to use the reproducible-containers/buildkit-cache-dance GitHub Action to extract / inject the cache generated by the Docker build and then use tespkg/actions-cache to save the cache in S3.

Workflow

After running a Docker build, reproducible-containers/buildkit-cache-dance extracts the files from the mounted directory and copies them to a directory on the host machine so they can be accessed outside of Docker.
tespkg/actions-cache uploads the cache to S3. The cached files are compressed, so the archive is much smaller (roughly 10-20% of the size of the extracted files). In my experience, ~3GB of PNPM cache data compresses to less than 300MB.

[Cache hit scenario]

tespkg/actions-cache downloads cache from S3 and extracts the contents into the provided directory
reproducible-containers/buildkit-cache-dance grabs the files from the provided directory and injects them into the Docker build
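
The data flow in these steps can be sketched as a toy shell simulation in which plain directories stand in for the real components: one folder plays the S3 bucket, another plays the BuildKit cache mount, and cp and tar replace the two Actions. All function names and paths are hypothetical, and gzip is only an assumption about the archive format; the real Actions do considerably more.

```shell
# Toy stand-ins: $bucket plays S3, $mount plays the BuildKit cache mount,
# $host plays the runner directory shared between the two Actions.

restore_cache() {  # tespkg/actions-cache, restore phase
  local bucket="$1" key="$2" host="$3"
  [ -f "$bucket/$key.tgz" ] || return 1
  mkdir -p "$host" && tar -xzf "$bucket/$key.tgz" -C "$host"
}

save_cache() {     # tespkg/actions-cache, save phase (compressed archive)
  local bucket="$1" key="$2" host="$3"
  mkdir -p "$bucket" && tar -czf "$bucket/$key.tgz" -C "$host" .
}

inject_cache() {   # buildkit-cache-dance: host directory -> cache mount
  local host="$1" mount="$2"
  mkdir -p "$mount" && cp -a "$host/." "$mount/"
}

extract_cache() {  # buildkit-cache-dance: cache mount -> host directory
  local mount="$1" host="$2"
  mkdir -p "$host" && cp -a "$mount/." "$host/"
}

fake_build() {     # stands in for `pnpm install` filling the store
  local mount="$1"
  mkdir -p "$mount"
  [ -f "$mount/store.txt" ] || echo "downloaded" > "$mount/store.txt"
}

# One CI job: $work is a fresh directory, $bucket survives between jobs.
run_job() {
  local bucket="$1" key="$2" work="$3"
  local host="$work/host" mount="$work/mount"
  if restore_cache "$bucket" "$key" "$host"; then
    inject_cache "$host" "$mount"
    fake_build "$mount"
    echo "cache hit"
  else
    fake_build "$mount"
    extract_cache "$mount" "$host"
    save_cache "$bucket" "$key" "$host"
    echo "cache miss"
  fi
}
```

Running run_job twice with the same bucket and key but a new work directory each time mimics two consecutive CI runs: the first populates the bucket, the second restores from it.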

Implementation

Requirements

AWS S3 bucket
AWS IAM user with access to the created bucket
Dockerfile to build your project
GitHub Action to build the Docker image

Example setup

Below is a simple setup for caching to S3. I've also set up a GitHub repository with the full setup.

GitHub workflow

---
name: Build
on:
  push:

jobs:
  Build:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/metadata-action@v5
        id: meta
        with:
          images: Build

      - name: Cache (S3)
        uses: tespkg/actions-cache@v1
        id: cache
        with:
          bucket: ${{ vars.CACHE_BUCKET }}
          accessKey: ${{ vars.AWS_ACCESS_KEY }}
          secretKey: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          ## Fallback to GitHub cache if saving / restoring from S3 fails
          use-fallback: true
          path: |
            pnpm
          key: pnpm-cache-${{ hashFiles('pnpm-lock.yaml') }}
          restore-keys: |
            pnpm-cache-

      - name: inject cache into docker
        uses: reproducible-containers/buildkit-cache-dance@v3.1.0
        with:
          cache-map: |
            {
              "pnpm": "/pnpm"
            }
          # Skip extraction if cache was hit to avoid unnecessary I/O. This can take minutes for projects with a lot of dependencies.
          skip-extraction: ${{ steps.cache.outputs.cache-hit }}

      - name: Build
        uses: docker/build-push-action@v5
        with:
          context: .
          cache-from: type=gha
          cache-to: type=gha,mode=max
          file: Dockerfile
          push: false
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
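
The key and restore-keys inputs in the cache step above decide which archive gets restored. As a rough shell sketch of that lookup, assuming restore keys behave as documented for actions/cache (an exact key match wins, otherwise the most recently created entry with a matching prefix is used), the logic looks something like this. The function names and the flat directory of .tgz entries are hypothetical, and the hash is a plain SHA-256 of the lockfile rather than the exact value hashFiles produces.

```shell
# Build a key like pnpm-cache-<sha256 of the lockfile>.
# Uses sha256sum from GNU coreutils (on macOS, `shasum -a 256` is the equivalent).
cache_key() {
  local lockfile="$1"
  printf 'pnpm-cache-%s\n' "$(sha256sum "$lockfile" | cut -d' ' -f1)"
}

# Print the name of the entry to restore: the exact key if it exists,
# otherwise the newest entry whose name starts with the restore prefix.
# Returns 1 when nothing matches.
resolve_cache() {
  local dir="$1" key="$2" prefix="$3" newest
  if [ -f "$dir/$key.tgz" ]; then
    echo "$key"
    return 0
  fi
  # ls -t sorts by modification time, newest first; entry names here are
  # simple (prefix plus hash), so parsing ls output is acceptable for a sketch.
  newest="$(ls -t "$dir/$prefix"*.tgz 2>/dev/null | head -n 1 || true)"
  [ -n "$newest" ] || return 1
  basename "$newest" .tgz
}
```

With this scheme, a change to pnpm-lock.yaml produces a new key, so the first build afterwards falls back to the latest older archive and only has to fetch the packages that changed.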

Dockerfile

Note that the target directory used when mounting the cache is the same as the directory specified in the cache-map provided to reproducible-containers/buildkit-cache-dance in the workflow definition, since that's where the cache is injected to and extracted from.

FROM node:21-slim AS base

ENV PNPM_HOME="/pnpm"

RUN corepack enable

# Copy your application code
WORKDIR /app
COPY . .

# Log for troubleshooting. There should be files in the directory when there's a cache hit
RUN --mount=type=cache,target=${PNPM_HOME} echo "PNPM contents before install: $(ls -la ${PNPM_HOME})"

### This is where the magic happens! The cache has been mounted to `$PNPM_HOME` so it can be accessed during the build ####
RUN --mount=type=cache,target=${PNPM_HOME} \
  pnpm config set store-dir ${PNPM_HOME} && \
  pnpm install --frozen-lockfile --prefer-offline

# Another log for troubleshooting. This should never be empty since the NPM modules were installed before running this line
RUN --mount=type=cache,target=${PNPM_HOME} echo "PNPM contents after install: $(ls -la ${PNPM_HOME})"

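# Use the same base image as the build stage so natively compiled modules match the runtime (glibc vs. musl, Node.js version)
FROM node:21-slim AS prod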

WORKDIR /app

COPY --from=base /app/node_modules /app/node_modules
COPY --from=base /app .

CMD ["npm", "start"]

Cache in action

Cache miss example

Cache hit example

File saved to S3

Conclusion

Although caching NPM modules inside the Docker build worked for my use case, it might not be the best option for you. Because of the time it takes to download the cache, inject the files into the Docker build, and extract them again afterwards, this approach will likely not yield any performance improvement over simply installing the modules without a cache.

However, if you're looking to solve build failures due to something like compilation failures or NPM rate-limit issues, then caching is a viable solution.

Caveat

Like anything in software development, this approach may become outdated. Look up the latest information on "preserving cache mounts in Docker" in case this has changed.