Aggressive dependency caching in GitHub Actions

There are three things you can watch forever: fire burning, water falling, and a build working its way through the pipeline stages after the next commit. To make the wait less tedious, it pays to get the CI setup right from the beginning.

GitHub Actions has a cache that can be restored onto the runner's virtual machine in seconds. In this article I'd like to share examples of how to set up aggressive dependency caching. Why "aggressive"? Because we will cache not only the package archives but also the state of the environment after installation.

For Node.js it will be the node_modules directory, and for Python it will be the virtualenv directory with installed dependencies.

Node.js Example

Let's start with the typical dependency caching setup from the documentation. If you don't need anything exotic, you can use the standard actions/setup-node action and specify a package manager.

steps:
- uses: actions/checkout@v3
- uses: actions/setup-node@v3
  with:
    node-version: 16
    cache: 'npm'
- run: npm ci
- run: npm test

This will save the ~/.npm directory with the global package cache. Sounds great! But remember that npm ci still has to run in every job that needs the dependencies, and with several such jobs in a workflow that time adds up.
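For reference, the built-in cache: 'npm' option behaves roughly like an explicit actions/cache step on npm's download cache. The sketch below follows the npm pattern from the actions/cache examples. The key prefix is an arbitrary choice for illustration, not the key that setup-node generates internally, and the ~/.npm path assumes a Linux or macOS runner.

```yaml
steps:
- uses: actions/checkout@v3
- uses: actions/setup-node@v3
  with:
    node-version: 16

# Assumption: npm's download cache lives in ~/.npm (the default on Linux and macOS).
- name: Cache npm download cache
  uses: actions/cache@v3
  with:
    path: ~/.npm
    # Illustrative key, not the one setup-node computes internally.
    key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-npm-

# npm ci still removes node_modules and reinstalls every package;
# it is only faster because the tarballs come from ~/.npm instead of the network.
- run: npm ci
- run: npm test
```

Either way, only the downloaded archives are cached, so the install step itself is repeated in every job.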

Let's imagine a pipeline with several jobs:

                         │
          Create         │                        Reuse
          Dependencies   │                        Dependencies
          Cache          │
                         │  ┌────────────────────┐
                ┌────────┼──►     Lint  Job      ├────────────────────┐
                │        │  └────────────────────┘                    │
                │        │                                            │
                │        │                                            │
┌───────────────┴────┐   │  ┌────────────────────┐                 ┌──▼─────────────────┐
│     Build Job      ├───┼──►     Test  Job      ├─────────────────►     Deploy Job     │
└───────────────┬────┘   │  └────────────────────┘                 └──▲─────────────────┘
                │        │                                            │
                │        │                                            │
                │        │  ┌────────────────────┐                    │
                └────────┼──►     E2E   Job      ├────────────────────┘
                         │  └────────────────────┘
                         │
                         │
                         │
                         │

Ideally, we want to install dependencies only in the first job and get a state with available dependencies in all subsequent jobs.
I'll show how to achieve it using a sample repo — redux-react-realworld-example-app.
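The job graph from the diagram could be sketched at the workflow level roughly as follows. The workflow name, the trigger, and the placeholder echo steps are assumptions for illustration and are not taken from the sample repo; only the needs relationships mirror the diagram.

```yaml
name: CI            # hypothetical workflow name
on: push            # hypothetical trigger

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - run: echo "install dependencies, save the cache, build"   # placeholder

  lint:
    needs: build    # starts only after build has finished and saved the cache
    runs-on: ubuntu-latest
    steps:
    - run: echo "restore the cache, lint"                       # placeholder

  test:
    needs: build
    runs-on: ubuntu-latest
    steps:
    - run: echo "restore the cache, test"                       # placeholder

  e2e:
    needs: build
    runs-on: ubuntu-latest
    steps:
    - run: echo "restore the cache, run E2E tests"              # placeholder

  deploy:
    needs: [lint, test, e2e]   # waits for all three middle jobs
    runs-on: ubuntu-latest
    steps:
    - run: echo "deploy"                                        # placeholder
```

The needs ordering matters here: actions/cache saves its entry at the end of the job, so the downstream jobs can only restore it once the build job has completed.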

The first (build) job might look like this:

steps:
- uses: actions/checkout@v3
- uses: actions/setup-node@v3
  with:
    node-version-file: '.nvmrc' # (1)
    cache: 'npm'

- name: Cache NPM dependencies # (2)
  uses: actions/cache@v3
  id: cache-primes
  with:
    path: node_modules
    key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}

- name: Install dependencies # (3)
  if: steps.cache-primes.outputs.cache-hit != 'true'
  run: npm ci

- name: Build
  run: npm run build

Line #1 specifies the Node.js version using the .nvmrc file. This is an alternative way to specify the version, and it helps follow the DRY (Don't Repeat Yourself) principle.

In line #2 we use actions/cache to cache the node_modules directory. We use the hash from the package-lock.json file as the key.

In line #3 we install dependencies only on a cache miss, that is, when no cache entry exists for the current key.
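One possible refinement, not part of the sample repo: node_modules can contain native addons compiled for a specific Node.js version, so the Node.js version could be added to the key as well. The sketch below assumes that the actions/setup-node release in use exposes the resolved version through a node-version output, and the setup-node step id is a name introduced here for illustration.

```yaml
- uses: actions/setup-node@v3
  id: setup-node                  # hypothetical id, needed to read the output below
  with:
    node-version-file: '.nvmrc'
    cache: 'npm'

- name: Cache NPM dependencies
  uses: actions/cache@v3
  id: cache-primes
  with:
    path: node_modules
    # Assumption: setup-node provides the installed version as the `node-version` output.
    key: ${{ runner.os }}-node-${{ steps.setup-node.outputs.node-version }}-${{ hashFiles('package-lock.json') }}

- name: Install dependencies
  if: steps.cache-primes.outputs.cache-hit != 'true'
  run: npm ci
```

With this key, bumping the version in .nvmrc produces a cache miss and a fresh npm ci instead of reusing modules built for the old runtime.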

To automatically retrieve node_modules in subsequent jobs, declare actions/cache with the same path and key. For example, the test job can be configured as:

steps:
- uses: actions/checkout@v3
- uses: actions/setup-node@v3
  with:
    node-version-file: '.nvmrc'
    cache: 'npm'

- name: Cache NPM dependencies
  uses: actions/cache@v3
  with:
    path: node_modules
    key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }} # (1)

- name: Tests
  run: npm run test # (2)

Line #1 specifies the cache key. The key must be the same as in the build job. After the actions/cache step we assume that the dependencies are installed and run the tests in line #2.
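The test job above assumes that the cache entry is always there. If it was evicted or never saved, the tests fail with missing-module errors. A more defensive variant, sketched here as an assumption rather than the sample repo's setup, uses the restore-only actions/cache/restore action with the fail-on-cache-miss input, both of which are available in recent v3 releases of actions/cache.

```yaml
steps:
- uses: actions/checkout@v3
- uses: actions/setup-node@v3
  with:
    node-version-file: '.nvmrc'
    cache: 'npm'

- name: Restore NPM dependencies
  uses: actions/cache/restore@v3   # restores only, never saves a new entry
  with:
    path: node_modules
    key: ${{ runner.os }}-node-${{ hashFiles('package-lock.json') }}
    # Stop here with a clear error instead of failing later inside the tests.
    fail-on-cache-miss: true

- name: Tests
  run: npm run test
```

An alternative design is to keep actions/cache with an id and add the same conditional npm ci step as in the build job, which trades a clear failure for a slower but self-healing job.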

Check out the complete workflow on GitHub.

Python Example

The standard scenario from the actions/setup-python docs:

steps:
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
  with:
    python-version: '3.9'
    cache: 'pip' # caching pip dependencies
- run: pip install -r requirements.txt

This workflow will cache pip packages in ~/.cache/pip, but the installation step will always be performed, as in the previous example with npm ci.

Let's see how we can optimize the installation of dependencies. I'll use the Django-based education-backend repo.

Let's dive into the build job:

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v4
  id: setup-python
  with:
    python-version-file: '.python-version'

- uses: actions/cache@v3
  id: cache-primes # referenced by the condition on the install step
  with:
    path: venv
    key: ${{ runner.os }}-venv-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/*requirements.txt') }} # (1)

- name: Install dependencies # (2)
  if: steps.cache-primes.outputs.cache-hit != 'true'
  run: |
    python -m venv venv
    . venv/bin/activate
    pip install --upgrade pip pip-tools
    pip-sync requirements.txt dev-requirements.txt

- name: Run the linter
  run: |
    . venv/bin/activate # (3)
    cp src/app/.env.ci src/app/.env
    make lint

As you can see, we use the same caching idea as for the Node.js project, with a few minor but important changes. The cache key must be distinct for each Python version involved in the workflow. The steps.setup-python.outputs.python-version variable in line #1 serves exactly this purpose.
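To illustrate why the Python version belongs in the key, here is a sketch of a build job that runs on a version matrix. The matrix and its versions are hypothetical and are not part of the education-backend workflow, which reads a single version from .python-version.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.9', '3.10']   # hypothetical versions
    steps:
    - uses: actions/checkout@v3

    - uses: actions/setup-python@v4
      id: setup-python
      with:
        python-version: ${{ matrix.python-version }}

    - uses: actions/cache@v3
      id: cache-primes
      with:
        path: venv
        # Each matrix leg resolves to a different python-version output,
        # so each leg gets its own cache entry for the venv directory.
        key: ${{ runner.os }}-venv-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/*requirements.txt') }}
```

Without the version in the key, both legs would share one entry, and a virtual environment created by one interpreter would be restored for the other.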

Installing dependencies in line #2 is a bit trickier. For Python we use a virtual environment created with the venv module. The environment directory venv is what gets cached. You can think of it as node_modules for Node.js.

Line #3 has one more trick. Each run step starts in a fresh shell, so even when the venv directory comes from the cache, the virtualenv has to be activated in every step that uses it. Otherwise, the Python interpreter will not find the installed libraries.
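If repeating the activation line in every step becomes tedious, one alternative is to put the virtualenv on PATH once through the GITHUB_PATH and GITHUB_ENV files. This is a sketch under assumptions, not what the education-backend workflow does: it assumes a Linux or macOS runner (venv/bin layout), the default working directory, and that make lint picks up tools from PATH.

```yaml
# Runs on every build, after the cache and install steps, with no `if` condition.
- name: Put the virtualenv on PATH
  run: |
    # Make the environment visible to all following steps in this job.
    echo "VIRTUAL_ENV=$PWD/venv" >> "$GITHUB_ENV"
    echo "$PWD/venv/bin" >> "$GITHUB_PATH"

- name: Run the linter
  run: |
    cp src/app/.env.ci src/app/.env
    make lint
```

The change only takes effect in the steps that follow, and it has to be repeated in each job, because every job runs on a fresh runner.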

The simplified test job may look as follows:

steps:
- uses: actions/checkout@v3

- uses: actions/setup-python@v4
  id: setup-python
  with:
    python-version-file: '.python-version'

- uses: actions/cache@v3
  with:
    path: venv
    key: ${{ runner.os }}-venv-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/*requirements.txt') }} # (1)

- name: Run the tests
  run: |
    . venv/bin/activate # (2)
    cp src/app/.env.ci src/app/.env
    make test

Ensure you use the same key for the caching step (line #1) and remember to activate the virtual environment before running the tests (line #2).

Check out the complete workflow on GitHub.

Summary

We've practiced "aggressive caching" with Node.js and Python examples. As long as your project has a significant number of dependencies, these changes can speed up your GitHub workflow noticeably. I recommend trying this setup in your own project, using the workflows from the two repositories mentioned above as references.

If you still have questions about caching in GitHub Actions, don't hesitate to ask in the comments. I'll try to help.
I would also be grateful if you shared your own tips on how to speed up workflows on GitHub.