DEV Community

epassaro

Reduce your build times on GitHub Actions by caching Anaconda environments

Overview

One of the most time-consuming tasks in my workflows is solving, downloading, and installing Anaconda environments. In some cases, solving the dependencies alone can take up to 10 minutes, depending on the platform you are building on.

That's why I'm always looking for ways to speed up my workflows. For example, a well-known method is to use the blazing-fast mamba package manager instead of conda.

mamba is written in C++, downloads files in parallel, and uses libsolv (a state-of-the-art library used in the RPM package manager of Red Hat, Fedora and openSUSE) for much faster dependency solving.

But usually this is not fast enough for me. Also, I find it a waste of resources to download the packages every time a collaborator pushes a commit to a pull request. For example, in the open source project I collaborate on, the CI pipeline can be triggered more than a hundred times in a single day.

That's why I always wanted to cache the Anaconda environment, but I didn't have the time to solve the issue until now.

The documentation of the actions/cache action includes examples for many package managers, but not for Anaconda. On the other hand, the documentation of the setup-miniconda action describes a way to cache the downloaded packages, but currently that makes the pipeline even slower.
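For reference, the package-cache approach described in the setup-miniconda documentation looks roughly like the sketch below. It caches the downloaded package archives, not the installed environment, so the solve and install steps still run on every build. The file name environment.yml and the environment name my-env are assumptions borrowed from the rest of this post; check the action's README for the current form.

```yaml
# Sketch of the package-cache approach (not the method used in this post).
steps:
  - uses: actions/checkout@v2

  - name: Cache downloaded conda packages
    uses: actions/cache@v2
    with:
      # directory where setup-miniconda keeps downloaded package archives
      path: ~/conda_pkgs_dir
      key: ${{ runner.os }}-conda-${{ hashFiles('environment.yml') }}

  - uses: conda-incubator/setup-miniconda@v2
    with:
      activate-environment: my-env
      environment-file: environment.yml
      # the action's documentation asks for this when caching packages
      use-only-tar-bz2: true
```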

The cache action

It's important to understand the scope of the cache action. From GitHub's documentation:

A workflow can access and restore a cache created in the current branch, the base branch (including base branches of forked repositories), or the default branch (usually main). For example, a cache created on the default branch would be accessible from any pull request. Also, if the branch feature-b has the base branch feature-a, a workflow triggered on feature-b would have access to caches created in the default branch (main), feature-a, and feature-b.

My workflow

In this post I'm going to show how to write an example CI pipeline with the following features:

  • Runs on the three major operating systems (Linux, macOS and Windows)
  • Updates cache every 24 hours
  • Updates cache when environment.yml is modified
  • Cache can be reset manually

Let's get started!

Triggers

We want a pipeline that is triggered when:

  • A commit is pushed to any branch of the main repository
  • A commit is pushed to a pull request
  • Every day at 00:00 UTC

name: ci

on:
  push:
    branches:
      - '**'  # '**' also matches branch names that contain a slash

  pull_request:
    branches:
      - '**'

  schedule:
    - cron: '0 0 * * *'  # every day at 00:00 UTC

env:
  CACHE_NUMBER: 0  # increase to reset cache manually

The CACHE_NUMBER variable is going to be used later.

Prefixes

We need to set up a matrix to handle the different installation paths of Mambaforge* on each operating system:

jobs:
  build:

    strategy:
      matrix:
        include:
          - os: ubuntu-latest
            label: linux-64
            prefix: /usr/share/miniconda3/envs/my-env

          - os: macos-latest
            label: osx-64
            prefix: /Users/runner/miniconda3/envs/my-env

          - os: windows-latest
            label: win-64
            prefix: C:\Miniconda3\envs\my-env

* Mambaforge is a Miniforge variant (a Miniconda-like installer) that comes with the mamba package manager pre-installed and conda-forge as the default channel.

Install Mambaforge

At the step level, we install Mambaforge without specifying a YAML environment file.

    name: ${{ matrix.label }}
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2

      - name: Setup Mambaforge
        uses: conda-incubator/setup-miniconda@v2
        with:
          miniforge-variant: Mambaforge
          miniforge-version: latest
          activate-environment: my-env
          use-mamba: true
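The prefixes in the matrix above depend on where the installer places environments on each runner image, so they are worth verifying. Here is a minimal sketch of an optional step that prints the environments conda knows about, so the reported path of my-env can be compared with matrix.prefix. The step name is hypothetical, and the step would go right after the Setup Mambaforge step.

```yaml
      # Optional check (hypothetical step): list conda environments and their paths.
      - name: Show environment prefix
        shell: bash -l {0}  # login shell, so the conda setup from setup-miniconda is loaded
        run: |
          conda env list
          echo "Expected prefix: ${{ matrix.prefix }}"
```

If the printed path of my-env differs from the expected prefix, the cache step would not be saving the environment, and the matrix entry needs to be adjusted.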

Cache

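The cache action works with keys. When it runs, it looks for a saved cache that matches the key and restores the data to the given path.

The cache is specific to each OS. Also, I set up the key so that the cache is refreshed every 24 hours or whenever the environment file changes: the key contains the current date and a hash of environment.yml, so a new day or a modified file produces a key with no matching cache. Scheduled workflows run on the default branch, so the daily run creates a cache that pull requests can then restore.

Increasing the CACHE_NUMBER variable defined above also changes the key, which resets the cache manually.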

      - name: Set cache date
        # run in bash on every OS; the default Windows shell is PowerShell, which needs a different syntax
        shell: bash
        run: echo "DATE=$(date +'%Y%m%d')" >> "$GITHUB_ENV"

      - uses: actions/cache@v2
        id: cache
        with:
          path: ${{ matrix.prefix }}
          key: ${{ matrix.label }}-conda-${{ hashFiles('environment.yml') }}-${{ env.DATE }}-${{ env.CACHE_NUMBER }}
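To make the key easier to reason about, the pieces it is built from can be printed in the job log. Here is a minimal sketch of an optional debugging step (the step name is hypothetical). It assumes the steps above, including the cache step with id cache, have already run.

```yaml
      # Optional debugging step (hypothetical): print the parts of the cache key.
      - name: Show cache key parts
        shell: bash
        run: |
          echo "label:        ${{ matrix.label }}"
          echo "env hash:     ${{ hashFiles('environment.yml') }}"
          echo "date:         ${{ env.DATE }}"
          echo "cache number: ${{ env.CACHE_NUMBER }}"
          echo "cache hit:    ${{ steps.cache.outputs.cache-hit }}"
```

On an exact key match the last line prints true. Changing any of the other four values produces a different key, and therefore a fresh cache.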

Update the environment

Finally, if no cache was restored, we update the environment from the YAML environment file, and then run the tests.

      - name: Update environment
        run: mamba env update -n my-env -f environment.yml
        if: steps.cache.outputs.cache-hit != 'true'  # skip when the cache was restored

      - name: Run tests
        shell: bash -l {0}  # login shell, so that my-env is activated
        run: pytest ./tests

Results

Even though our environment.yml file is very simple, we saved 5 minutes per run on average.

[Figure: results]

Get the code

The code is available in the GitHub repository epassaro/cache-conda-envs ("Speed up your builds by caching Anaconda environments on GitHub Actions").
