Shalvah

Posted on Mar 5, 2021 • Edited on Aug 18, 2023 • Originally published at blog.shalvah.me

Understanding Lockfiles

#dependencymanagement #engineeringconcepts

Lockfiles are common in many dependency management systems today — package-lock.json, composer.lock, and so on. We often don't think much about them, but they're a key part of our software development workflows. To understand how important lockfiles are, we need to understand the problem of reproducible builds.

Reproducible builds

A build is reproducible (aka "deterministic", "idempotent", or "predictable") when it can be run at different times on different machines and will always yield the same results. In terms of dependency management, a reproducible build means that every time and everywhere you install your dependencies (composer install, bundle install, etc), you should get the same versions of your dependencies.

Why do we need reproducible builds? The first reason is stability. If npm install on your machine gives you version 2.4 of my-awesome-package, but version 2.5 on a different machine, things might not behave the same. Software is fragile. Even if there aren't any breaking changes between 2.4 and 2.5, there might be changes in behaviour that affect your app. For instance, version 2.5 might have changed the implementation or signature of a function in a way that breaks your app, or performs worse than 2.4.

A simple example (inspired by Don't use functions as callbacks unless they're designed for it):

const dates = [new Date("2020-11-12 00:00"), new Date("2021-03-05 14:06")];

(function v2_4() {
  // In version 2.4, toFriendlyDate() takes a single argument, a date
  function toFriendlyDate(date) {
    return date.toLocaleString('en-US');
  }
  // So this is fine
  console.log(dates.map(toFriendlyDate));
  // Result: ["11/12/2020, 12:00:00 AM", "3/5/2021, 2:06:00 PM"]
})();

(function v2_5() {
  // In version 2.5, a second argument is added to represent the locale
  function toFriendlyDate(date, locale = 'en-US') {
    return date.toLocaleString(locale);
  }

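  // Array.prototype.map calls its callback with (element, index, array),
  // so the index is now received as `locale`. This leads to unexpected results:
  console.log(dates.map(toFriendlyDate));
  // Result (e.g. on a machine whose default locale is en-GB): ["12/11/2020, 00:00:00", "05/03/2021, 14:06:00"]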
})();

Here, a simple backwards-compatible change in a dependency has resulted in our app displaying dates differently, without us changing our code. On a teammate's machine, it uses the first display, but on yours it shows differently. On production, it might even show one sometimes and the other at times (if you're using multiple production servers). Imagine trying to debug that! We'd probably check to see if we made any code changes, not knowing a simple dependency drift is responsible.

This is a pretty tame example. Security is also a concern: an attacker might get access to the package and release a version 2.6 that acts the same, but secretly reads your access tokens and sends them to the attacker. This has happened before. With non-locked builds, it's easier for this malicious version to make its way into your codebase.

Locking dependencies

So how do lockfiles solve this? They help us "lock down" our dependencies. When you run composer install for the first time in a project, it looks at the dependencies you declared in composer.json, figures out the exact versions you want and then locks them down in composer.lock. It will do this for all dependencies used in your project, including indirect dependencies (the dependencies of your dependencies).

On future installs, it uses this lock file to fetch dependencies, ensuring that those exact versions are installed.

For example, here's a project where I specify shalvah/clara as a dependency in my composer.json.

{
    // ...
    "require": {
        "shalvah/clara": "^2.2"
    }
}

Composer talks to packagist.org to figure out what versions Clara has available, then uses my declared version to decide on an option that satisfies my app and any other dependencies that want Clara, then installs this and saves to the lockfile. Here's a snippet from the resulting composer.lock:

{
    "packages": [
        {
            "name": "shalvah/clara",
            "version": "2.6.0",
            "source": {
                "type": "git",
                "url": "https://github.com/shalvah/clara.git",
                "reference": "f1d8a36da149b605769ef86286110e435a68d9ac"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/shalvah/clara/zipball/f1d8a36da149b605769ef86286110e435a68d9ac",
                "reference": "f1d8a36da149b605769ef86286110e435a68d9ac",
                "shasum": ""
            },
            "require": {
                "php": ">=7.2.5",
                "symfony/console": "^4.0|^5.0"
            }
        },
        {
            "name": "symfony/console",
            "version": "v5.2.1",
            "source": {
                "type": "git",
                "url": "https://github.com/symfony/console.git",
                "reference": "47c02526c532fb381374dab26df05e7313978976"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/symfony/console/zipball/47c02526c532fb381374dab26df05e7313978976",
                "reference": "47c02526c532fb381374dab26df05e7313978976",
                "shasum": ""
            },
            "require": {
                //...
            }
        }
    ]
}

There's actually a lot more in that file, but let's focus on the version locking for now.

You can see that Composer has resolved our shalvah/clara version range (^2.2) to an exact version 2.6.0, with a URL that points to the Git source of that version, and another pointing to a zip file of the same version. And shalvah/clara itself depends on symfony/console in the range ^4.0|^5.0, and Composer resolved that to 5.2.1, along with the corresponding URLs. It does the same thing for all symfony/console's dependencies, and so on. This way, when we add the composer.lock file to our project, and run composer install on a new machine, we'll get these exact versions. Nice!
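To make the resolution step concrete, here is a minimal sketch of turning a caret range into one locked version. It is not Composer's actual resolver, which is far more involved; the function names and the list of available versions are made up for illustration, and it only handles ranges of the form ^MAJOR.MINOR[.PATCH] with a major version of 1 or higher.

```javascript
// Simplified sketch: only understands "^MAJOR.MINOR[.PATCH]" with MAJOR >= 1.
function parseVersion(version) {
  const [major, minor = 0, patch = 0] = version.replace(/^v/, '').split('.').map(Number);
  return { major, minor, patch };
}

function compareVersions(a, b) {
  return a.major - b.major || a.minor - b.minor || a.patch - b.patch;
}

// "^2.2" means: same major version, and at least 2.2.0.
function satisfiesCaret(version, range) {
  const min = parseVersion(range.slice(1));
  const candidate = parseVersion(version);
  return candidate.major === min.major && compareVersions(candidate, min) >= 0;
}

// Pick the highest available version that satisfies the range, or null if none does.
function resolve(range, availableVersions) {
  const candidates = availableVersions
    .filter((version) => satisfiesCaret(version, range))
    .sort((a, b) => compareVersions(parseVersion(a), parseVersion(b)));
  return candidates.length > 0 ? candidates[candidates.length - 1] : null;
}

// Hypothetical list of published versions (not the real release history).
const available = ['2.1.0', '2.2.0', '2.5.3', '2.6.0', '3.0.0'];

// The "lockfile" is just the resolved result written down.
const lock = { 'shalvah/clara': resolve('^2.2', available) };
console.log(lock); // { 'shalvah/clara': '2.6.0' }
```

Once the resolved version is written down, later installs can skip the resolution step entirely and fetch exactly what the lockfile says.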

Why use ranges in the first place?

You might be wondering, if we want exact versions, why do we even use version ranges in the first place? Why not just declare that we want shalvah/clara at exactly 2.6.0, so Composer doesn't need to do any locking?

You could absolutely do that, but a good reason not to is compatibility. If we declare fixed versions in our app, what happens when we pull in a library that depends on a different version of one of our dependencies? Suppose we require symfony/console at exactly 5.1, and shalvah/clara wants symfony/console at exactly 4.4. A platform like npm might be able to handle that (with some difficulty), but Composer would likely fail to install, because that would mean having conflicting class names. When we write Symfony\Component\Console\Application, there will be no way to decide which version we're referring to.

This is why libraries use version ranges: by declaring that it works with version 4 or 5 of symfony/console, shalvah/clara has extended its compatibility. And if we want our apps to be compatible as well, using a version range is a safe bet.

There's also the issue of convenience. Using a version range allows us to make changes more easily. If Composer locked symfony/console to 4.2.1, and a security issue is discovered and fixed in 4.2.2, with a version range like ^4.2, we could simply run composer update, and Composer would re-lock at 4.2.2. If we had locked it to 4.2.1 in our composer.json, we'd have to manually change that to 4.2.2, and do so each time we wanted to update any dependencies.

Differences across package managers

Like with anything in programming, the idea of lockfiles and deterministic installs is implemented in different ways across different platforms. I started out writing with npm in mind, but I switched to Composer because its behaviour is stricter: composer install will install exactly from the lockfile if it's present. Even if you add a new dependency to your composer.json, composer install will only install from composer.lock. To update your dependencies based on composer.json, you must run composer update.
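The split between installing and updating can be sketched as two small functions. This is only a toy model of the behaviour described above, not Composer's code: the newestMatching table stands in for "whatever the registry would resolve each range to right now", and every name in it is hypothetical.

```javascript
// Hypothetical registry state: the newest version currently matching each declared range.
let newestMatching = { 'symfony/console': '4.2.1' };

// "update": re-resolve every declared dependency and produce a fresh lockfile.
function update(manifest) {
  const lockfile = {};
  for (const name of Object.keys(manifest.require)) {
    lockfile[name] = newestMatching[name];
  }
  return lockfile;
}

// "install": reuse the lockfile if there is one; only resolve when there is none.
function install(manifest, lockfile) {
  return lockfile ? lockfile : update(manifest);
}

const manifest = { require: { 'symfony/console': '^4.2' } };

const firstInstall = install(manifest, null);         // no lockfile yet, so resolve and lock
newestMatching = { 'symfony/console': '4.2.2' };      // a patch release is published
const laterInstall = install(manifest, firstInstall); // the lockfile wins
const afterUpdate = update(manifest);                 // an explicit update re-locks

console.log(firstInstall); // { 'symfony/console': '4.2.1' }
console.log(laterInstall); // { 'symfony/console': '4.2.1' }
console.log(afterUpdate);  // { 'symfony/console': '4.2.2' }
```

The point of the design is that new releases only reach your project when you explicitly ask for them.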

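npm is a bit different: npm install will create a lockfile (package-lock.json) if none exists, but it doesn't treat an existing lockfile as strictly as Composer does. If package.json and package-lock.json disagree (for example, after you edit a version range by hand), npm install will resolve the affected dependencies again and rewrite the lockfile. This means that npm install isn't guaranteed to be reproducible. For that, npm provides npm ci, which installs strictly from the lockfile and fails if it is out of sync with package.json. Yarn has yarn install --frozen-lockfile.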

In the Ruby ecosystem, Bundler does conservative updating: like composer install, bundle install will install only from your Gemfile.lock if it's present; however, if you change the Gemfile manually, then bundle install will try to update the relevant locked versions in Gemfile.lock.

Go modules make use of a go.mod file, which is kind of like composer.json and composer.lock rolled into one. Like composer.json, it contains a description of your dependencies, you can create it manually, and you can edit it to add a new dependency. Like composer.lock, it may also contain the dependencies of your dependencies, and it is the sole source of truth for your build. go mod download, go build, and most other commands will fetch all dependencies in this file. Go uses the go.mod file for deterministic installs, but in a different way from our other examples: the versions specified in the file are not "locked", but instead treated as the minimum allowed version. Go then uses a process called minimal version selection to figure out what versions will satisfy all your dependencies. Since it chooses the minimum possible version, builds are guaranteed to be reproducible.
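Here is a rough sketch of the idea behind minimal version selection: walk every requirement reachable from your app and, for each module, keep the highest of the minimum versions anyone asked for. The module names and the requirement graph are invented, versions are written without Go's leading "v", and this is an illustration of the idea rather than Go's implementation.

```javascript
function compareVersions(a, b) {
  const partsA = a.split('.').map(Number);
  const partsB = b.split('.').map(Number);
  for (let i = 0; i < 3; i++) {
    if (partsA[i] !== partsB[i]) return partsA[i] - partsB[i];
  }
  return 0;
}

// Hypothetical requirement graph: "module@version" -> minimum versions it requires.
const requirements = {
  'app@1.0.0': { A: '1.2.0', B: '1.1.0' },
  'A@1.2.0': { C: '1.3.0' },
  'B@1.1.0': { C: '1.4.0' },
  'C@1.3.0': {},
  'C@1.4.0': {},
  'C@1.5.0': {}, // published, but nothing requires it, so it is never selected
};

function minimalVersionSelection(root) {
  const selected = {};
  const seen = new Set();
  const queue = [root];
  while (queue.length > 0) {
    const node = queue.shift();
    if (seen.has(node)) continue;
    seen.add(node);
    for (const [name, minimum] of Object.entries(requirements[node] || {})) {
      // Keep the highest of the requested minimums for each module.
      if (!selected[name] || compareVersions(minimum, selected[name]) > 0) {
        selected[name] = minimum;
      }
      queue.push(`${name}@${minimum}`);
    }
  }
  return selected;
}

console.log(minimalVersionSelection('app@1.0.0'));
// { A: '1.2.0', B: '1.1.0', C: '1.4.0' }
```

Notice that publishing C 1.5.0 changes nothing: the result depends only on the versions written in the go.mod files, which is what makes the build repeatable without a separate lockfile.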

Another important difference, though a bit out of our scope today, is the syntax for version ranges:

On npm, ^ allows minor release updates while ~ allows only patch release updates (docs).
Composer supports both ^ and ~, but ~ behaves slightly differently: it will allow minor or patch release updates, depending on the version specified (docs).
In Ruby gems, ~> behaves like Composer's ~ (docs).
As mentioned earlier, you don't specify version ranges in Go, only the minimum allowed version.
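The operators above can be compared by spelling out the bounds they expand to. The helpers below are a simplified sketch with made-up names: they assume a major version of 1 or higher, no pre-release tags, and at least MAJOR.MINOR in the input.

```javascript
function parts(version) {
  return version.split('.').map(Number);
}

// npm: ^1.2.3 means >=1.2.3 and <2.0.0 (minor and patch updates allowed).
function npmCaret(version) {
  const [major, minor = 0, patch = 0] = parts(version);
  return { min: `${major}.${minor}.${patch}`, below: `${major + 1}.0.0` };
}

// npm: ~1.2.3 means >=1.2.3 and <1.3.0 (patch updates only).
function npmTilde(version) {
  const [major, minor = 0, patch = 0] = parts(version);
  return { min: `${major}.${minor}.${patch}`, below: `${major}.${minor + 1}.0` };
}

// Composer: ~1.2 means >=1.2.0 and <2.0.0, but ~1.2.3 means >=1.2.3 and <1.3.0.
// Only the last segment that was written is allowed to increase.
function composerTilde(version) {
  const given = parts(version);
  const [major, minor = 0, patch = 0] = given;
  const below = given.length >= 3 ? `${major}.${minor + 1}.0` : `${major + 1}.0.0`;
  return { min: `${major}.${minor}.${patch}`, below };
}

console.log(npmCaret('1.2.3'));      // { min: '1.2.3', below: '2.0.0' }
console.log(npmTilde('1.2.3'));      // { min: '1.2.3', below: '1.3.0' }
console.log(composerTilde('1.2'));   // { min: '1.2.0', below: '2.0.0' }
console.log(composerTilde('1.2.3')); // { min: '1.2.3', below: '1.3.0' }
```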

Integrity checking

One more thing that varies between platforms is whether they use the lockfile for integrity checking or not.

Version locking is one part of solving the reproducible builds challenge. Another part is integrity checking. With version locking, Composer can transform our request for package A in a certain range into a specific version of A, located at a specific URL. But how do we guarantee that that URL always returns the same contents? A URL is an address to something on the web, but there's no guarantee that it will always have the same content. This can happen in so many ways:

I could publish this blog post and later edit it so it says something entirely different, without changing the URL.
I could change the code in my repo, shalvah/clara, while keeping the Git tag name (and hence the URL) as 2.6.0. This means that someone installing 2.6.0 later on would get a different codebase.
Someone could change the DNS on your machine so that the URL for the package points to their own server, or they could use a man-in-the-middle attack to intercept your requests and send you something different.

Integrity checking is a way of verifying that the content you downloaded is the content you're expecting. It typically works by storing a hash of the content. This hash is either provided by you, the user (the safer option), or computed the first time the package is installed. On subsequent installs, if the hash of the incoming content does not match the expected hash, the installation will be aborted, since it means that the content has been changed. In browsers, this is known as subresource integrity.
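As a minimal sketch of that flow, assuming Node.js and its built-in crypto module: hash the content on the first install, store the result, and refuse any later download whose hash differs. The function names and the package contents are hypothetical; the "sha512-" plus base64 format mirrors the style of integrity strings used by subresource integrity.

```javascript
const crypto = require('node:crypto');

// Compute an integrity string: algorithm name, a dash, then the base64 digest.
function integrityOf(content) {
  const digest = crypto.createHash('sha512').update(content).digest('base64');
  return `sha512-${digest}`;
}

// Abort (by throwing) if the downloaded content does not match the stored hash.
function verifyIntegrity(content, expected) {
  const actual = integrityOf(content);
  if (actual !== expected) {
    throw new Error(`Integrity check failed: expected ${expected}, got ${actual}`);
  }
  return content;
}

// First install: compute the hash and record it (this is what a lockfile would store).
const original = Buffer.from('module.exports = () => "hello";');
const lockedIntegrity = integrityOf(original);

// Later install, same bytes: passes.
verifyIntegrity(original, lockedIntegrity);

// Later install, different bytes behind the same URL: rejected.
const tampered = Buffer.from('module.exports = () => stealTokens();');
try {
  verifyIntegrity(tampered, lockedIntegrity);
} catch (error) {
  console.log(error.message.startsWith('Integrity check failed')); // true
}
```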

How do different package managers implement integrity checks?

npm

First off, the npm registry requires tags to be immutable. This means that once a version 2.1 is tagged on npm, it can never be changed. To change that code, you must release a new tag.

On the client side, the npm package manager stores content hashes in the package-lock.json. Here's a small snippet from the file:

{
  "dependencies": {
    "@google-cloud/common": {
      "version": "3.5.0",
      "resolved": "https://registry.npmjs.org/@google-cloud/common/-/common-3.5.0.tgz",
      "integrity": "sha512-10d7ZAvKhq47L271AqvHEd8KzJqGU45TY+rwM2Z3JHuB070FeTi7oJJd7elfrnKaEvaktw3hH2wKnRWxk/3oWQ==",
      "optional": true,
      "requires": {
        "@google-cloud/projectify": "^2.0.0",
        "@google-cloud/promisify": "^2.0.0",
        "arrify": "^2.0.1"
      }
    }
  }
}

Like with Composer, the version and URL to the package (in the resolved field) are stored. The hash of the content is stored in the integrity field.

Composer

Composer prefers to use Git commit hashes, as those are immutable (once you make a commit, its hash cannot be changed; amending a commit creates a new hash). It stores this hash in the reference key, like we saw earlier:

        {
            "name": "symfony/console",
            "version": "v5.2.1",
            "source": {
                "type": "git",
                "url": "https://github.com/symfony/console.git",
                "reference": "47c02526c532fb381374dab26df05e7313978976"
            },
            "dist": {
                "type": "zip",
                "url": "https://api.github.com/repos/symfony/console/zipball/47c02526c532fb381374dab26df05e7313978976",
                "reference": "47c02526c532fb381374dab26df05e7313978976",
                "shasum": ""
            }
        }

There's also a shasum field, which is a fallback for when the package doesn't use Git. In those cases, Composer will hash the content (or use a specified hash) and store it in this field. Subsequent installs will be checked against that hash.

Bundler

Bundler's registry, Rubygems.org, also makes tags immutable. Bundler creates a Gemfile.lock, but doesn't do any integrity checking in it. Here's what a Gemfile.lock looks like:

GEM
  remote: https://rubygems.org/
  specs:
    activesupport (6.1.3)
      concurrent-ruby (~> 1.0, >= 1.0.2)
      i18n (>= 1.6, < 2)
      minitest (>= 5.1)
      tzinfo (~> 2.0)
      zeitwerk (~> 2.3)
    addressable (2.7.0)
      public_suffix (>= 2.0.2, < 5.0)
    marcel (0.3.3)
      mimemagic (~> 0.3.2)
    method_source (1.0.0)
    mimemagic (0.3.5)
    mini_mime (1.0.2)
    minitest (5.14.4)
    msgpack (1.4.2)
    nio4r (2.5.5)

You can see that it's a pretty simple list of all dependencies (including sub-dependencies), locked at a specific version, and the gems they require. The remote field specifies the registry that Bundler will pull the gems from, so it doesn't need to store separate URLs for each gem; it trusts that https://rubygems.org/downloads/mimemagic-0.3.5.gem will always hold version 0.3.5 of the mimemagic gem.
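To illustrate how little information this format needs, here is a rough sketch that reads a GEM section, collects the locked versions, and derives the download URL from the remote. It is a toy parser written for this post's example, not Bundler's; it ignores platform-specific gems and every other section of a real Gemfile.lock.

```javascript
const lockfileText = `GEM
  remote: https://rubygems.org/
  specs:
    addressable (2.7.0)
      public_suffix (>= 2.0.2, < 5.0)
    marcel (0.3.3)
      mimemagic (~> 0.3.2)
    mimemagic (0.3.5)
`;

function parseGemSection(text) {
  const result = { remote: null, gems: {} };
  for (const line of text.split('\n')) {
    const remote = line.match(/^  remote: (\S+)$/);
    if (remote) {
      result.remote = remote[1];
      continue;
    }
    // Locked gems are indented by exactly four spaces; their requirements by six.
    const spec = line.match(/^    (\S+) \(([^)]+)\)$/);
    if (spec) {
      result.gems[spec[1]] = spec[2];
    }
  }
  return result;
}

// The URL is derived from the registry, the gem name and the locked version.
function downloadUrl(remote, name, version) {
  return `${remote}downloads/${name}-${version}.gem`;
}

const parsed = parseGemSection(lockfileText);
console.log(parsed.gems);
// { addressable: '2.7.0', marcel: '0.3.3', mimemagic: '0.3.5' }
console.log(downloadUrl(parsed.remote, 'mimemagic', parsed.gems.mimemagic));
// https://rubygems.org/downloads/mimemagic-0.3.5.gem
```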

There's an open discussion on adding integrity checks to Bundler, but no progress yet. However, Rubygems does generate a hash for each library version when it's uploaded (displayed on the gem page), so you could do a manual check by generating the hash after downloading and comparing. It also supports signing of gems, but most gems aren't signed.

Go modules

The Go module system uses a separate file, go.sum, to store hashes for the modules it downloads.

For example, a go.mod file like this

module myawesomeapp

go 1.14

require github.com/pborman/uuid v1.2.1

could lead to a go.sum file like this:

github.com/google/uuid v1.0.0 h1:b4Gk+7WdP/d3HZH8EJsZpvV7EtDOgaZLtnaNGIu1adA=
github.com/google/uuid v1.0.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/pborman/uuid v1.2.1 h1:+ZZIw58t/ozdjRaXh/3awHfmWRbzYxJoAdNJxe/3pvw=
github.com/pborman/uuid v1.2.1/go.mod h1:X/NO0urCmaxf9VXbdlT7C2Yzkj2IKimNn4k+gtPdI/k=

The go.sum file contains hashes for github.com/pborman/uuid, as well as github.com/google/uuid, which is a dependency of github.com/pborman/uuid. The hashes in the go.sum file are computed locally from the module's contents, but on the first install, they are fetched from Go's checksum database.
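Each go.sum line has three fields: the module path, a version, and a hash. When the version field ends in /go.mod, the hash covers only that module's go.mod file; otherwise it covers the module's contents. A small sketch of reading the file shown above (the parser and its field names are made up for illustration):

```javascript
const goSum = `github.com/google/uuid v1.0.0 h1:b4Gk+7WdP/d3HZH8EJsZpvV7EtDOgaZLtnaNGIu1adA=
github.com/google/uuid v1.0.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo=
github.com/pborman/uuid v1.2.1 h1:+ZZIw58t/ozdjRaXh/3awHfmWRbzYxJoAdNJxe/3pvw=
github.com/pborman/uuid v1.2.1/go.mod h1:X/NO0urCmaxf9VXbdlT7C2Yzkj2IKimNn4k+gtPdI/k=
`;

function parseGoSum(text) {
  const suffix = '/go.mod';
  return text
    .trim()
    .split('\n')
    .map((line) => {
      const [modulePath, versionField, hash] = line.trim().split(/\s+/);
      const goModOnly = versionField.endsWith(suffix);
      return {
        modulePath,
        version: goModOnly ? versionField.slice(0, -suffix.length) : versionField,
        covers: goModOnly ? 'go.mod file' : 'module contents',
        hash,
      };
    });
}

const entries = parseGoSum(goSum);
for (const entry of entries) {
  console.log(`${entry.modulePath}@${entry.version} (${entry.covers}): ${entry.hash}`);
}
```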

Apps vs libraries

You might have noticed one distinction we've had to draw often throughout this article: apps versus libraries. Libraries are meant to be added to apps; apps are meant to be deployed and run. Locking your dependencies is something you usually want for apps, since they will run on a machine somewhere else, but how about for libraries? Libraries are different because they will always be used in the context of someone else's app. So should you include your lockfiles in your Git repo for libraries? There's a fair bit of debate on this topic (this post has 43 comments!):

npm doesn't mention apps or libraries, but says you should commit it, so that everyone works with the same dependencies.
Some popular OSS maintainers (Sindre Sorhus, Dan Abramov) believe you shouldn't commit them in libraries, so that npm install gives you the latest versions of your dependencies and you can catch any breaking changes before your users.
Yarn says you should always commit. Their argument is that your package is installed and tested more frequently by your users than by your contributors, so even if you don't lock versions, your users will probably still catch breaking changes before you do.
Composer says it's up to you.

Committing your lockfile means contributors can have stability; not committing might mean discovering issues with dependency upgrades earlier. Ultimately, it doesn't matter too much which you go with. If you want to emulate what your users are using, you should use a good CI setup for that. For instance, you can use composer update --prefer-lowest to test against the lowest supported versions of your dependencies. My build config for Scribe, for example, tests against the lowest and highest supported Laravel versions on each of PHP 7.3, 7.4 and 8.0.

Note that it's safe to commit your lockfile in your libraries. When your package is installed as a library, the package management system will ignore any lockfiles from libraries.

Lockfiles don't solve all problems

As with anything in computing, lockfiles come with their own set of problems. They tend to be large and optimized for machine-readability, not humans, so:

They clutter our Git diffs (which is why GitHub and other VCS services collapse them by default)
We don't often pay attention to them. This means that they can be a security blind spot.

However, lockfiles are definitely very useful. They help you ensure that the code running on your machine is the same as the one on production, getting rid of a whole set of problems to debug.
