Talking about go.sum

#go #gosum

As you know, Go creates two files when it manages dependencies: go.mod and go.sum.
Compared to go.mod, there is much less information available about go.sum. Of course, the importance of go.mod cannot be overstated, as this file contains almost all of the information about dependency versions. go.sum, by contrast, looks more like a build artifact of Go modules than human-readable data.
But in practice, we still have to deal with go.sum in day-to-day development (usually to resolve merge conflicts caused by this file, or to adjust its contents by hand). If you don't understand go.sum, you can't always get it right by editing it from experience alone. Therefore, to understand Go's dependency management better, it is necessary to know the ins and outs of go.sum.
Since information about go.sum is so sparse (even the official Go documentation describes go.sum in a fragmented manner), I've spent some time compiling the relevant information in the hope that readers will benefit from it.

The format of go.sum

Each line of go.sum is an entry, roughly in one of the following forms:

<module> <version> <hash>
<module> <version>/go.mod <hash>

where module is the path of the dependency and version is the version number of the dependency. hash is a string starting with h1:, indicating that the algorithm used to generate the checksum is the first version of the hash algorithm (sha256).
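To make the three fields concrete, here is a minimal sketch of splitting one go.sum line. The Entry type and the parseLine function are hypothetical names introduced for illustration; they are not part of the Go toolchain.

```go
package main

import (
	"fmt"
	"strings"
)

// Entry is a hypothetical struct holding one parsed go.sum line.
type Entry struct {
	Module  string // module path, e.g. github.com/DATA-DOG/go-sqlmock
	Version string // version with any /go.mod suffix removed
	GoMod   bool   // true if the hash covers only the go.mod file
	Hash    string // checksum, including the "h1:" prefix
}

// parseLine splits one go.sum line into its three whitespace-separated fields.
func parseLine(line string) (Entry, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return Entry{}, fmt.Errorf("want 3 fields, got %d", len(fields))
	}
	if !strings.HasPrefix(fields[2], "h1:") {
		return Entry{}, fmt.Errorf("unknown hash algorithm in %q", fields[2])
	}
	isGoMod := strings.HasSuffix(fields[1], "/go.mod")
	version := strings.TrimSuffix(fields[1], "/go.mod")
	return Entry{Module: fields[0], Version: version, GoMod: isGoMod, Hash: fields[2]}, nil
}

func main() {
	e, err := parseLine("github.com/DATA-DOG/go-sqlmock v1.3.3 h1:CWUqKXe0s8A2z6qCgkP4Kru7wC11YoAnoupUKFDnH08=")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", e)
}
```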
Some projects don't actually have a go.mod file, so the Go documentation refers to this /go.mod checksum with the phrase "possibly synthesized". Presumably, for projects without go.mod, Go will try to generate a possible go.mod and take its checksum.
If there is only a checksum for go.mod, this is probably because the dependency's source was never downloaded on its own: Go only needed its go.mod file to work out the dependency graph. For example, a vendor-managed dependency will only have a checksum for go.mod.
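As a rough illustration of what an h1: checksum covers, the sketch below follows the scheme of the golang.org/x/mod/sumdb/dirhash package as I understand it: each file is hashed with SHA-256, the digests are listed in sorted file-name order, and that listing is hashed again and base64-encoded. The hash1 function and the sample go.mod content are assumptions for illustration only. For a /go.mod entry, the file set is a single file named go.mod.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"sort"
)

// hash1 computes an "h1:"-style checksum over a set of named files.
// files maps a file name to its content; both are hypothetical inputs here.
func hash1(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names) // the listing must not depend on map iteration order

	h := sha256.New()
	for _, name := range names {
		sum := sha256.Sum256(files[name])
		// One summary line per file: hex digest, two spaces, file name.
		fmt.Fprintf(h, "%x  %s\n", sum[:], name)
	}
	return "h1:" + base64.StdEncoding.EncodeToString(h.Sum(nil))
}

func main() {
	// Made-up go.mod content for a hypothetical module.
	goMod := []byte("module example.com/demo\n\ngo 1.13\n")
	fmt.Println(hash1(map[string][]byte{"go.mod": goMod}))
}
```

Because the outer hash covers file names as well as content, renaming a file changes the checksum even when no file content changes.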
The rules for determining version are complicated by the heavy historical baggage of Go's dependency management. Working them out is like filling in a questionnaire, answering one question after another.

First, is the project tagged?

If the project is not tagged, a version number will be generated, in the following format.
v0.0.0-commitDate-commitID

For example, github.com/beorn7/perks v0.0.0-20180321164747-3a771d992973/go.mod h1:Dwedo/Wpr24TaqPxmxbtue+5NUziq4I4S80YR8gNf3Q=

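Referring to a specific commit made after a tagged release, for instance the head of a develop branch, generates a similar version number. Its base is the next patch version after the latest tag, followed by -0.:
vnextPatchVersion-0.commitDate-commitID

For example, github.com/DATA-DOG/go-sqlmock v1.3.4-0.20191205000432-012d92843b00 h1:Cnt/xQ9MO4BiAjZrVpl0BiqqtTJjXUkWhIqwuOCVtWo= (here the most recent tag before the commit is v1.3.3)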

Second, does the project use Go modules?

If the project uses Go modules, then the tag is used as the version number as-is.

For example, github.com/DATA-DOG/go-sqlmock v1.3.3 h1:CWUqKXe0s8A2z6qCgkP4Kru7wC11YoAnoupUKFDnH08=

If the project is tagged v2 or higher but does not use Go modules (it has no go.mod file), a +incompatible suffix is added to the version to distinguish it from a project that uses Go modules.

For example, github.com/google/martian v2.1.0+incompatible/go.mod h1:9I4somxYTbIHy5NJKHRl3wXiIaQGbYVAs8BPL6v8lEs=

Third, is the project's Go module at major version v2 or higher?

For more information about the v2+ feature of Go modules, please refer to Go's official documentation: https://blog.golang.org/v2-go-modules. In simple terms, it distinguishes different major versions of a dependency within the same project by suffixing the dependency path with the major version, similar to the effect of gopkg.in/xxx.v2.

For projects that use a v2+ Go module, the module path has a version number suffix.

For example, github.com/googleapis/gax-go/v2 v2.0.5 h1:sjZBwGj9Jlw33ImPtvFviGYvseOtDM7hkSKB7+Tv3SM=
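The second and third questions can be combined into one rough classifier for a (module path, version) pair. The sketch below uses hypothetical helpers majorOf and describe. It assumes that a version of v2 or higher is only valid with either a matching /vN path suffix or a +incompatible marker, and it deliberately ignores the gopkg.in style of suffix (.v2).

```go
package main

import (
	"fmt"
	"strings"
)

// majorOf returns the major part of a version, e.g. "v2.0.5" -> "v2".
func majorOf(version string) string {
	if i := strings.Index(version, "."); i > 0 {
		return version[:i]
	}
	return version
}

// describe classifies a (module path, version) pair taken from a go.sum line.
func describe(path, version string) string {
	major := majorOf(version)
	switch {
	case strings.HasSuffix(version, "+incompatible"):
		return "tagged " + major + " without go.mod"
	case strings.HasSuffix(path, "/"+major):
		return "module with /" + major + " path suffix"
	case major == "v0" || major == "v1":
		return "v0/v1 module or pseudo-version"
	default:
		return "inconsistent: " + major + " needs a path suffix or +incompatible"
	}
}

func main() {
	fmt.Println(describe("github.com/googleapis/gax-go/v2", "v2.0.5"))
	fmt.Println(describe("github.com/google/martian", "v2.1.0+incompatible"))
	fmt.Println(describe("github.com/DATA-DOG/go-sqlmock", "v1.3.3"))
}
```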

The benefits of go.sum

Go introduced go.sum into dependency management to achieve the following goals:

(1) Validate the content of dependencies in a distributed environment

Unlike other package management mechanisms, Go takes a distributed approach to package management. This means there is no trusted center that verifies the consistency of each package.

In mainstream package management mechanisms, there is usually a central repository that ensures the content of each release is not tampered with. For example, on PyPI, even if a released version has a serious bug, the publisher cannot re-release the same version, only a new one. (But a released version, or the whole project, can still be deleted, as the left-pad incident on npm showed, so mainstream package management mechanisms are not strictly Append Only either.)

Go doesn't have a central repository. Even if the publisher is honest, the publishing platform can be evil. So the only option is to store, in each project, the checksums of all the components it depends on, to ensure that no dependency is tampered with.

(2) Act as a transparent log to enhance security

Another special feature of go.sum is that it records not only the checksums of the current dependencies but also the checksum of every dependency version used in the project's history. This follows the concept of a transparent log: an Append Only log that raises the cost of tampering and makes it easier to audit which records have been tampered with. According to Proposal: Secure the Public Go Module Ecosystem, go.sum keeps every historical checksum in this way to facilitate the work of the checksum database (sum db).
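The Append Only property can be checked mechanically: every line of the old go.sum should still be present in the new one. The sketch below is one way to do that; entries and removedEntries are hypothetical helpers, and the sample lines reuse placeholder hashes rather than real checksums.

```go
package main

import (
	"fmt"
	"strings"
)

// entries returns the set of non-empty lines in a go.sum file's content.
func entries(goSum string) map[string]bool {
	set := make(map[string]bool)
	for _, line := range strings.Split(goSum, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			set[line] = true
		}
	}
	return set
}

// removedEntries lists lines that exist in oldSum but not in newSum.
// An empty result means the change was append-only.
func removedEntries(oldSum, newSum string) []string {
	newSet := entries(newSum)
	var removed []string
	for _, line := range strings.Split(oldSum, "\n") {
		if line = strings.TrimSpace(line); line != "" && !newSet[line] {
			removed = append(removed, line)
		}
	}
	return removed
}

func main() {
	// Placeholder hashes, not real checksums.
	oldSum := "common/lib A h1:xxx\ncommon/lib B h1:yyyy\n"
	newSum := "common/lib A h1:xxx\ncommon/lib D h1:wwww\n"
	fmt.Println(removedEntries(oldSum, newSum))
}
```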

The downside of go.sum

Needless to say, go.sum also causes some trouble.

(1) It easily causes merge conflicts

I'm afraid this is the most criticized part of go.sum. Since many projects do not manage releases by tagging, every commit is effectively a new release, so pulling their code every now and then inserts a new record into go.sum. The fact that go.sum also records indirect dependencies makes the situation even worse. The impact of such projects can be significant: by my rough count, records of this kind make up about 40% of the lines in go.sum. For example, golang.org/x/sys appears in one project's go.sum with as many as 37 different versions.
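A count like the one above can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: versionsPerModule is a hypothetical helper, a version and its /go.mod entry are counted as the same version, and the sample input uses made-up versions with placeholder hashes.

```go
package main

import (
	"fmt"
	"strings"
)

// versionsPerModule counts the distinct versions recorded for each module.
// A version and its "/go.mod" entry count as one version.
func versionsPerModule(goSum string) map[string]int {
	seen := make(map[string]map[string]bool)
	for _, line := range strings.Split(goSum, "\n") {
		f := strings.Fields(line)
		if len(f) != 3 {
			continue // skip blank or malformed lines
		}
		version := strings.TrimSuffix(f[1], "/go.mod")
		if seen[f[0]] == nil {
			seen[f[0]] = make(map[string]bool)
		}
		seen[f[0]][version] = true
	}
	counts := make(map[string]int, len(seen))
	for module, versions := range seen {
		counts[module] = len(versions)
	}
	return counts
}

func main() {
	// Made-up versions and placeholder hashes for illustration.
	sample := `golang.org/x/sys v0.0.0-20180830151530-49385e6e1522/go.mod h1:xxx
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:yyy
golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a h1:zzz
`
	fmt.Println(versionsPerModule(sample)["golang.org/x/sys"])
}
```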

If it were just an inexplicably large number of lines, it would be a nuisance at worst. But in a scenario where several people collaborate and several internal shared libraries with frequent releases are in use, go.sum can be a real headache.

Imagine this scenario:

The public library is initially at version A.
Developer A's branch a relies on public library version B, and developer B's branch b relies on public library version C. They each add records to go.sum as follows:

# branch a
common/lib A h1:xxx 
common/lib B h1:yyyy

# branch b
common/lib A h1:xxx 
common/lib C h1:zzzz

After that, the public library releases version D, which contains the features of version B and version C.
Then branch a and branch b are merged into the trunk, and that is when the merge conflict occurs.

Now there are two options.

Incorporate both intermediate versions into go.sum.
Choose neither B nor C, and just go with version D.

Whichever option is chosen, manual intervention is required. This certainly adds unnecessary work.
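The first option amounts to taking the union of the two conflicting files. Here is a minimal sketch of that idea; mergeSums is a hypothetical helper, the hashes are the placeholders from the scenario above, and plain string sorting is an assumption that need not match the order the go command itself would write.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mergeSums returns the sorted, de-duplicated union of the lines of two
// go.sum files, which keeps every record from both sides of a conflict.
func mergeSums(a, b string) string {
	set := make(map[string]bool)
	for _, content := range []string{a, b} {
		for _, line := range strings.Split(content, "\n") {
			if line = strings.TrimSpace(line); line != "" {
				set[line] = true
			}
		}
	}
	lines := make([]string, 0, len(set))
	for line := range set {
		lines = append(lines, line)
	}
	sort.Strings(lines)
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	branchA := "common/lib A h1:xxx\ncommon/lib B h1:yyyy\n"
	branchB := "common/lib A h1:xxx\ncommon/lib C h1:zzzz\n"
	fmt.Print(mergeSums(branchA, branchB))
}
```

The second option is usually better left to the go command: once go.mod points at version D, running go mod tidy should add the missing checksums and drop entries that are no longer needed.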

(2) No real constraint on third-party libraries that behave irresponsibly

The intention of go.sum is to provide a tamper-proof guarantee: if the actual content of a third-party library differs from the recorded checksum when it is pulled, the build exits with an error. However, that is about all it can do. go.sum's detection puts more of a burden on the users of a library than on its developers. In package managers with a central repository, troublemakers can be stopped at the source from changing a released version. The constraints imposed by go.sum, by contrast, are purely ethical.

If a library messes with a released version, every project that depends on it fails to build. The user of the library seems to have no remedy other than to curse, rebuke the author in an issue or elsewhere, and update the go.sum file. The author of the library made the mistake, but the user of the library is the one in trouble. This is not a very clever design.

One possible solution would be an official mirror of the various versions of well-known libraries. Well-known repositories usually don't make the mistake of messing with released versions, but if it happens (or is forced by circumstances beyond the author's control), at least a mirror is available. However, this leads back to the path of a single central repository.
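The detection step described at the start of this point boils down to comparing a recorded checksum with a freshly computed one. The sketch below illustrates only that comparison; the key type, the verify function, and the error text are hypothetical and do not reproduce the go command's actual behavior or messages.

```go
package main

import (
	"fmt"
)

// key identifies one go.sum record (hypothetical type for this sketch).
type key struct {
	Module  string
	Version string
}

// verify compares the checksum computed from downloaded content with the
// one recorded earlier. It returns an error only when both exist and differ.
func verify(recorded map[key]string, k key, computed string) error {
	want, ok := recorded[k]
	if !ok {
		return nil // nothing recorded yet; a real tool would add a record
	}
	if want != computed {
		return fmt.Errorf("%s@%s: checksum mismatch (recorded %s, downloaded %s)",
			k.Module, k.Version, want, computed)
	}
	return nil
}

func main() {
	// Placeholder hashes, not real checksums.
	recorded := map[key]string{
		key{Module: "common/lib", Version: "B"}: "h1:yyyy",
	}
	// The author re-published version B with different content.
	err := verify(recorded, key{Module: "common/lib", Version: "B"}, "h1:changed")
	fmt.Println(err)
}
```

The check can only report the mismatch. Nothing in it can stop the library author from changing the release, which is the asymmetry described above.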

(3) In practice, manual editing of go.sum is inevitable.

For example, as mentioned earlier, go.sum has to be edited by hand to resolve merge conflicts. I have also seen projects that keep only the checksums of the latest version of each dependency in go.sum. If go.sum is not fully managed by the tool, how can you guarantee that it is Append Only? And if go.sum is not Append Only, how can you use it as a transparent log?