
Shan Desai

Originally published at shantanoo-desai.github.io

How Reproducibility + Documentation in Software Go Deep: A Practical Scenario

Problem

I have been working on a personal project, which I recently open-sourced, called Komponist.
In a nutshell, the project relies heavily on some of the most widely used tools in today's
software landscape, namely Ansible, Docker, and Docker Compose.

Everything was smooth sailing, with a new feature planned, when I suddenly started getting a
strange error while bringing my container stack up using:

docker compose --project-directory=deploy up -d

The error was rather mysterious and not very verbose. It screamed something like:

Error response from daemon: Could not find the file / in container <container_id>

This seemed rather odd: the same thing that had worked a couple of days back was now rendered
buggy! As usual, common searches didn't lead me anywhere. What was rather weird was the fact that
I was never mounting any root directory (/) anywhere. The good thing about this error was that
I had only two realms to look into: either the Docker Compose v2 tool or the Docker Engine.

Hackin' Around

Since this was a sudden bug I was faced with, I initially suspected something going wrong with
the Compose file (docker-compose.yml) that my tool generates. However, this felt rather
counter-intuitive because the tool itself also takes the initiative to validate all my Compose files.
Had there been any error, the validation logic would have failed and hence would not have generated
a production-ready file in the first place.

From previous recollections, I was conscious of the User ID within the container causing problems,
especially when trying to mount secrets into Docker containers. In essence, my tool mounts secrets (credentials)
into the container via Docker Compose's secrets specification, which copies the values under the /run/secrets directory
within a container.
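
To make this concrete, here is a minimal sketch of what such a Compose file could look like (the grafana service and secret names are hypothetical examples, not Komponist's actual output):

services:
  grafana:
    image: grafana/grafana-oss:latest
    secrets:
      - admin_password    # ends up as /run/secrets/admin_password in the container

secrets:
  admin_password:
    file: ./credentials/admin_password.txt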

This feature, although great, has some small caveats. As a security best practice, most containers have a dedicated user within
them, which does NOT have enough privileges to actually access the secrets mounted into the /run/secrets directory.
The only exception is a container running as the root user (which is generally not the best idea in the first place).

A common fix is to add the user parameter to each service with the current host's User ID (e.g., 1000), which
runs the container with the same user ID as the host's user. This was already part of the current system
in my project.

I started hacking around, thinking this particular value might be the main culprit, so I tried adding
various combinations like user: "1000:1000" or user: "<my_host-machine_name>", but nothing worked.
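
For illustration, the variations on the hypothetical service above looked roughly like this:

services:
  grafana:
    image: grafana/grafana-oss:latest
    user: "1000"                      # host's UID only
    # user: "1000:1000"               # host's UID:GID
    # user: "<my_host-machine_name>"  # host's username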

One last dumpster dive into the GitHub Issues of the Moby Project, which is essentially the Docker Engine,
landed me on Issue 34142, which has been open since 2017 (yes, you read that right!).

At wit's end, I thought that if the host's user had stopped working, I might as well give the user within the image
a shot. For instance, with a Grafana container:

$ docker run -it grafana/grafana-oss:latest whoami
grafana

So updating my user parameter to user: "grafana" suddenly got everything working!
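
In Compose terms, the temporary fix boiled down to something like this (again, using the hypothetical grafana service from above):

services:
  grafana:
    image: grafana/grafana-oss:latest
    user: "grafana"    # the user baked into the image, instead of a host UID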

A temporary fix for the project! Hurrah!

Reporting the Behaviour

I am generally very thorough when reporting bugs / issues to open-source projects. Essentially, I start
by mentioning my host, all the affected software versions (here, Docker Engine / Docker Compose v2),
and then provide a minimal example that anyone can reproduce. Since the fix pertained to Docker Compose v2's
user property, I reported a bug on GitHub's docker/compose repo.
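
The version information for such a report can be gathered with commands along these lines (a sketch; what exactly a project asks for varies):

$ uname -a                   # host OS / kernel details
$ docker version             # Docker Engine client and server versions
$ docker compose version     # Docker Compose v2 version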

The maintainers had a look into it, but then came the nightmare of an answer:

Tried to reproduce it, I got something else!

The dread of: "It works on my machine, can't reproduce it"

After a few rounds of back and forth with the maintainer, and even trying out certain examples, it felt like I had messed up
my Docker Engine with some configuration. That wasn't the case, since my configuration file was not the least bit
complicated and was close to an out-of-the-box Docker Engine.

It seemed like this was going to be one of those GitHub issues that is left open and gets lost in the
oblivion of open-source projects. Everything seemed haywire.

One Last Effort!

As the back and forth between my bug report and the maintainer's non-reproducible errors started to wind down,
I reverted to an older Docker Engine version to see if a sudden update had started causing this error on my end.

For context, the issue appeared when using Docker Engine v24.x with Docker Compose v2.18.1. I downgraded the
Docker Engine to the last v23.x.x release and, would you know it, things started to work again!
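
On Debian/Ubuntu, downgrading the Engine amounts to pinning the package version, roughly like this (the exact version string depends on your distribution and may differ on your machine):

$ apt-cache madison docker-ce    # list the available Engine versions
$ sudo apt-get install \
    docker-ce=5:23.0.6-1~ubuntu.22.04~jammy \
    docker-ce-cli=5:23.0.6-1~ubuntu.22.04~jammy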

I was able to pinpoint that things started going wrong with the leap of faith from Docker Engine v23 to v24,
and I mentioned this difference in the issue's thread.

So I was convinced that something was breaking between the Docker Engine versions, but I wasn't sure whether there was
some discrepancy in Docker Compose v2 too.

This led me back to a compatibility analysis I had done about a year earlier, called
Exploring the Marshland: Docker Compose Versions,
where I was able to isolate different versions of the software to determine which versions worked with which
specific parameters. The tool I had used was Vagrant.

Isolate, Compare, Determine!

I took the initiative of creating two isolated Vagrant boxes with the following software:

Vagrant Box    Docker Engine Version    Docker Compose Version(s)
Box 1          v23.0.6                  v2.18.1, v2.17.3, v2.16.0
Box 2          v24.0.2                  v2.18.1, v2.17.3, v2.16.0
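
A minimal multi-machine Vagrantfile for such a setup might look like the sketch below (the base box and provisioning lines are assumptions, not the exact contents of my repo; it also assumes Docker's apt repository is already configured in the guest):

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/jammy64"

  # Box 1: Docker Engine v23.0.6
  config.vm.define "engine-v23" do |box|
    box.vm.provision "shell",
      inline: "apt-get update && apt-get install -y docker-ce=5:23.0.6-1~ubuntu.22.04~jammy"
  end

  # Box 2: Docker Engine v24.0.2
  config.vm.define "engine-v24" do |box|
    box.vm.provision "shell",
      inline: "apt-get update && apt-get install -y docker-ce=5:24.0.2-1~ubuntu.22.04~jammy"
  end
end

The different Compose v2 versions can then be dropped into each box as standalone CLI plugin binaries and tested one by one.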

Documenting the Discrepancy!

I decided that I would rather add GIFs of the error reproduction, as opposed to adding a lot of
verbosity to an already complicated problem, and create a repo that would reduce the time needed to reproduce the bug.
This does come with the assumption that the maintainer would also have Vagrant, or something similar, to test with at their
end.

I was able to reproduce the same bugs in a properly isolated environment using Vagrant, disproving my theory that I had
messed up my Docker Engine, and suggesting that the maintainer should perhaps try things out again at their end.

The repo can be found on GitHub.

Inference

Upon providing an environment to reproduce this error, as well as some GIF documentation, it turned out that this was
an error even the maintainer could reproduce. The error is related to Docker Engine v24 and not Docker Compose v2,
and it is now being tracked, with fixes already on the way.
See moby/moby Issue #45719.

This might look like a lot of work, and in hindsight it might be, but it is worth being comfortable with tools where a decent
amount of reproducibility via isolation can provide a strong foundation to look into software a bit more deeply!

Vagrant is meant to be a developer-friendly tool, and it indeed helped me overcome the dreadful situation of:

It worked on my machine, why isn't it working on yours?!

It also helped in collaborating on a problem which stemmed not from Docker Compose v2 but went all the way down to the Docker Engine
itself. The maintainer was able to reproduce the same error with Canonical's Multipass, which also
reached the same goal through a different route.

So, in a nutshell, a few hours of going into the deep end of error reproducibility may feel like time wasted, but it might
just solve a long-standing problem (till something new pops up!).
