Hollow Man

Posted on Aug 13, 2022 • Edited on Sep 24, 2022

My Summer of Bitcoin 2022 Project - CI for CADR

#devops #productivity #opensource #blockchain

Synopsis

Before this Summer of Bitcoin project, the Cryptoanarchy Debian Repo (CADR) lacked Continuous Integration (CI), which was a hurdle for new contributors because setting up the development environment can be complex. I implemented the CI using GitHub Actions default runners. It can be triggered manually, by opening a PR, or by pushing directly to the master branch.

The CI is divided into 2 jobs:

The first one is the build job. It builds the podman environment image and uploads it to Artifacts for reuse. Then, inside that podman environment, it builds the CADR packages, checks their sha256 sums, and uploads the built Debian packages together with their sha256 sums to Artifacts.
The second one is the test job. It runs when the build job finishes, and the tests run in parallel, one job per package. Each test job first downloads the image and package Artifacts uploaded by the build job, then uses make test-here-basic-% and make test-here-upgrade-% (% stands for the package name) to run the tests.
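The checksum step of the build job can be pictured roughly as follows. This is only a sketch: the helper names and the SHA256SUMS file name are assumptions made for illustration, not taken from the actual workflow, and a throwaway file stands in for a real package.

```shell
#!/bin/sh
# Hypothetical helper: record the SHA-256 sums of all .deb files in a directory.
write_checksums() {
    (cd "$1" && sha256sum ./*.deb > SHA256SUMS)
}

# Hypothetical helper: verify the packages against the recorded sums.
# Returns non-zero if any file no longer matches.
verify_checksums() {
    (cd "$1" && sha256sum -c SHA256SUMS > /dev/null)
}

# Demo on a temporary directory with a stand-in "package".
demo_dir="$(mktemp -d)"
printf 'not a real package\n' > "$demo_dir/example.deb"
write_checksums "$demo_dir"
verify_checksums "$demo_dir" && echo "checksums match"
```

Anyone who downloads the packages together with the checksum file could run the same verification locally.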

Road of Implementation

First try

At first, I overlooked the fact that a dockerfile for running CADR already existed (although it was not meant for building), and set up my own dockerfile from scratch, adding dependencies whenever I encountered an error.

My own dockerfile turned out to work fine on GitHub Actions for the build process, but failed for the test process.
Initially I thought it was just more missing dependencies, since the tests worked on my own computer.

When checking the logs I found that it was due to an unshare issue.

Then I noticed that adding the --privileged parameter fixes the unshare issue for docker. But a systemd issue came right after it. Finally I noticed the dockerfile for CADR that already exists in the codebase and, following how it works, I made systemd start as the first process:

VOLUME [ "/sys/fs/cgroup" ]
CMD ["/lib/systemd/systemd"]

I also added --tmpfs /tmp --tmpfs /run --tmpfs /run/lock -v /sys/fs/cgroup:/sys/fs/cgroup:ro as additional parameters to share some host machine resources, and ran the container as a daemon. Operations are then done using docker exec against the running container. But this still didn't fix the systemd issue.

Finally I switched to podman, which supports --systemd=true, and the systemd issue got fixed. I happened to find this article when I Googled the issue.
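A rough sketch of what running the tests in such a podman container could look like is shown here. The image tag cadr-ci and the container name cadr-ci-run are made up for illustration, and running the tests through podman exec is an assumption carried over from the docker exec approach above. By default the script only prints the commands, so it can be read and tried without podman installed.

```shell
#!/bin/sh
# Assumptions: "cadr-ci" (image) and "cadr-ci-run" (container) are
# hypothetical names. Set DRY_RUN=0 to really invoke podman.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Start the container detached; the image's CMD is expected to be systemd,
# and --systemd=true asks podman to set the container up for it.
start_container() {
    run podman run --detach --name cadr-ci-run --systemd=true cadr-ci
}

# Run a command inside the already-running container.
in_container() {
    run podman exec cadr-ci-run "$@"
}

start_container
in_container make test-here-basic-electrs
```

Any extra options the real workflow passes to podman are not covered by this sketch.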

However, a new issue then appeared, reporting failed to override dbcache.

I couldn't see any errors in the workflow logs, even though the missing bc dependency had been fixed and I could confirm that the command here works correctly in the container running on my PC. Moreover, when I manually skipped the test just mentioned, a later test reported an unneeded reindex.

This suggests that it may be related to issue 108. I have opened an issue here. I can skip that test too, but then a further error suggests that electrs is not available. It can be fixed with this PR.

Even if those issues are fixed, another platform may still be needed, since the full test requires too much disk space.

Testing only the regtest part works fine, with no space issue.

Fixing the tests

The first issue was to find out why overriding dbcache fails. It took me quite a few weeks to find the solution. Originally, bc is invoked when determining the bitcoin dbcache size. But bc is not guaranteed to be installed, so this can fail. Instead of doing maths wizardry, I submitted a PR to just match on ranges: RAM < 1024 -> default dbcache, RAM < 2048 -> dbcache=512, ..., RAM >= 8192 -> dbcache=4096. However, the failed to override dbcache error still appeared when running in the GitHub Actions workflow with this PR. This was very odd, because when I ran the test manually on my PC in the podman container, using the same steps as the workflow, the issue did not appear. So the issue seemed to be specific to the GitHub Actions environment, even though the tests run inside the container.
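The range matching can be sketched with plain shell integer comparisons, so that bc is not needed. This is an illustration only, not the code from the PR: the function name is hypothetical, and because the intermediate tiers are elided above, everything between 2048 and 8192 simply falls back to the 512 tier here as a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print a dbcache value for a given amount of RAM (in MB).
# Prints nothing when the default dbcache should be kept.
dbcache_for_ram() {
    ram_mb="$1"
    if [ "$ram_mb" -lt 1024 ]; then
        :  # RAM < 1024: keep the default dbcache
    elif [ "$ram_mb" -ge 8192 ]; then
        echo 4096
    else
        # RAM < 2048 maps to 512. The tiers between 2048 and 8192 are not
        # listed in the text, so this branch is a placeholder for them too.
        echo 512
    fi
}

for ram in 512 1500 16384; do
    echo "RAM=$ram -> dbcache=$(dbcache_for_ram "$ram")"
done
```

Since only integer tests are involved, the function behaves the same in a minimal container as on a fully equipped machine.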

Then, after a lot of trial and error, I finally found the culprit. As explained in the PR, debcrafter previously dropped all capabilities, which seems to cause errors when the capabilities of the host kernel do not match those known to setpriv. We need to drop only the capabilities supported by the current kernel.
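As a rough illustration of the idea, and not the actual debcrafter change, the capabilities known to the running kernel can be enumerated from /proc/sys/kernel/cap_last_cap, which holds the highest capability index. The sketch below turns that into a comma-separated list of -cap_N entries, assuming the numeric cap_N notation described in the setpriv manual page. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build a list that drops capabilities 0..last,
# e.g. "-cap_0,-cap_1,-cap_2" for last=2.
drop_list() {
    last="$1"
    i=0
    list=""
    while [ "$i" -le "$last" ]; do
        # Append with a comma separator, except for the first entry.
        list="${list:+$list,}-cap_$i"
        i=$((i + 1))
    done
    echo "$list"
}

# On Linux, the highest capability index of the running kernel is exposed here.
if [ -r /proc/sys/kernel/cap_last_cap ]; then
    drop_list "$(cat /proc/sys/kernel/cap_last_cap)"
fi
```

A list built this way never names a capability the kernel does not know, which is the mismatch that dropping everything at once seemed to run into.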

Then I submitted a PR to fix this, and it got merged.

The second issue was to find out why the unneeded reindex error occurs. I noticed that although the test failed on test-here-basic-electrs with unneeded reindex when running make test, running make test-here-basic-electrs alone did not fail. So before fixing this, I had two guesses. Either it was still related to issue 108, where after updating existing non-pruned nodes in experimental, -reindex was used despite pruning not being changed. Or the test environment was not cleaned up before test-here-basic-electrs when executing make test. The latter turned out to be right, so I submitted a PR to clean up the chain mode marker, since we want it to be clean with package_clean_install.sh.

I also submitted a PR to pin the linked-hash-map version to 0.5.4, which fixes a debcrafter build issue that came up during the project period.

The tests succeed on GitHub Actions when run with regtest only and once PRs 200, 201, 203, and 204 are merged.

Research on cloud provider

I also researched which cloud service to use, as we intended to use a GitHub self-hosted runner. I checked the cloud service providers AWS, Google Cloud and Azure. I found that GitHub's default hosted runners run on Azure, so Azure would be a good choice for a self-hosted runner, since we would be using the same provider. Also, Azure offers a one-month trial with 200 dollars of credit for new accounts, so we could start for free, while the other cloud service providers don't offer such a great discount.

Investigation on other possible CI platforms

I found that, unlike GitHub Actions, which runs CI in a virtual machine, Travis CI, GitLab CI/CD and Jenkins all run the CI in predefined Docker containers without systemd support. So it would not be possible to run a systemd-enabled podman container inside such a container. Azure DevOps is a valid option for our use case, and I also tried it. But I noticed that, for a single CI job, it allows at most 1 hour per step, after which the step gets cancelled, while our build job and full net test job both last much longer than 1 hour, so it is not suitable for us either. In the end, GitHub Actions was the only good choice.

Self-hosted runners for GitHub Actions

I also tried setting up self-hosted runners for GitHub Actions myself on the Azure Virtual Machines service, but the environment doesn't seem to be cleaned automatically between runs, so it gets contaminated by previous runs.

Divide and Conquer

Finally I came up with a new way of testing. We can use divide and conquer to get around both the disk space shortage and the contaminated environment when running tests on both mainnet and regtest. Since make test is composed of make test-here-basic-% and make test-here-upgrade-% (% stands for the package name), we can use the matrix in GitHub Actions to test each package in a separate environment. Then the disk space is enough, and PR 204 can be closed because we now always have a clean environment.
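The split can be pictured with a small shell sketch. The function name is hypothetical and the two package names are only examples taken from this post. In the real CI each package is handled by its own matrix entry on a fresh runner, whereas the loop below merely prints the commands one entry would run.

```shell
#!/bin/sh
# Hypothetical helper: print the two make targets that cover one package.
targets_for() {
    pkg="$1"
    echo "test-here-basic-$pkg"
    echo "test-here-upgrade-$pkg"
}

# Each loop iteration stands in for one matrix entry.
for pkg in bitcoin-regtest electrs; do
    for target in $(targets_for "$pkg"); do
        echo "make $target"
    done
done
```

Because every package starts from a clean machine, leftovers from one package's tests cannot affect the next, and each runner only needs disk space for a single package's tests.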

I implemented and tested that method, and it works! So now the CI runs the full test, and I have reached the project goal with a solution that costs no money for CI building and testing!

The CI succeeds once the earlier test-fixing PRs are merged.

Final Deliverables

After my mentor @Kixunil's review, I began to use the --locked parameter to make cargo use Cargo.lock, which contains the correct versions. I also fixed a security issue by creating a user account named user and building/testing as that user, and made the CI upload the sha256sum results for the built deb packages so people can check the hashes.

In addition, I fixed the tests for bitcoin-regtest after nosettings was enabled for bitcoind, in PR 205. However, there seems to be some kind of bug in core that makes the wallet location differ between the GitHub runner and a physical machine, so that PR was closed and another solution, which makes the tests independent of the wallet location, was committed.

The CI runs successfully with the above changes and everything works fine, so there is nothing left to do.

Conclusion

My Summer of Bitcoin project was a great experience for me, as I learned a lot about Bitcoin and related DevOps knowledge.

If you are interested in Bitcoin and would like to start contributing to Cryptoanarchy Debian, my work lets you simply fork the repo and add more tests or new packages by committing the code to your fork's master branch on GitHub. The CI will build the deb packages for you and help locate possible errors, so there is no need to set up the development environment on your own computer.

Summary of PRs

Merged

Closed as resolved in another way

Top comments (3)

nprojectcharles • Aug 18 '22

Great to have this. It will improve community bonding more and more, and will help new PRs get automatically checked for the required tests and missing compiler issues.

prakhar728 • Aug 14 '22

Wow you really gained a lot of skills through the program! I tried last year but couldn't make it through. Any tips on how to get enrolled in the mentorship?

Hollow Man • Aug 17 '22

For my part, to get enrolled, I would say you must have a pretty detailed plan and a thorough investigation to show that your solution is feasible and that you can finish the job excellently. You can check my proposal for reference. If you have already done all that, keep your fingers crossed 😂. After all, the program is highly competitive, and in 2022 they only selected 83 people out of 20317 applications!