DEV Community

Tomasz Wegrzanowski

Open Source Adventures: Episode 16: Git Content Hash

I want to showcase another small project. This one is a feature which feels like it should be included in git, but isn't.

It's in the unix-utilities repo, which I use as a dumping ground for a bunch of potentially useful command line tools. I should probably clean it up, as quite a few are obviously obsolete by now, but there are some true gems there!

The Problem

So here's the problem. The typical CI flow at most places:

  • you have a branch X, you do a PR, CI checks that it's fine
  • you merge it, CI goes through it all over again
  • then if it passes, it gets deployed to prod

Each CI run would typically take 30min - 1h.

But hold on, why are we doing CI on master if we already CI'd the exact same code before and it passed? That's what git_hash is trying to solve.

When after-merge CI isn't needed

CI-on-the-branch is not problematic at all. The PR is likely going to take more than 1h to review, approve, and merge anyway, so CI running concurrently doesn't slow us down at all. What we want is to avoid CI-on-master, as that costs us a lot of extra latency.

But if you branched off current master, or merged master into your branch recently, CI-on-master is completely redundant.

But we shouldn't automatically skip CI-on-master, as you might be merging something that was branched off an older master. Just because git doesn't show conflicts doesn't mean there won't be CI failures.

So the problem is just automatically detecting such a situation. And it's super easy: just hash all files in the repo. Git is basically built of hashes of hashes all the way down, but as far as I know (and I did this years ago, so maybe it changed since) there's no easy way to get a hash of just the content.

The Code

#!/usr/bin/env ruby

require "digest/sha1"

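def git_hash(dir)
  Dir.chdir(dir) do
    tree = []
    # `git submodule` lists each submodule with its checked-out commit
    tree << ["/submodules", `git submodule`.split(/\n/).sort]
    `git ls-files`.split(/\n/).each do |path|
      # Submodule paths are listed too, but they are directories; skip them
      next if File.directory?(path)
      tree << [path, Digest::SHA1.hexdigest(File.read(path))]
    end
    # Sort to normalize order, inspect to stringify, then hash the result
    Digest::SHA1.hexdigest(tree.sort.inspect)
  end
end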

puts git_hash(ARGV[0] || ".")

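I don't really remember why I added the submodule part. My best reconstruction: git submodule prints each submodule's checked-out commit, so including its output makes the hash change when a submodule is updated. Submodule paths also show up in git ls-files, but on disk they are directories, which is why directories are skipped.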

The rest should be obvious enough. We just throw everything into tree, then sort it to normalize, inspect to stringify, and hexdigest it.
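
To see why the sort step matters, here is a small sketch that applies the same sort, inspect, and hexdigest steps to an in-memory list of path and content pairs. The content_hash helper and the sample entries are made up for illustration; they are not part of the script.

```ruby
require "digest"

# Hypothetical helper: same normalization as git_hash, but on in-memory data
# instead of files listed by git.
def content_hash(entries)
  tree = entries.map { |path, content| [path, Digest::SHA1.hexdigest(content)] }
  Digest::SHA1.hexdigest(tree.sort.inspect)
end

a = [["README.md", "hello\n"], ["lib/foo.rb", "puts 1\n"]]
b = a.reverse                                               # same files, listed in a different order
c = [["README.md", "hello\n"], ["lib/foo.rb", "puts 2\n"]]  # one file changed

puts content_hash(a) == content_hash(b) # prints true
puts content_hash(a) == content_hash(c) # prints false
```

The listing order doesn't affect the result, but any change to a path or to a file's content does.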

You can get the code here.

How to integrate it with CI

To integrate it with CI you need to do two things:

  • before you start CI, compute the content hash of the current branch and check whether CI already succeeded for that hash; if so, skip the tests and return green
  • if CI passes, send the content hash of the current branch somewhere to mark it as passed

You shouldn't save hashes of failures, as you might get flakey tests, and you definitely don't want to make that even harder on yourself.
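
The two steps, plus the rule about not recording failures, could be sketched as a small wrapper around the test run. This is only a sketch under assumptions: PassedHashes, the passed_hashes.txt file, and run_ci are hypothetical names, and a real setup would keep passed hashes somewhere all CI runs can reach (a local file is just the simplest stand-in).

```ruby
# Hypothetical store of content hashes that already passed CI.
# A plain file stands in for whatever shared storage your CI can reach.
class PassedHashes
  def initialize(path)
    @path = path
  end

  def include?(hash)
    File.exist?(@path) && File.readlines(@path, chomp: true).include?(hash)
  end

  def add(hash)
    File.open(@path, "a") { |f| f.puts(hash) }
  end
end

# Runs the block only if this content hash has not passed before.
# Returns :skipped, :passed, or :failed. Only passes are recorded,
# so a flaky failure never blocks a later retry.
def run_ci(content_hash, store)
  return :skipped if store.include?(content_hash)

  if yield
    store.add(content_hash)
    :passed
  else
    :failed
  end
end

# Assumed usage in a CI script (git_hash is the script above;
# "rake test" is a placeholder for your real test command):
#
#   store  = PassedHashes.new("passed_hashes.txt")
#   result = run_ci(`git_hash`.chomp, store) { system("rake test") }
#   exit(result == :failed ? 1 : 0)
```

Both :skipped and :passed map to a green build here; only :failed makes the CI job fail.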

I did it primarily to improve latency, but I guess it also reduces energy waste and saves you some money if you pay by usage; or saves you some load if you pay by capacity.

Was it successful?

Yes. I only really used this solution in one place, but it's applicable to a lot of projects.

Back when I created it I don't think anything like that existed, but I really wouldn't be surprised if it was independently reinvented by someone else.

Coming next

In the next few episodes I'll continue showcasing various tiny projects I did.
