Every time I'm on a longer project, I seem to end up putting together simple shell scripts to gather statistics from the project's git repository. These were always ad hoc, had to be tailored to each specific project, and the only way to follow how the numbers progressed over time was to run the scripts and save their output.
Recently I had some free time, so I decided to finally write a proper program to do this stuff for me. It can now do everything my ad-hoc scripts did (actually somewhat more), it stores whatever it computes so keeping it up to date takes very little time, and I'm happy enough with the functionality and code that I'm comfortable releasing it. So, meet Git Hammer.
The main thing my scripts always computed was the count of lines per person. Essentially, the script would run git blame on every source file and add all the counts together. This is, in a way, the core of Git Hammer too. Another feature is counting all the tests and grouping those by person as well. But Git Hammer knows about all the commits, so it can also compute many kinds of statistics based only on the commits and not their contents. At least in theory; I haven't yet implemented much of that.
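The per-author counting can be sketched like this (a simplified illustration of the idea, not Git Hammer's actual implementation): `git blame --line-porcelain` repeats the full commit headers, including an `author` line, for every source line, so tallying those headers gives line ownership.

```python
import subprocess
from collections import Counter

def count_lines_by_author(porcelain: str) -> Counter:
    """Count lines per author from `git blame --line-porcelain` output."""
    counts = Counter()
    for line in porcelain.splitlines():
        # --line-porcelain emits one `author` header per source line;
        # other headers like `author-mail` start with "author-" and
        # therefore don't match.
        if line.startswith("author "):
            counts[line[len("author "):]] += 1
    return counts

def blame_file(path: str) -> Counter:
    """Run git blame on one file and tally its lines per author."""
    output = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_lines_by_author(output)
```

Summing these counters over every source file in the repository yields the total line count per person.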
One nice feature is support for multi-repository projects. On the project where I was working when I first began to plan Git Hammer, we had the main app, but also several support libraries in other repositories. These libraries were developed by the same team, but they were kept in separate repositories so that other projects could use them too. So it makes sense to combine all these repositories under one set of statistics.
Let's take a look at some graphs that Git Hammer can draw. I'm using the dev.to repository for these. First, let's look at line counts per author:
Well, that certainly displays the case where existing code was imported into a new repository. It's also not a very good graph: the legend with the author names covers part of the data, and far from all authors are shown. Running this kind of program on a repository with many contributors can definitely uncover problems.
How many tests are getting written? Let's look at just the raw test counts this time.
Looks like healthy development: new tests are being written alongside new code.
We can also look at when the commits are happening. There is one graph for days of the week, and another for hours of the day.
Looks primarily like a day job: the majority of commits happen Monday to Friday, roughly during business hours.
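Tallying commits by weekday or hour is straightforward once you have the commit timestamps. A minimal sketch (not Git Hammer's actual code):

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def commits_by_weekday(commit_times) -> Counter:
    """Tally commit datetimes by day of the week."""
    return Counter(WEEKDAYS[dt.weekday()] for dt in commit_times)

def commits_by_hour(commit_times) -> Counter:
    """Tally commit datetimes by hour of day (0-23)."""
    return Counter(dt.hour for dt in commit_times)
```

With GitPython, the input could come from something like `[c.committed_datetime for c in repo.iter_commits()]`, and the resulting counters feed directly into a bar chart.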
By the way, this last graph uncovered a bug. I had been very happy with my graphs, but when I first saw the hour-of-day graph for dev.to, it showed most activity happening at night. Of course, this was a time zone issue: at some point in the processing, the commit times got converted to my local time zone (Berlin). Since most of the commits happen in New York, this pushed the times 6 hours ahead. So I did what seems to be the most common advice: I store the time zone associated with each commit explicitly in the database and apply it when reading the times back for display.
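One way to implement that advice (a sketch of the idea, not the exact schema Git Hammer uses): store the commit time normalized to UTC together with the author's UTC offset, then reattach the offset when reading, so the hour-of-day reflects the author's clock rather than mine.

```python
from datetime import datetime, timedelta, timezone

def split_for_storage(commit_time: datetime):
    """Return (UTC datetime, offset in minutes) for storing in two columns."""
    offset_minutes = int(commit_time.utcoffset().total_seconds() // 60)
    return commit_time.astimezone(timezone.utc), offset_minutes

def restore_local(utc_time: datetime, offset_minutes: int) -> datetime:
    """Reattach the stored offset so the displayed hour matches the author's clock."""
    return utc_time.astimezone(timezone(timedelta(minutes=offset_minutes)))
```

GitPython's `committed_datetime` is already offset-aware, so this round-trips the author's local time exactly.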
Running git blame on every file in the repository probably sounds like it takes a long time, and it can: the main repository of my old project requires about 4 minutes. Of course, Git Hammer doesn't run this from scratch for every commit. Instead, it uses the diffs provided by git to adjust its counts only where they might have changed. Processing the dev.to repository (about 1,300 commits, 70,000 lines of code in the latest version) took only 6 minutes on my MacBook Pro.
Larger repositories are a different story. My old project has over 33,000 commits and maybe 250,000 lines of code, and it takes over 12 hours to process. Luckily, the process used only about 20% of the CPU and, even toward the end, well under 2 GB of memory, so I could keep working while it ran. Still, the time needed may grow faster than the size of the repository, so trying a truly massive repository is probably not a good idea.
Git Hammer is already close to being a usable library. That will likely be the next step: fix the things that don't make sense in a library, add configuration points where needed, and upload it to PyPI. In the longer term, I hope to build a web service that uses Git Hammer to display project statistics on the web.
Any contributions are welcome, even if it's just ideas for features. The code base is not very large, since GitPython and SQLAlchemy handle much of the heavy lifting, so it should be comprehensible to most Python developers.
Deriving statistics from code is for entertainment purposes only. Such numbers have very little meaning, none at all outside the specific project team, and should not be used as a basis for any decisions.