Assessing projects' sustainability on GitHub

#projects #github #dependencies #development

Before the quarantine, one of my former students asked me an interesting question: does GitHub send an email to the owner of a repository when an issue is opened on it? The answer is not important. As a former consultant, I tried to understand the root question. After digging a bit, I understood it: the student was using a project he had found on GitHub. That project was lacking a feature, and the issue was meant to request it. After a casual glance, I realized I personally would never have used such a project. The reason is that I had performed preliminary checks on the project. More importantly, I also realized those checks were not obvious to everybody. This post describes my process for judging the reliability of a GitHub project before relying on it.

NOTE: I’ll use the student’s project as an example, but there’s no judgement involved.

Commit timestamps of files

The first indicator is the commit timestamp of files. If the latest commit is years old, that’s a bad sign. If it’s months old, it depends on the other commits. The key is to have more or less regular commits.

This information is easily accessible from the homepage of the project:

In this example, while some files were committed years ago, others were committed much more recently. Two months is IMHO perfectly acceptable. This is a good sign.

The only good reason for a project not to have been touched in years is that it’s feature-complete. That can be assessed quite easily: either from the documentation, or from the fact that a lot of third parties are using the project.
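The rule of thumb above can be sketched as a small function. This is only an illustration: the function name and the 90-day and 365-day thresholds are my own assumptions, not hard limits, and the dates are passed in by hand rather than fetched from GitHub.

```python
from datetime import date


def classify_last_commit(last_commit: date, today: date) -> str:
    """Classify a project by the age of its latest commit."""
    age_days = (today - last_commit).days
    if age_days <= 90:   # assumed threshold: roughly three months
        return "recent"
    if age_days <= 365:  # assumed threshold: roughly one year
        return "check the other commits"
    return "stale unless feature-complete"


# A latest commit about two months old, as in the sample project
print(classify_last_commit(date(2020, 3, 1), date(2020, 5, 3)))  # recent
```

A "stale" result is not a verdict by itself: as noted above, a feature-complete project can legitimately go untouched for years.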

README file

A README file is a sign of a project whose contributors thought about being used by others. As such, it increases the confidence in the reliability.

If there’s no README, documentation is a good substitute - though I’ve never witnessed any project with documentation but no README file.

The sample project has a non-trivial README file. This is another good sign.

LICENSE file

One should never ever use a project without checking the license first. A lack of license is a strong signal that the project was not designed with third-party usage in mind. Worse, using such a project might lead to legal problems for users down the road.

On GitHub, the LICENSE file is the standard way to communicate the license. It should be one of the official Open Source licenses. Be cautious with non-standard licenses. While they may seem more open, e.g. the Do What the Fuck You Want to Public License, they are also a legal risk. Unless it’s non-public work, evaluate very carefully.

The sample project uses the standard MIT license. So far, so good.
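As a rough sketch, the license check boils down to three cases: no license, a well-known license, or something else. The function name is hypothetical, and the set below is a deliberately partial list of SPDX identifiers for OSI-approved licenses that you would extend to fit your own policy.

```python
from typing import Optional

# Partial list of SPDX identifiers for OSI-approved licenses (assumption: extend as needed)
KNOWN_LICENSES = {
    "MIT",
    "Apache-2.0",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "GPL-3.0-only",
    "MPL-2.0",
}


def license_signal(spdx_id: Optional[str]) -> str:
    """Map a license identifier (or its absence) to a coarse signal."""
    if spdx_id is None:
        return "no license: do not use"
    if spdx_id in KNOWN_LICENSES:
        return "standard license"
    return "non-standard license: evaluate carefully"


print(license_signal("MIT"))    # standard license
print(license_signal("WTFPL"))  # non-standard license: evaluate carefully
print(license_signal(None))     # no license: do not use
```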

Number of contributors

An additional indicator of whether a project can be relied upon is its number of contributors. A high number hints at the interest and engagement of the community; a low number hints at the opposite.

Besides, it’s also related to the bus factor:

The bus factor is a measurement of the risk resulting from information and capabilities not being shared among team members, derived from the phrase "in case they get hit by a bus."

However, the correlation between the number of contributors and the bus factor is not that good. The number of contributors is the overall number of persons who actually contributed anything during the whole life of the project. Contributions can be huge or small, and regular or once-in-a-lifetime. Somebody could have been very active in the past, and stopped contributing.

Hence, this one-sided metric should be broken down further. It’s hard to generalize here, so let’s have a look at the sample project:

Of the six contributors:

Two contributed only once, k and A
One made two contributions, p
Three are significant contributors, M, J and c

Among the three main contributors, we can see that though c doesn't make a lot of contributions, he makes them regularly.
Meanwhile, M is the project's creator as well as the largest contributor.
However, he hasn't done anything on the repo for a couple of months.
Coupled with the history of J, it looks as if M "gave" the project to J:
one can see a short period of overlap where both contributed, and then nothing more from the former, while the latter took over.

With that in mind, the confidence in the project is going down a bit.
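To make the breakdown less subjective, one can estimate a crude bus factor from commit counts: the smallest number of contributors who together account for more than half of all commits. This is only one possible definition, the function name is my own, and the commit counts below are invented for illustration; they are not the sample project's real numbers.

```python
from typing import Mapping


def bus_factor(commits_by_author: Mapping[str, int], threshold: float = 0.5) -> int:
    """Smallest number of authors whose commits exceed `threshold` of the total."""
    total = sum(commits_by_author.values())
    covered = 0
    # Walk through authors from the most to the least active
    ranked = sorted(commits_by_author.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        covered += count
        if covered > total * threshold:
            return rank
    return len(commits_by_author)


# Hypothetical commit counts, shaped like the sample project's six contributors
commits = {"M": 120, "J": 60, "c": 15, "p": 2, "k": 1, "A": 1}
print(bus_factor(commits))  # 1
```

A result of 1 means a single person wrote more than half of the commits. It says nothing about whether that person is still active, which is why the contributor's history matters as well.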

Contributor history

Now, some contributors might not have been working diligently on the project recently.
It's understandable, as long as it's just a temporary hiccup, and not a definite trend.

Let's look at the contributions' history of J:

From the contribution chart, we can see J started contributing in May 2019, and stopped in March 2020. The distribution looks a lot like a Gaussian one. The activity shows that the latest contribution was the creation of J's GitHub Pages.
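The same reading can be approximated from monthly contribution counts. In the sketch below, the function name and the counts are made up; only the first and last active months (May 2019 and March 2020) come from the chart described above.

```python
from datetime import date
from typing import Mapping, Optional, Tuple


def months_since_last_contribution(
    monthly_counts: Mapping[Tuple[int, int], int], today: date
) -> Optional[int]:
    """Whole months between the last active (year, month) and today, or None if never active."""
    active = [month for month, count in monthly_counts.items() if count > 0]
    if not active:
        return None
    last_year, last_month = max(active)  # tuples compare by year, then month
    return (today.year - last_year) * 12 + (today.month - last_month)


# Hypothetical counts keyed by (year, month); only the date range matches the chart
history = {
    (2019, 5): 3,
    (2019, 8): 12,
    (2019, 11): 20,
    (2020, 1): 9,
    (2020, 3): 1,
    (2020, 4): 0,
}
print(months_since_last_contribution(history, date(2020, 5, 3)))  # 2
```

Two silent months alone could be a temporary hiccup. It is the combination with the bell-shaped activity curve and a last contribution unrelated to the project that suggests a trend.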

At this point, my trust in the project's sustainability is gone.

Summary

While a single indicator doesn't mean much, the matrix of all indicators paints a pretty good picture of the project's reliability.
Yet, some indicators are stronger than others.

INDICATOR	STRENGTH
Commit timestamp	+++
README	++
LICENSE	++
Number of contributors	+
Distribution of contributors	+++
Contributor history	+
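One possible way to use this table is to turn each "+" into one point of weight and compute the share of the total weight carried by the indicators that pass. This scoring scheme is my own simplification, not a formal method, and the verdicts below are just one reading of the sample project (the verdict for the number of contributors in particular is an assumption).

```python
# Weights taken from the table: one point per "+"
WEIGHTS = {
    "Commit timestamp": 3,
    "README": 2,
    "LICENSE": 2,
    "Number of contributors": 1,
    "Distribution of contributors": 3,
    "Contributor history": 1,
}


def reliability_score(verdicts):
    """Share of the total weight carried by indicators marked True."""
    passed = sum(w for name, w in WEIGHTS.items() if verdicts.get(name, False))
    return passed / sum(WEIGHTS.values())


# One reading of the sample project (assumed verdicts)
verdicts = {
    "Commit timestamp": True,
    "README": True,
    "LICENSE": True,
    "Number of contributors": True,
    "Distribution of contributors": False,
    "Contributor history": False,
}
print(round(reliability_score(verdicts), 2))  # 0.67
```

A single number hides which indicators failed, so it is better treated as a prompt to look closer than as a pass/fail gate.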

Conclusion

Constraining the number of one’s dependencies is a lofty goal. However, unless one wants to reinvent the wheel, it’s often better to reuse existing projects in non-trivial software development efforts. Apart from widespread and famous projects, one must err on the side of caution when depending on a codebase outside one’s control. I hope this post will help programmers make informed decisions regarding code dependencies.

Originally published at A Java Geek on May 3, 2020.