Today, collecting data produced by the software we use daily is pretty straightforward. Reporting keeps getting simpler as well. What often goes wrong is deciding which data to collect and how the data and reports are interpreted and used.
This article outlines the pitfalls of commonly used software development metrics and what we may use as a more impactful alternative.
In software development, you have probably seen these metrics pass by at least a few times:
- Number of story points per sprint (“velocity”)
- Average number of Git commits per day
- Number of pull requests or reviews per week
- Percentage of code (test) coverage and its trend
- Number of resolved tickets or bug reports per month
These are “traditional”, output-based metrics, easily deducible from the tools that software development teams use daily: Jira, Trello, or their choice of issue tracker, GitHub or GitLab, and language-specific code coverage tools such as PHPUnit or pytest-coverage.
All of these metrics can be interesting in their own right, provided they are used in the correct context and with the proper interpretation. Unfortunately, teams or managers, often under pressure and short on time, can be tempted to rely on these (and only these) metrics and draw the wrong conclusions about the performance of individuals or teams or how they compare to others.
When data- and output-based metrics are used to measure and track people’s performance, I believe there are at least two significant issues:
- When aware of these metrics, people can influence them to make themselves look good.
- You never have enough measurable data to form a complete, trustworthy picture.
The effect of people “gaming the system” is also called Goodhart's law, coined by British economist Charles Goodhart in 1975. It is defined as:
“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”
In essence, people can (un)consciously influence the outcomes of the metrics in ways that don’t have the intended effect.
For example, an easy way to increase the percentage of code coverage is to add tests that make few or no assertions. Code coverage may go up, but you gain next to nothing as the application’s behaviour or results remain unchecked. If your goal as a manager was to increase quality and avoid regressions, you are in for an unpleasant surprise.
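The coverage example can be made concrete with a minimal sketch. The function `apply_discount` and both tests are hypothetical and only serve as an illustration: the two tests execute the same happy-path line and contribute the same coverage for it, yet only the second one would catch a broken formula.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: reduce `price` by `percent`."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)


def test_discount_without_assertions():
    # Runs the happy path, so a coverage tool counts those lines as covered.
    # Nothing checks the result: a wrong formula would still pass.
    apply_discount(200.0, 25.0)


def test_discount_checks_behaviour():
    # Covers the same happy path, but verifies the returned value...
    assert apply_discount(200.0, 25.0) == 150.0
    # ...and also checks that invalid input is rejected.
    try:
        apply_discount(200.0, 150.0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for percent > 100")
```

A coverage report alone cannot tell these two tests apart, which is why the percentage says little about quality when people are rewarded for raising it.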
“But what if we combine many metrics to get a complete overview that cannot be gamed?”
Tempting as that sounds, not all aspects of software development are measurable with quantifiable data or metrics. Software development, even more so when done in teams, is a complex socio-technical system that is in large part based on human interactions.
Traditional output-based metrics, incorrectly equating “quantity of work” to value, also fail to encapsulate actual business value, impact, and client happiness. Hence, they are unfit for reporting back to investors, clients and other stakeholders.
Then there is the ethical question of whether you should measure and track individual people’s output at all. I believe tracking (and rewarding) people individually does more harm than good: decreased cooperation, declining psychological safety, more gossiping and politics, and higher risks of burn-out.
A possible alternative is twofold. First and foremost, measure and optimise the development workflow rather than the output of people or teams. Then, start assessing the team’s business impact.
The efficiency and effectiveness of our development workflow are huge enablers to making an actual business impact. That’s why I always focus on this one first.
The easier it is for a team to experiment, adapt to changes, fix issues and resolve incidents, the better they can focus on building cooperation and team spirit and on creating that value and impact. This, in turn, positively affects team stress levels and happiness.
In the Lean sense of Kaizen and continuous improvement, let’s look for some core metrics that we can use to assess our engineering workflows. Luckily for us, Google, in their DORA research, already defined a few base metrics called the Four Keys that modern DevOps teams can use:
- Deployment Frequency: how frequently do we release to production successfully?
- Lead Time for Changes: what is the lead time for changes to hit production?
- Change Failure Rate: what percentage of production deployments cause a failure?
- Time to Restore Service: how long does it take to recover from such failure?
Notice how these metrics neither measure individual output nor incentivise teams to produce more of the same (bar the number of deployments). Instead, they give us an idea of waiting times and bottlenecks when building and deploying solutions, how often broken builds are delivered, how often regressions occur after a production deployment, and how quickly we handle them.
To collect this data, you may go with Google’s own software tool or start collecting it MVP-style:
- Deployment Frequency: check your CI/CD tools to see if you can read or export the number of deploys. If not, you may start by manually posting a message in a Slack/Teams/… channel after you deploy and track the number of these messages. I’ve found this also helps with “leading by example” to perform deploys if your team does these manually multiple times a day. Don’t forget to measure production deployments only (internal or testing environments don’t count) and subtract failed ones.
- Lead Time for Changes: the median time between the production deployment date and the dates of the (git) commits therein. If your CI/CD tools do not provide this out of the box or through a plug-in or script, this is considerably harder to keep track of manually. As an alternative, you may measure the cycle and lead times of stories instead (see Intermezzo below).
- Change Failure Rate: track the production deployments or feature releases (i.e. enabling a feature toggle) that require remediation through reverts, fail-forward fixes, etc., and work out what percentage of releases fail. Here I recommend logging incidents in an issue tracker or purpose-built tool such as Opsgenie.
- Time to Restore Service: closely related to the previous metric, how long does it take to restore the application to a fully working state? Aside from issues triggered by deployments, consider other incidents such as degraded performance, service downtimes, etc.
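If you collect the data yourself, the four numbers can be derived from two simple logs. The following is a minimal sketch, not Google’s tool: the `Deployment` and `Incident` records, their field names, and the sample data are all assumptions made for illustration. In line with the list above, deployments that needed remediation are left out of the frequency count, and lead time is the median commit-to-deploy time.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:  # hypothetical record of one production deployment
    deployed_at: datetime
    commit_times: list = field(default_factory=list)  # commits it shipped
    needed_remediation: bool = False  # revert, fail-forward fix, ...


@dataclass
class Incident:  # hypothetical record of one production incident
    started_at: datetime
    restored_at: datetime


def deployment_frequency(deploys, days):
    """Successful production deployments per day over a `days`-day window."""
    successful = [d for d in deploys if not d.needed_remediation]
    return len(successful) / days


def lead_time_for_changes(deploys):
    """Median time between each commit and the deployment that shipped it."""
    seconds = [
        (d.deployed_at - commit).total_seconds()
        for d in deploys
        for commit in d.commit_times
    ]
    return timedelta(seconds=median(seconds))


def change_failure_rate(deploys):
    """Share of production deployments that required remediation."""
    failed = sum(1 for d in deploys if d.needed_remediation)
    return failed / len(deploys)


def time_to_restore(incidents):
    """Median duration from the start of an incident to full recovery."""
    seconds = [(i.restored_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=median(seconds))


# Made-up sample data covering one week.
deploys = [
    Deployment(datetime(2024, 3, 4, 10), [datetime(2024, 3, 4, 8), datetime(2024, 3, 3, 10)]),
    Deployment(datetime(2024, 3, 5, 15), [datetime(2024, 3, 5, 12)], needed_remediation=True),
    Deployment(datetime(2024, 3, 7, 9), [datetime(2024, 3, 6, 21)]),
    Deployment(datetime(2024, 3, 8, 11), [datetime(2024, 3, 8, 10)]),
]
incidents = [
    Incident(datetime(2024, 3, 5, 15, 10), datetime(2024, 3, 5, 15, 40)),
    Incident(datetime(2024, 3, 7, 2, 0), datetime(2024, 3, 7, 3, 30)),
]

print("Deployments per day:", deployment_frequency(deploys, days=7))
print("Lead time for changes:", lead_time_for_changes(deploys))
print("Change failure rate:", change_failure_rate(deploys))
print("Time to restore service:", time_to_restore(incidents))
```

The median is used rather than the mean so that a single long-lived branch or one unusually long outage does not dominate the result.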
Having collected these numbers, you can compare your team’s performance to the benchmarks from Google’s DORA research.
You might have heard about lead time in a different context: Lean. The Lean notions of lead time and cycle time are so valuable that they, too, deserve a mention here.
In short, we can define lead time as the total time it takes for a (customer) request to pass through the entire process and value stream until it is released (to the customer). While DORA metrics measure commit lead times, measuring total production lead time gives a broader indication of the performance of our product & engineering team’s processes.
The cycle time of a change, on the other hand, can be measured by comparing the release (to the customer) of the change with the moment the development team started working on it.
What constitutes a “request” or “change” is up for interpretation. Most likely, it will translate to a user story or change request that you add to your sprint backlog or Kanban board. Measuring cycle time becomes pretty easy, provided the end state of the board equals “released to the user” and the tool at hand allows you to visualise or export the time differences between states.
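The difference between the two definitions comes down to which start timestamp you subtract from the release date. Here is a minimal sketch, assuming your board can export three timestamps per story; the `Story` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Story:  # hypothetical export of one story's state changes
    requested_at: datetime  # request was added to the backlog
    started_at: datetime    # team started working on it
    released_at: datetime   # change was released to the user


def lead_time(story: Story) -> timedelta:
    """Total time from the request until the release to the customer."""
    return story.released_at - story.requested_at


def cycle_time(story: Story) -> timedelta:
    """Time from the start of development until the release."""
    return story.released_at - story.started_at


# Made-up example: requested on 1 March, started on 11 March, released on 14 March.
story = Story(
    requested_at=datetime(2024, 3, 1, 9),
    started_at=datetime(2024, 3, 11, 9),
    released_at=datetime(2024, 3, 14, 9),
)
print("Lead time:", lead_time(story))
print("Cycle time:", cycle_time(story))
```

In this made-up case the story spends ten of its thirteen days waiting in the backlog, a delay that cycle time alone would not reveal.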
As mentioned before, assessing and communicating business impact is much more valuable to company leaders, investors and other stakeholders than technical or operational metrics, but it is also more challenging to measure.
Working towards outcomes (over fixed output), evaluate with the team whether they meet their goals and whether those goals align with the company’s strategic goals and mission. How you determine or measure this value and impact depends heavily on the company, the kind of client, and more:
- Communicate with your clients or users! Are they happy with the product? Do their business needs and expectations get met? Is their feedback asked and taken into account?
- Does the team define an outcome-oriented product roadmap? If so, evaluate whether the team met the initiatives or goals.
- Does the company define company-wide, outcome-oriented objectives and key results (OKR)? If so, this may also help with making that assessment. (Do beware that implementing OKRs comes with its pitfalls.)
If a team is struggling, overwhelmed, or otherwise fails to make the desired impact, the first step is to talk to them and (as a manager) coach them in resolving any issues.
You guessed it. Having open conversations — within a safe environment — is more valuable than (just) measuring with data and comparing teams or individuals. And by all means, continue to focus on their happiness and well-being as a whole. Happy employees lead to increased productivity, which leads to more satisfied clients and contributes to a more durable way of scaling the company.
Designing, developing, and releasing software applications is a complex undertaking. To succeed in this, we cannot afford to treat a team of engineers — or any team — as a silo. Doing so severely limits the quality and effectiveness of the solutions built and might even set up a development team for failure from the start.
This implies that — even with the improved approach to metrics and value assessment — the interplay with and influence of other teams can severely impact the success of a development team.
To start with, UX and UI analysts and experts, designers, QA engineers and testers, product managers, system engineers, and product and data analysts all contribute to the success of a solution. They should be internal to the team as much as possible.
Additionally, the other departments or teams of the company also have a significant impact, even if only for their influence on company culture: sales and marketing, customer success, finance, etc. And, of course, not to forget the quality and impact of the mission, vision, and strategies set out by the leadership team.
When setting up any initiative for improvement, even within a single team, take the entire system into account. Is there a deeper root cause? Do we require a more substantial culture shift? Or is this a purely operational issue that a team can fix themselves?
- The Fallacy of the 100% Code Coverage, an article where Thierry de Pauw describes how the test coverage metric can be misused.
- Google Cloud’s DevOps Research and Assessment (DORA) brings reports on the state of DevOps and is the origin of the “four keys” metrics.
- GoogleCloudPlatform/fourkeys is a software tool built by Google to generate insights from data based on the four core metrics mentioned above.
- Definitions of cycle and lead time as provided by Lean Enterprise Institute.
- A Two-Person Agile Project (and what it teaches us) explains the core elements of an effective product or project development team.