In refining a DevOps process, it can be difficult to determine whether a change had a positive effect. Continuously measuring and analyzing metrics can show whether a change is moving the system in the right direction. Before you go off and start counting lines of code or hours spent, let's talk about meaningful metrics.
Back in 2019, Google and a few other heavy hitters put together a study detailing the measurements high-performing organizations share. The results are available in the State of DevOps report. It is worth the read, but if you need a good summary of the metrics, my co-worker has put one together here on Dev.to. While the metrics themselves are interesting, the reasons behind them are just as intriguing. The four metrics are as follows:
- Lead Time for Change: throughput of the software delivery process, from check-in to release
- Deployment Frequency: how often code is released to production
- Time to Restore Service: how long it takes to restore service when a service incident or user-impacting defect is detected
- Change Failure Rate: the percentage of releases that result in a degraded user experience requiring remediation
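As a concrete sketch, all four metrics can be derived from basic deployment records. The data model below is purely illustrative (field names like `committed`, `failed`, and `restore` are assumptions, not a standard); in practice the numbers would come from your CI/CD pipeline and incident tracker.

```python
# Hypothetical deployment records: commit time, release time, whether the
# release failed in production, and (if it failed) time to restore service.
from datetime import datetime, timedelta

deployments = [
    {"committed": datetime(2023, 5, 1, 9, 0),
     "released": datetime(2023, 5, 2, 14, 0),
     "failed": False, "restore": None},
    {"committed": datetime(2023, 5, 3, 11, 0),
     "released": datetime(2023, 5, 4, 10, 0),
     "failed": True, "restore": timedelta(minutes=45)},
    {"committed": datetime(2023, 5, 5, 8, 0),
     "released": datetime(2023, 5, 5, 16, 0),
     "failed": False, "restore": None},
]

# Lead Time for Change: average check-in-to-release time.
lead_times = [d["released"] - d["committed"] for d in deployments]
lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Deployment Frequency: releases per day over the observed window.
window = (max(d["released"] for d in deployments)
          - min(d["released"] for d in deployments))
deploys_per_day = len(deployments) / window.days

# Change Failure Rate: share of releases that needed remediation.
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)

# Time to Restore Service: average restoration time across failed releases.
time_to_restore = sum((d["restore"] for d in failures),
                      timedelta()) / len(failures)

print(lead_time, deploys_per_day, change_failure_rate, time_to_restore)
```

The point is less the arithmetic than the shared inputs: all four numbers fall out of the same release history, which is why they move together when the process changes.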
These measurements work together like four legs holding up the stable platform that is your application. Optimizing one metric at the expense of the others will have negative results.
Lead Time for Change seems to be the metric most organizations want to improve first. It sounds good on paper: if we get changes in faster, we are more responsive, and the application gets better. The problem is that when you push for faster changes, shortcuts get taken. Tests are not written, code is not written to be extensible, and more technical debt accrues. On occasion, a new change causes an outage because an edge case was missed.
Change Failure Rate and Time to Restore Service temper Lead Time for Change. By keeping the Change Failure Rate low or constant as you reduce your Lead Time for Change, you can maintain code quality while still enjoying a dynamic application. A solid Time to Restore Service number provides the safety net that makes those changes possible.
Deployment Frequency is another metric that seems easy to improve. There are always changes in the pipeline; just push them out as they are done and the frequency increases. Again, in the push for a better number, substandard code can be deployed as checks and balances are compromised for a single, better score.
Having a good Time to Restore Service number can mitigate the site being down for a while, but the user experience is still degraded through unplanned downtime and the lack of new features. Additionally, these rollbacks will increase the Change Failure Rate. A judicious increase in Deployment Frequency can complement a reduction in Lead Time for Change, since there is now a shorter window in which to introduce changes.
Time to Restore Service seems like it could stand on its own. After all, a robust infrastructure and a good rollback strategy can mitigate almost anything that disrupts service. Highly available, redundant clusters can allow an application to survive the destruction of its primary datacenter. Operations could build that dream infrastructure, put strong change controls and procedures in place, and drive the Time to Restore Service number down to sub-second.
But this is a DevOps article, and Development and Operations work together. Overly burdensome change controls can be hostile to development efforts and negatively affect Lead Time for Change and Deployment Frequency. We have already discussed how this measurement can rein in overzealous increases in Deployment Frequency and decreases in Lead Time for Change.
Finally, a good Change Failure Rate is a strong indicator of high-quality code. While not as attractive to the bottom line as the other indicators, it is something Development could drive on its own. Introduce some automated testing, static code analysis, code linting, and documentation, and you have great code, right? These are all good practices, but a single-minded approach to test coverage can lead to regression runs that last hours. Couple that with a requirement to complete a full regression for every merge, and the Lead Time for Change number suffers, which in turn drags down Deployment Frequency.
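One common way to keep coverage without paying the hours-long regression cost on every merge is to split the suite into a fast per-merge subset and a full nightly run. This is a minimal sketch using only the standard library's `unittest`; the test names and the `FULL_REGRESSION` environment variable are assumptions for illustration, not an established convention.

```python
import os
import unittest

class CheckoutTests(unittest.TestCase):
    def test_price_calculation(self):
        # Fast unit test: runs on every merge.
        self.assertEqual(round(19.99 * 2, 2), 39.98)

    @unittest.skipUnless(os.environ.get("FULL_REGRESSION") == "1",
                         "slow end-to-end test; run nightly only")
    def test_full_checkout_regression(self):
        # The hours-long end-to-end regression would live here.
        pass
```

Per merge you run `python -m unittest` and the slow test is skipped; a nightly job sets `FULL_REGRESSION=1` to exercise everything. Lead Time for Change stays short without abandoning the coverage that keeps the Change Failure Rate low.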
When measuring the performance of your DevOps process, it is easy to be overwhelmed by all the possible metrics. The four measurements reviewed here are a great place to start, or even finish, but for best results, it is important to use them together.