JohnN6TSM

Posted on Oct 5, 2022

A Tale of Two Codebases: One Developer’s Reflections on SOLID Software Design (Part 1 of 4)

#csharp #cleancode

I suspect that my programming background is somewhat unusual – I am a hobbyist programmer with a master’s degree in computer science. Weeks after I finished CS grad school, I entered the UCSD School of Medicine. I have spent the past twenty years as a practicing physician. I develop software in the early mornings, evenings, and weekends, largely for enjoyment, and to build the tools I use during my “day job.”

My formal computer science education spanned 1993 to 1998. I took the required software engineering classes and was taught straight waterfall design. In my 4 years of CS education, I never heard the words “unit test.” Over the intervening years I have read about test-driven development, agile methods, refactoring, and design patterns. More recently, in 2019, I read Bob Martin’s “Clean Code” and was impressed with the SOLID principles.

The purpose of this article is to compare two nontrivial codebases – one legacy codebase preceding my introduction to the SOLID principles and a second, newer codebase featuring an intentional SOLID design. This first post will ask if SOLID principles measurably change my coding practices using objective source code metrics. Next, I will compare unit testing in the two projects. A third article will compare code reuse within the two projects. Lastly I will close with an article about dependencies in the two projects.

Meet the Codebases

PhotoDoc is a legacy codebase that has grown up with my career. PhotoDoc was born in 2007 when I volunteered as an examiner for the Northwest Arctic Borough Sexual Assault Response Team. It irked me that I was being asked to make complex medicolegal decisions based on photographs, without even the basic image manipulation tools I was used to from my computer science training. My volunteer interest in medical forensics became a career in 2010 when I entered a postdoctoral fellowship in child abuse pediatrics, and PhotoDoc grew along with me. PhotoDoc picked up examination forms, growth charts, x-ray, video, and audio analysis during my fellowship. As I moved on to my current position leading the Child Abuse Pediatrics division at the Medical University of South Carolina, PhotoDoc learned how to talk to the many research, billing, and clinical databases that are so essential to my current position.

Despite being developed largely in my free time, PhotoDoc is a working codebase. PhotoDoc is used daily by the approximately 10 people unfortunate enough to report to me. The Department of Pediatrics considers it critical infrastructure for our division. I still maintain the code, but rarely feel the need for new features.

Melville.Pdf came to be in mid-2021. PDFs are important to PhotoDoc because much of the information I consume at work comes to me in PDFs. Some of the forms the state makes me fill out go out as PDF forms. My reports are printed to PDFs before being sent to partner agencies. PhotoDoc relies on three different PDF libraries, and I wasn’t happy with any of them. I was especially irked that my choices for rendering PDF were either not completely free, closely tied to Windows, or hopelessly buggy.

About this time PhotoDoc was transitioning to maintenance mode, so I was looking for a new project. I thought “how hard can it be to render PDF?” then “it sounds like fun, and I can give away a free .NET PDF renderer.” GitHub records my first commit on 6/27/2021, and 1,014 commits later Melville.PDF is available on NuGet and GitHub.

Recently, the development of these two codebases occurred to me as an interesting natural experiment on the effect of SOLID programming. The strength of this experiment is that neither codebase is a kata or a toy – they are each significant codebases with tens of thousands of lines of code. Each codebase responds to significant external requirements and was not intended to be an example of a particular architectural style. The two codebases are each the product of a single programmer (me) so team communication is not a factor, and my raw programming talent probably has not significantly changed in the past decade and a half.

Like all experiments, this one has some limitations. The two codebases do different things, and it could be that one problem is markedly harder than the other. PhotoDoc developed over time in response to shifting requirements, whereas Melville.PDF is coded to a static PDF specification. Most notably, in the 15 years that PhotoDoc has existed, C# itself has been in active development and has become more concise. When reading the code metrics, one must recall that PhotoDoc is written in a variety of historical styles, whereas Melville.PDF is exclusively modern C#. An additional limitation is that maintenance continues on PhotoDoc, including some recent efforts to clean up this legacy codebase. Thus there is some crossover, as I have been refactoring PhotoDoc toward a SOLID design.

Comparing the Codebases

This first article concludes with a simple question. Did my decision to adopt clean code result in measurable changes to the codebases?

I used the Visual Studio code analysis feature to compute various metrics for both projects. For each project I excluded unit test code. On the Melville.PDF side I included code that generates test documents and applications written pretty much exclusively to view and debug the rendering output. Test code is not unimportant, but it is different. I may look at the tests in a future part. Because this is an investigation of my personal coding practices, I also excluded the JPEG and JPEG2000 code that I copied from other libraries into the Melville.PDF codebase.

Total Code

PhotoDoc is a much larger codebase at 118,236 lines of code. Melville.PDF weighs in at 46,624 total lines of code. Despite increasingly concise C# notation over time, PhotoDoc has a greater proportion of lines containing executable code (29.2% vs 26.4%, p < 0.0001). This may reflect that Melville.PDF contains hardcoded data tables for various character mapping schemes. It may also reflect that clean code emphasizes many small methods, and method declaration lines are not executable.
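For readers who want to check a comparison like this themselves, here is a minimal sketch of a pooled two-proportion z-test. The post does not say which statistical test produced the p-value, so the choice of test is an assumption, as are the class and method names. The executable-line counts in the usage note are reconstructed from the rounded percentages above and are not exact figures from the metrics report.

```csharp
using System;

public static class ProportionTest
{
    // Pooled two-proportion z statistic for x1/n1 versus x2/n2.
    // Assumes large samples; treats each line as an independent observation,
    // which is a simplification for source code.
    public static double ZStatistic(long x1, long n1, long x2, long n2)
    {
        double p1 = (double)x1 / n1;
        double p2 = (double)x2 / n2;
        double pooled = (double)(x1 + x2) / (n1 + n2);
        double standardError =
            Math.Sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2));
        return (p1 - p2) / standardError;
    }
}
```

With roughly 34,525 executable lines out of 118,236 for PhotoDoc and roughly 12,309 out of 46,624 for Melville.PDF, `ProportionTest.ZStatistic(34525, 118236, 12309, 46624)` yields a z statistic of about 11, far beyond the value of about 3.89 that corresponds to a two-sided p of 0.0001.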

Classes

Regarding classes, Uncle Bob insists “The first rule of classes is they should be small. The second rule of classes is they should be smaller than that.” (Martin, Clean Code, p. 136) Martin suggests that classes of fewer than 100 lines are easy to read, and I adopted this informal guideline in Melville.Pdf. Median class size in Melville.Pdf is smaller than in PhotoDoc (20 vs 29 lines, p < 0.0001). The overwhelming majority of classes in both projects are shorter than 100 lines.

If the target for classes is fewer than 100 lines, then classes over 200 lines are definitely suspect. PhotoDoc is only about 2.5 times the size of Melville.Pdf by total line count, yet it contains 101 classes over 200 lines against 15 in Melville.Pdf, nearly 7 times as many (p for difference of proportions < 0.0001).

Even more interesting than the raw numbers is looking at the 15 instances where Melville.Pdf classes exceed 200 lines. Seven of the 15 classes are essentially dictionaries that contain no significant computation or algorithms. The 4,422-line behemoth, for example, maps entries in the Adobe Glyph List to their Unicode equivalents. Other “dictionary classes” include flyweights for name objects used throughout the PDF spec, character mappings, and standardized Huffman tables from the JBIG specification document. Other large classes include a strongly typed property bag for drawing contexts, and a class featuring an unavoidably large switch statement that dispatches drawing commands for the PDF content stream parser. PhotoDoc, in contrast, includes multiple classes that got big because they do a lot of stuff, and they have stayed big despite recent efforts to make them smaller.

Methods

Methods were already small in PhotoDoc, with the median method being 3 lines of code. Melville.PDF reduced this to 2 lines, which is statistically significant (p = 0.0002) but may not be practically significant. The reduction in median method length remains significant (5 vs 4 lines, p < 0.0001) when simple property setters and getters are excluded.

Melville.Pdf contains only 6 methods with more than 50 lines of code. Two are methods that generate test documents and contain large amounts of quoted document content but a single control path with no loops or decisions. Two methods are unavoidably large switch statements; one dispatches content stream drawing operations and the other maps Unicode to the MacRoman character set. The last two simply create static dictionaries with hardcoded values. None of the long methods contains branching logic more complicated than a single switch statement.
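To illustrate what a long but structurally simple method looks like, here is a minimal sketch of a flat dispatch switch. The `IDrawTarget` interface, the `RecordingTarget` helper, and the `ContentStreamDispatcher` class are hypothetical names invented for this example and are not taken from Melville.Pdf. The operator strings are standard PDF content stream operators, and only a handful are shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical drawing target; not the Melville.Pdf API.
public interface IDrawTarget
{
    void MoveTo(double x, double y);
    void LineTo(double x, double y);
    void Stroke();
    void SaveState();
    void RestoreState();
}

// Hypothetical helper that records which methods were called.
public sealed class RecordingTarget : IDrawTarget
{
    public List<string> Calls { get; } = new List<string>();
    public void MoveTo(double x, double y) => Calls.Add("MoveTo");
    public void LineTo(double x, double y) => Calls.Add("LineTo");
    public void Stroke() => Calls.Add("Stroke");
    public void SaveState() => Calls.Add("SaveState");
    public void RestoreState() => Calls.Add("RestoreState");
}

public static class ContentStreamDispatcher
{
    // One flat switch: every case forwards to exactly one target method.
    // The method grows by one line per operator but never nests.
    public static void Dispatch(
        string op, IReadOnlyList<double> operands, IDrawTarget target)
    {
        switch (op)
        {
            case "m": target.MoveTo(operands[0], operands[1]); break;
            case "l": target.LineTo(operands[0], operands[1]); break;
            case "S": target.Stroke(); break;
            case "q": target.SaveState(); break;
            case "Q": target.RestoreState(); break;
            default:
                throw new NotSupportedException($"Unknown operator '{op}'");
        }
    }
}
```

A real dispatcher would need a case for every operator in the PDF specification, which is how such a method can pass 50 lines while still containing a single decision point.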

In contrast, PhotoDoc’s 29 methods over 50 lines of code are a mix of “data centric” lookups and several methods with complicated control flow.

Other Metrics

Differences in method and class size are of some interest, but I set out explicitly to trim method and class sizes, so it is not surprising that they changed. The remaining Visual Studio code metrics were not specifically targeted and provide some insight into the effect of clean coding on quality metrics which predate the clean code movement.

Median cyclomatic complexity (1) and class coupling (3) did not differ between the codebases. A slight difference in median value for Microsoft’s maintainability index (86 vs 88) likely has no practical significance. Further exploratory data analysis on these metrics did not reveal insightful differences in the interquartile range, total range, or patterns of outliers for these three metrics.
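For context on why the maintainability index moves so little, here is a sketch of the formula Microsoft documents for the Visual Studio metric. The class and parameter names are my own, and the Halstead volume used in the usage note is an arbitrary illustrative value, not a measurement from either codebase.

```csharp
using System;

public static class MaintainabilityIndex
{
    // Documented Visual Studio formula:
    // MAX(0, (171 - 5.2 * ln(Halstead Volume)
    //             - 0.23 * Cyclomatic Complexity
    //             - 16.2 * ln(Lines of Code)) * 100 / 171)
    public static double Compute(
        double halsteadVolume, int cyclomaticComplexity, int linesOfCode)
    {
        double raw = 171
            - 5.2 * Math.Log(halsteadVolume)
            - 0.23 * cyclomaticComplexity
            - 16.2 * Math.Log(linesOfCode);
        return Math.Max(0, raw * 100 / 171);
    }
}
```

Holding the other inputs fixed, `Compute(50, 1, 2) - Compute(50, 1, 3)` comes to a little under 4 points. When methods are already two or three lines long with a cyclomatic complexity of 1, the formula has little room left to register further improvement.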

Discussion

Empirical code metrics confirm that I successfully adopted cleaner coding practices. Both classes and methods are significantly smaller in Melville.PDF than they are in PhotoDoc. Classes shrank more than methods did in the transition. Most interestingly, other quality metrics did not significantly change.

The size of classes decreased by a larger factor than the size of methods, but this is likely because the pre-intervention methods were already quite small. Subjectively, one of the biggest differences I noted in writing the two projects was stronger insistence on the single responsibility principle for classes. In the PhotoDoc codebase, many classes had “subparts” delineated by #region blocks with extra-private fields that, by convention, should only be manipulated within that block. In Melville.PDF these subparts are separated into their own classes. Melville.PDF also makes extensive use of readonly struct types to logically contain related methods without putting additional pressure on the garbage collector.
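The difference between the two styles can be sketched as follows. All type names here (`ImageViewerBefore`, `ZoomState`, `ImageViewerAfter`, `PixelRange`) are hypothetical and invented for illustration; none of this code comes from PhotoDoc or Melville.Pdf.

```csharp
using System;

// Before: a "subpart" lives inside a larger class, fenced off by a #region.
public class ImageViewerBefore
{
    public string Title { get; set; } = "";

    #region Zoom
    // "Extra-private" by convention only: any other member of this class
    // can still read or write zoomFactor.
    private double zoomFactor = 1.0;
    public double Zoom => zoomFactor;
    public void ZoomIn() => zoomFactor = Math.Min(zoomFactor * 2, 16);
    public void ZoomOut() => zoomFactor = Math.Max(zoomFactor / 2, 0.25);
    #endregion
}

// After: the subpart is its own class, so the compiler enforces the boundary.
public class ZoomState
{
    private double factor = 1.0;
    public double Factor => factor;
    public void ZoomIn() => factor = Math.Min(factor * 2, 16);
    public void ZoomOut() => factor = Math.Max(factor / 2, 0.25);
}

public class ImageViewerAfter
{
    public string Title { get; set; } = "";
    public ZoomState Zoom { get; } = new ZoomState();
}

// A readonly struct groups related methods around a small immutable value.
// Used as a local or a field, it needs no separate heap allocation.
public readonly struct PixelRange
{
    public int Start { get; }
    public int Length { get; }
    public PixelRange(int start, int length) { Start = start; Length = length; }
    public int End => Start + Length;
    public bool Contains(int index) => index >= Start && index < End;
}
```

In the "after" version the viewer class gets shorter and the zoom logic can be tested without constructing a viewer, which is the kind of change that lowers class size without necessarily changing per-method metrics.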

I was very surprised that cyclomatic complexity, class coupling, and maintainability index did not differ significantly between the codebases. The Melville.PDF codebase subjectively “feels” easier to maintain, as will be discussed in future parts, but this difference was not reflected in objective code metrics. This may be because two of the metrics, cyclomatic complexity and maintainability index, are very method-centric and my pre-intervention methods were already quite short.

Conclusion

This first part has proposed a natural experiment – comparing two large codebases each written by the same single programmer before and after introduction of SOLID software design. In this first part we have confirmed that SOLID design principles have measurably changed characteristics of the code, but that those changes were largely restricted to measures that are directly targeted in the SOLID guidelines. Three classic metrics of code maintainability were unchanged by the introduction of SOLID design.

In the next part I will look at the testability of the two codebases.
