JohnN6TSM

A Tale of Two Codebases (Part 3 of 4): Code Reuse

As I discussed in Part 1, the premise of this series is a simple natural experiment: comparing two large codebases written by the same solo programmer before and after the introduction of SOLID design principles. PhotoDoc, the pre-intervention project, is an electronic medical record dedicated to medical forensics. [Melville.PDF](https://github.com/DrJohnMelville/Pdf), the post-intervention project, is a free, open-source PDF renderer for .NET. In this article I discuss code reuse.

Bob Martin claims that “duplication may be the root of all evil in software” (Martin, Clean Code). One of the promises of SOLID design is to divide code into reusable units. The insight of the SOLID principles is that classes are more reusable if they do only one thing. Even if I must chain several classes together to get the behavior I want, it is much easier to accumulate the desired behavior from multiple objects than to remove an unwanted feature from a big, chunky class.

Lesson 1: Large classes are difficult to reuse.

Large classes make reuse difficult. I approached the design of PhotoDoc such that all the code that touched an object’s data lived within that object's class. If data was used to do more than one thing, then the class acquired multiple responsibilities. PhotoDoc’s name implies its original purpose -- to analyze digital photos. It is no surprise that a Photo class exists. Photos can load themselves from a disk file, display metadata, manage a rich collection of pixel shaders that filter the image, and display a collection of tools, like lines, arrows, and text, that annotate the image. Using PhotoDoc’s “folders of sheets” metaphor, a PhotoSheet class holds a collection of photos.
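
To make the shape of the problem concrete, here is a minimal sketch of what such a chunky class looks like. It is illustrative only -- the members are my assumptions for this article, not PhotoDoc's actual code:

```csharp
using System;
using System.IO;

// Illustrative sketch of a "chunky" class -- not PhotoDoc's actual code.
// Every behavior that touches a photo's data lives in the one class.
public class Photo
{
    private readonly string path;   // the disk path doubles as the cache key
    private byte[]? pixels;

    public Photo(string path) => this.path = path;

    // Responsibility 1: loading and caching by disk path
    public void Load() => pixels ??= File.ReadAllBytes(path);

    // Responsibility 2: metadata display
    public string DescribeMetadata() => $"{path}: {pixels?.Length ?? 0} bytes";

    // Responsibility 3: pixel shaders that filter the image
    public void ApplyFilter(Func<byte, byte> shader)
    {
        if (pixels is null) return;
        for (int i = 0; i < pixels.Length; i++) pixels[i] = shader(pixels[i]);
    }

    // Responsibility 4: annotation tools (lines, arrows, text)
    public void AddAnnotation(string annotation) { /* track overlay objects */ }
}
```

Reusing any one of those responsibilities means dragging the other three along with it.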

Years after PhotoDoc implemented PhotoSheets, I added support for scanned documents. Scanned documents typically come in as PDF, XPS, TIFF, or DICOM files, but fundamentally they are a sequence of images. Over the years I have added metadata display, filters, tools, and annotations to the scanned documents as well. You would expect that PhotoSheets and ScannedDocuments, both representing a series of images with annotations, filters, and so on, would share almost 100% of their code – except they don’t.

As I said earlier, Photo and PhotoSheet are chunky objects that reflect my 2007 understanding of objects: they map directly onto concepts in the real world and directly implement all the behaviors of those objects. Photo classes know how to load themselves from a disk path, and they use that path as a key to prevent reloading when a photo is used more than once. For scanned documents, page images do not have a unique path – they have a path plus a page number. They get loaded using different libraries and are cached differently. While I could refactor this to a better design, it would be expensive and, so far, the problem has not been big enough to make it into active development – I suspect it never will.

By all rights, PhotoSheets and ScannedDocuments should just be alternative views of equivalent data structures. In fact, they remain very different. Photos can be associated with an injury noted on a traumagram, and scanned pages can be associated with a line in a document index, but not vice-versa. I even implemented a feature to turn a sequence of photos into a scanned document, and another to create a PhotoSheet from the pages of a scanned document. Chunky objects make even this obvious code reuse a higher-cost refactoring than I have been willing to execute.

Lesson 2: The Single Responsibility Principle Facilitates Unexpected Code Reuse

In contrast, small classes with single responsibilities plus the right abstractions create happy coincidences where objects just fit together in unanticipated patterns. One of those happy coincidences happened when I was working on content streams.

Content streams are a domain-specific language that PDF uses to describe the appearance of visual elements. Melville.PDF obviously has no choice but to implement a content stream parser that takes a content stream and produces the rendered page. The “and” in the preceding sentence is a giveaway that these are two different responsibilities, and the Single Responsibility Principle dictates that they be represented as separate classes.

Thus Melville.PDF’s content stream parser design emerges from these requirements. The IContentStreamOperations interface contains one method for each legal combination of opcode and parameters in the content stream DSL. The ContentStreamParser class accepts input bytes from a pipe and calls the corresponding methods on an IContentStreamOperations instance passed into the parser’s constructor. This design separates the concern of interpreting the bytes of the content stream from the concern of executing a sequence of drawing commands.
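
The real interface covers every content stream operator; the sketch below is heavily abridged, and its method names, signatures, and toy parser are my illustration rather than Melville.PDF's actual API, but it shows the shape of the split:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Abridged, illustrative sketch -- the real IContentStreamOperations has one
// method for every legal opcode/parameter combination in the content stream DSL.
public interface IContentStreamOperations
{
    void SaveGraphicsState();            // the "q" operator
    void RestoreGraphicsState();         // the "Q" operator
    void MoveTo(double x, double y);     // the "m" operator
    void LineTo(double x, double y);     // the "l" operator
    void StrokePath();                   // the "S" operator
}

// The parser's single responsibility is turning bytes into method calls on its
// target; it neither draws nor emits anything itself.
public class ContentStreamParser
{
    private readonly IContentStreamOperations target;

    public ContentStreamParser(IContentStreamOperations target) => this.target = target;

    public void Parse(string contentStream)
    {
        // The real parser reads from a pipe and handles every operand type;
        // this toy version handles only the five operators sketched above.
        var operands = new List<double>();
        foreach (var token in contentStream.Split(
                     new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            switch (token)
            {
                case "q": target.SaveGraphicsState(); break;
                case "Q": target.RestoreGraphicsState(); break;
                case "m": target.MoveTo(operands[0], operands[1]); operands.Clear(); break;
                case "l": target.LineTo(operands[0], operands[1]); operands.Clear(); break;
                case "S": target.StrokePath(); break;
                default: operands.Add(double.Parse(token, CultureInfo.InvariantCulture)); break;
            }
        }
    }
}
```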

Separately, it became clear that Melville.PDF would need a nontrivial library of test documents, so PDF generation capabilities were needed to support the test code. The ContentStreamWriter class implements IContentStreamOperations and responds to each method call by writing the equivalent content stream code to the designated output stream. The ContentStreamWriter separates the concern of designating which content stream operations should be produced from the concern of generating the correct syntax for those operations. Because C# requires method calls to have the proper number of correctly typed arguments, proper PDF syntax is enforced by the C# compiler and supported by C# IntelliSense.
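
Sketched against the abridged interface above (again, illustrative signatures rather than the library's real ones), the writer looks roughly like this:

```csharp
using System.IO;

// Illustrative sketch: each strongly typed method call emits the equivalent
// content stream syntax, so the compiler enforces operand counts and types.
public class ContentStreamWriter : IContentStreamOperations
{
    private readonly TextWriter output;

    public ContentStreamWriter(TextWriter output) => this.output = output;

    public void SaveGraphicsState()        => output.Write("q ");
    public void RestoreGraphicsState()     => output.Write("Q ");
    public void MoveTo(double x, double y) => output.Write($"{x} {y} m ");
    public void LineTo(double x, double y) => output.Write($"{x} {y} l ");
    public void StrokePath()               => output.Write("S ");
}
```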

Late in the development of Melville.PDF, a need developed to pretty print content streams. Debugging a renderer is a loop of three steps:

1. Find a document that renders differently in Melville.PDF than in Adobe Reader.
2. Understand what that document is doing to cause the different rendering.
3. Fix Melville.PDF to render the file the same as Adobe Reader.

Step 2 involves reading many content streams that often span thousands of lines. Because PDF is a binary format not intended for human consumption, most content streams come in a very concise format that resembles minified JavaScript. Whitespace is ignored in content streams, however, so an indented rendering is perfectly legal. I wanted a pretty printer to produce indented, easy-to-read content streams.

Conceptually, pretty printers and code minifiers can both be thought of as a parser whose output feeds a code generator that emits the source language. Conveniently, I had a parser that outputs operations to an interface and a code generator that implemented the very interface my parser targeted. Thus, creating a PDF minifier was trivial – I just had to pass the ContentStreamWriter to the ContentStreamParser and let it run.
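
With the illustrative classes sketched above, the whole minifier is just the wiring:

```csharp
using System;
using System.IO;

// Illustrative wiring: replay a parsed content stream through the writer,
// which re-emits (and effectively minifies) the source.
var output = new StringWriter();
var parser = new ContentStreamParser(new ContentStreamWriter(output));
parser.Parse("q\n  0 0 m\n  100 100 l\n  S\nQ");
Console.WriteLine(output.ToString());   // "q 0 0 m 100 100 l S Q "
```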

Converting my dirt cheap minifier to the pretty printer required only the creation of an IndentingContentStreamWriter to implement the “pretty” part of pretty printing. IndentingContentStreamWriter runs just over 100 lines of code, most of which is spent designating which constructs should begin or end indented regions. IndentingContentStreamWriter delegates writing operations to a contained ContentStreamWriter so the class itself is focused on the single responsibility of adding whitespace to the content stream output. The entire process of implementing this feature took about 30 minutes.
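
A sketch of that decorator-style structure, again using the illustrative interface rather than the real method set, and with the indentation rules simplified to the "q"/"Q" pair:

```csharp
using System.IO;

// Illustrative sketch: indentation is this class's only concern; the wrapped
// writer (e.g. a ContentStreamWriter sharing the same output) still generates
// all of the actual content stream syntax.
public class IndentingContentStreamWriter : IContentStreamOperations
{
    private readonly IContentStreamOperations inner;
    private readonly TextWriter output;
    private int depth;

    public IndentingContentStreamWriter(IContentStreamOperations inner, TextWriter output) =>
        (this.inner, this.output) = (inner, output);

    private void NewLine() => output.Write("\n" + new string(' ', 2 * depth));

    // "q"/"Q" open and close indented regions; everything else just gets its own line.
    public void SaveGraphicsState()        { NewLine(); inner.SaveGraphicsState(); depth++; }
    public void RestoreGraphicsState()     { depth--; NewLine(); inner.RestoreGraphicsState(); }
    public void MoveTo(double x, double y) { NewLine(); inner.MoveTo(x, y); }
    public void LineTo(double x, double y) { NewLine(); inner.LineTo(x, y); }
    public void StrokePath()               { NewLine(); inner.StrokePath(); }
}
```

Feeding a minified stream through the parser with this writer as the target produces the indented output.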

The content stream pretty printer is a developer convenience that users of the library will never see. The feature only made it in because it was so unbelievably cheap: efficient debugging more than paid for the minimal development cost.

The joy of clean coding is that these serendipitous opportunities for code reuse become more, rather than less, frequent as the project continues. In Melville.PDF, the big-endian binary integer parsing from the ICC parser got reused in the CCITT and JBIG parsers, the entire CCITT parser got reused in the JBIG parser, and a byte-stream-to-bit-stream adapter was repurposed from the LZW parser into the binary image parsers. As development progresses, the developer accumulates a library of small classes with single responsibilities that form a toolbox uniquely customized to the problem domain.

Conclusion

SOLID code is reusable because classes are small, and they do one thing. Problems tend to recur within a problem domain. In reuse scenarios, augmenting a small class is much easier than removing undesired “features” from a larger class. When classes are small and focused, the probability that the next problem encountered can be solved with code that already exists and works increases with time.

The last post in this four-part series will address dependency management.
