Davyd McColl

Posted on May 21, 2020

Keeping DRY

#pragmatic #programmer #duplication

One of the earliest principles we learn about in programming is DRY:

Don't
Repeat
Yourself

It's most commonly referenced in the scope of repeated code, but this doesn't just apply to code. Let's start there anyway 😅

DRY Code

Fundamentally, we'd like to not repeat code:

it takes time and effort to write and test
it would mean that any updates to the code have to be applied in multiple places. Bug? Fix multiple times. Enhancement? Alter in multiple places. Apart from the waste of time, there's the concern that we miss one of the implementations. Parallel implementations of the same logic tend to drift apart over time

When to refactor common code

Before refactoring out that code that looks so common, first consider the rule of three. The idea here goes:

Write the first implementation
Write the second implementation. Notice the similarities. Don't refactor just yet otherwise you risk refactoring out an implementation which is difficult to extend going forward, unless the problem scope is at least one of:
- very simple (eg FullName => FirstName + " " + LastName -- and even here, there are some edge cases, but it's a good starting point)
- quite abstract (eg hash algorithm)
- very well known (eg credit card number verification)
As you're starting on the third implementation, revisit (1) and (2). With the knowledge you have about the process right now, start refactoring out the common logic. Apply this refactored version in the three places you need it.

When code duplication is not knowledge duplication

The Pragmatic Programmer gives the following example:

Given the customer constraint that a user's age must be a positive integer and that the quantity that they are ordering of a product must also be a positive integer:

function ageIsValid(value) {
  return typeof value === "number" &&
         value > 0;
}
function quantityIsValid(value) {
  return typeof value === "number" &&
         value > 0;
}

The duplicated fragment of code is obvious. The book warns, however, that this is not duplicated knowledge, as ages and quantities are different bits of information. And, to a point, the book is correct. However, I'd probably want to take this one step further. I think that the following refactor is a good idea:

function ageIsValid(value) {
  return isPositiveInteger(value);
}
function quantityIsValid(value) {
  return isPositiveInteger(value);
}
function isPositiveInteger(value) {
  return typeof value === "number" &&
         value > 0;
}

And, right now, you could even inline that where these validations are used, as there's only one validation for both. Conceptually, they're different, so having dedicated functions for them abstracts away the details from the higher-level code. Also, it makes it much easier, later on, when the company has to validate that a user's age is within a certain range -- eg over 13 to be able to use the site -- and order quantities are fulfillable:

function ageIsValid(value) {
  return isPositiveInteger(value) &&
         isOldEnoughToUseSite(value);
}
function isOldEnoughToUseSite(ageInYears) {
  return ageInYears >= 13;
}
function isValidQuantity(value) {
  return isPositiveInteger(value) &&
         haveEnoughStockFor(value);
}
function haveEnoughStockFor(orderQuantity) {
  // ... imagine an implementation here (:
}

Your de-duplication refactorings should keep foremost that code is for co-workers, not compilers, so one of the primary aims would be to make the code easier to understand -- which makes it easier to maintain.

DRY Documentation

The wars over code comments are like tabs-vs-spaces or vim-vs-emacs: everyone has something to say, there are some strong opinions and few amicable outcomes from discussions on the topic 😅

There are people who say you should document everything -- comments all over the code and jsdoc or xmldoc or docstrings which clearly state the purpose of every function in your code.

There are other people who believe that all comments are bad.

I sit somewhere in between: If you're commenting on the what, then it's not useful. If you're commenting on the why, then it can be really useful.

Take this example from the book:

/**
* Calculate the fees for this account
*  - Each returned check costs $20
*  - If the account is in overdraft for more than 3 days,
*    charge the account $10 for each day
*  - If the average account balance is greater than $2000
*    reduce the fees by 50%
*/
function calculateFeesFor(account) {
  let fees = 0;
  if (account.returnedCheckCount > 0) {
    fees += account.returnedCheckCount * 20;
  }
  if (account.overdraftDays > 3) {
    fees += 10 * account.overdraftDays;
  }
  if (account.averageBalance > 2000) {
    fees /= 2;
  }
  return fees;
}

The commentary at the top of the function clearly duplicates the logic within the function. It doesn't add to the readability or understanding and it's likely to go out of sync with the function as the internals are changed. Have you come across code where the comment says one thing and the code itself does another? I have. The problem is that new programmer Bob is tasked with changing one business requirement: people are to be allowed more free overdraft time, and the free window won't apply to fees. So Bob updates the one part:

  if (account.overdraftDays > 5) {
    fees += 10 * (account.overdraftDays - 5);
  }

and forgets to update the comment.

So let's suggest some changes.

First off, there are a lot of "magic values" in this function. I'm talking about constants (like the 5 above) which should really have a name to communicate what they are to the reader of the code. If we can get the code to explain all of the what with good names, we're left with only having to document the why, if that's even necessary.

const RETURNED_CHECK_COST = 20;
const OVERDRAFT_FEES_PER_DAY = 10;
const OVERDRAFT_FREE_DAYS = 5;
const VIP_MEMBER_BALANCE = 2000;
const VIP_MEMBER_DISCOUNT_PERCENT = 50;

function calculateFeesFor(account) {
  let fees = 0;
  if (account.returnedCheckCount > 0) {
    fees += account.returnedCheckCount * RETURNED_CHECK_COST;
  }
  if (account.overdraftDays > OVERDRAFT_FREE_DAYS) {
    let chargedDays = account.overdraftDays - OVERDRAFT_FREE_DAYS;
    fees += OVERDRAFT_FEES_PER_DAY * chargedDays;
  }
  if (account.averageBalance > VIP_MEMBER_BALANCE) {
    fees *= (VIP_MEMBER_DISCOUNT_PERCENT / 100);
  }
  return fees;
}

Which is way better -- we gave those magic values names. We can make this read easier though, with a second round of refactoring:

function calculateFeesFor(account) {
  let fees = returnedCheckFeeFor(account) +
             overdraftFeeFor(account);
  fees -= vipDiscountFor(account, fees);
  return fees;
}

function vipDiscountFor(account, fees) {
  if (account.averageBalance > VIP_MEMBER_BALANCE) {
    return fees * (VIP_MEMBER_DISCOUNT_PERCENT / 100);
  }
  return 0;
}

function overdraftFeeFor(account) {
  if (account.overdraftDays > OVERDRAFT_FREE_DAYS) {
    // local law states we can't charge overdraft fees
    // during the free period because the advertising
    // is confusing
    let chargedDays = account.overdraftDays - OVERDRAFT_FREE_DAYS;
    return += OVERDRAFT_FEES_PER_DAY * chargedDays;
  }
  return 0;
}

function returnedCheckFeeFor(account) {
  return account.returnedCheckAccount > 0
    ? account.returnedCheckAccount * RETURNED_CHECK_COST
    : 0;
}

And now someone reading through the higher-level function (calculateFeesFor) can see, at a glance, how fees are calculated. When she wants to dig into the details, she can, and each individual function uses well-named variables to communicate the what of what's going on. We can also add the why comment to explain why we don't charge overdraft fees over the free period.

DRY Data

Consider the classes:

class Line {
  public start: Point;
  public end: Point;
  length: number;
}
class Point {
  public x: number;
  public y: number;
}

This is a perfectly valid representation of a line -- it has a starting and ending point, and all lines have a length (even if zero). However, there's a data duplication here in that the two points, start and end, determine the value of the length property.

So when someone updates this line, they would have to remember to update the length too.

Better, then would be:

class Line {
  public start: Point;
  public end: Point;
  public get length(): number {
   return start.absoluteDistanceTo(end);
  }
}
class Point {
  public x: number;
  public y: number;
  public absoluteDistanceTo(another: Point): number {
   const xDelta = end.x - start.x;
   const yDelta = end.y - start.y
   Math.sqrt((xDelta * xDelta) + (yDelta * yDelta))
  }
}

In the above, there is one source of truth for the length of a line, as determined by the absoluteDistanceTo method on Point.

If performance is a concern, then we start getting clever with property get/set methods, updating the length only when start or end are modified, and returning the cached length value:

public class Line {
  public get start(): Point {
    return this._start;
  }
  public set start(value) {
    this._start = value;
    this._updateLength();
  }
  public get end(): Point {
    return this._end;
  }
  public set end(value) {
    this._end = value;
    this._updateLength();
  }
  public get length() {
    return this._length;
  }

  private _start: Point;
  private _end: Point;
  private _length: number;
  private _updateLength() {
    this._length = this._start.absoluteDistanceTo(this._end);
  }
}

DRY Representation

It's inevitable that representations will be duplicated. The simplest example is perhaps a web front-end (I'm not picking a side in this article! 😅) for a web-api back-end (again, no sides 😁): both sides of that api (or contract) have to have knowledge of that contract to be able to work. If a new client is added (say, a mobile app), then that information has to be duplicated again in the new client.

This kind of duplication can't really be avoided -- the api has to know what contract it should stick to and the client has to know what contract the api provides.

However, the burden of this repeated information can be lightened considerably by having one source of truth and generating the other sources of reference information.

For example, defining your api in Swagger and then having your api conform to it whilst it generates the schema for that contract (as well as validating information crossing that boundary) means that when a change happens, everyone knows what the change is and can be adapted to handle that change as Swagger will produce a document defining that change and there is even tooling to generate client-side interfaces if you want.

If you must have repeating data / code / representations, spend the time to figure out a way to have one source of truth for that data.

Sharing is caring

The book speaks of a Y2K audit on government software which found that the 10000 programs audited all had their own internal (often subtly different) mechanisms for validating a social security number. This seems like a good place to be DRY -- but going DRY across multiple programs and teams requires:

Good inter-team communication

The first is quite important: how will I know if a team-member (even if on a different team in the company, we're all part of some larger team) has implemented something that I need if we don't communicate? Daily stand-ups are a good starting place, but so are dedicated times for people to "show and tell" what they've been working on or have learned recently. Perhaps Jane in the European division has come up with a novel way to solve a problem. Or perhaps Karen has just discovered a library which does stuff that I'm about to write by hand. Providing a platform for people to share this knowledge is crucial to effective success in the company.

In a way, dev.to is like this -- for a larger community of people I might never meet! On that scale, it's also our responsibility to promote open-source initiatives which solve common problems so that people are spending more time solving their own interesting problems and less time solving common problems. We have a limited amount of time we can spend on tasks, so we should choose how to use that time wisely

Common methods for sharing that common code or knowledge

If you can, put that code up on Nuget, NPM or whatever is the default package manager for the ecosystem you're working in. Always ask your manager if you can open-source stuff which you feel may benefit the community at large. If the code you'd like to open-source isn't specific to your company, it shouldn't be too hard to motivate: we all benefit from open-source, so it's good to give back.

As an added bonus, your functionality will be available with minimal friction to others.

If you can't motivate for public package hosting, then look into private hosting. NPM offers paid private hosting, or you can use Verdaccio. Private nuget hosting can be done with MyGet. At the very least, you could use the git submodule feature to share code between teams: extract out the common code and put that into a git repository which people can import via a submodule!

Make it easy to use

At the end of the day, whatever choices you make in helping to keep things appropriately DRY, you have to remember to keep it easy to use. Where possible:

have a single source of truth for rules and business data
find a way to share common application logic via easily-installed packages
share your learnings and creations with the people around you

DEV Community