If in doubt, list it out

#schema #domain #migrations #analytics

Today (and for the last few weeks), I've been dealing with a bad technical decision I made about three years ago, which has suddenly come back to bite me, as the hidden requirements have suddenly come to light. I want to walk through my situation, and how I'm learning the painful lesson that it's best to enumerate domain objects rather than mush them into a lossy aggregate.

Background

I work on an inventory management platform for consignment stores — when I was hired to build the system from scratch with no prior domain experience, it sounded simple.

It's not simple. Consignment normally works with unique inventory, except when it doesn't. In addition, you have two models of paying suppliers living side-by-side: traditional retail where you pay your suppliers when you acquire the inventory (or thereabouts), and consignment, where you pay your suppliers when the inventory sells.

To make this worse, it's possible for both payout models to be applied to a single SKU, either for the same instance of an item, or for multiple instances over time.

Where I went wrong

In the name of "saving space", I compressed every item in the system into a single row in the items table.

There's nothing necessarily wrong with this, as we keep an event log that lists every update to an item, so we can get at the item's state as of a point in time (e.g., when it was taken into inventory, or sold). We also keep various special purpose logs, including balance accruals and item status changes to help us decouple supplier payout state from the current state of the item.

But the code started to smell when we had to shoehorn the concept of "status" (a concept having to do with unique inventory) into our plural-cardinality model: is this item sold? lost? in stock?

I accommodated this with a jsonb field in our database that kept track of how many items for the SKU could be attributed to a given stock, e.g., {"active": 3, "sold": 1, "lost": 1}. I knew it was a bad design, but I couldn't predict how it might cause problems, so I left it.

Discovering my mistakes

I'm now writing customer-facing inventory analytics code, which includes calculating Cost of Goods Sold. Because the state of any given enumeration of an item has been lost (consignor split, price, and cost) due to subsequent edits, I now have to add a new table called "item_sales", which lists the attributes of an item needed to calculate COGS as of the time of the sale, and fill it by replaying the relevant events.

Also, because we need to support partial refunds, I can't compress these records into one per sale, with a record of quantity; I have to mark each enumeration of an item as refunded independently.

So I'm left with the design I tried to avoid in the first place, except it's ad-hoc and very difficult to maintain and understand. If I had enumerated every item instance in the first place, I would have had all the COGS data as of the sale date. And once an item sold I could have "locked" those instances to avoid mutating the data even if its un-sold brethren were modified.

Conclusion

We left designing analytics until we already had two years worth of customer data. Analytics is pretty much your worst-case scenario for understanding your data, and contrasts strongly with a CRUD-paradigm view of what's true now with no regard for history. Even an event log (though useful for migrating to a new schema) won't save you, since events often have to be procedurally re-played rather than simply aggregated.

Learn from my mistakes — if in doubt, list it out. It's as Rich Hickey says when evangelizing for Datomic: storage space is cheap, developer time isn't. Keeping redundant copies of your data is redundant until it isn't. You have to understand what's possible in your domain, and point your design in a direction that will handle that from every angle.

DEV Community

If in doubt, list it out

Background

Where I went wrong

Discovering my mistakes

Conclusion

Top comments (0)