Daniel

Posted on Apr 19, 2022

Solutions for storing state in a database

#database #mysql #state #architecture

When designing an application, you will often need to model 'state'. State, in this sense, is related directly to a single entity. An example may be:

I have an order, and the possible status' of the order can be:

Pending
Paid
Dispatched
Delivered
Cancelled
Refunded

The order may only be associated with a single state at any time. It would not make logical sense for an order to both be delivered and returned.

Transitioning between these states will drive the behaviour of your service. For example, when an order transitions to dispatched, it is an ideal time to notify the customer of the delivery tracking details.

When modelling this type of solution, engineers are often tempted by the simplicity of having a single column on the original entities table to describe the current state. This can be achieved in a few different ways.

First Solution (and the worst)

As we just discussed, the easiest solution is to add a column to the Orders table. This column can store the current state of order.

It can be achieved by storing the state as a string:

CREATE TABLE `orders` (
 `id` INT unsigned NOT NULL AUTO_INCREMENT,
 `customer_id` INT unsigned NOT NULL,
 `order_value` INT NOT NULL,
 `state` VARCHAR NOT NULL,
 `created_at` DATE NOT NULL,
 `updated_at` DATE NOT NULL,
 PRIMARY KEY (`id`)
);

This isn't good for a lot of reasons. The most obvious one is that the application can submit any string to the state column, even one that has not been designed for in the application. This can break the application and cause problems for the service.

We could solve that issue by making the state column an enum and only allowing expected values.

ALTER TABLE orders ADD state enum('pending','paid',...);

This is still a bad solution.

We have to run an ALTER on the table every time we wish to add new states.
We cannot store additional metadata about the state as we are limited to a single column with a specific string stored.
It can be challenging to disable values. What if the store is no longer offering refunds?
ENUM datatype is not standard SQL and is not supported by many ORM.
It is complicated if we wish to output the list of possible values for state in our application.

Slightly Better Solution

It is clear from the previous approaches that having a single column denoting the state is not ideal. The next logical solution we could design is creating an order states table and referencing it from the orders table via a foreign key.

CREATE TABLE `order_states` (
 `id` INT NOT NULL AUTO_INCREMENT,
 `key_name` VARCHAR NOT NULL,
 `created_at` DATE NOT NULL,
 `updated_at` DATE NOT NULL,
 PRIMARY KEY (`id`)
);
INSERT INTO order_states (key_name)
  VALUES('pending', 'paid',...);
ALTER TABLE orders 
ADD CONSTRAINT FK_OrderState 
FOREIGN KEY(state_id)
REFERENCES id(order_states);

This solves some of the previous issues. There is no need to run ALTER on the table when adding new states, and any additional metadata that needs to be stored against the state can be added as a new column in the order_states table. If a state needs to be disabled/removed, it is easy to select orders by state_id, and we also get the protection of valid states via foreign key constraints.

Best Solution

The solution we just implemented has solved most of our problems but we have issues with the visibility of an entities state over time. Because the order is assigned a state via a single FK reference it means we only have information about the current state of the order. If the order is currently in a state of Cancelled there is no easy way for me to tell when it was Paid.

We can solve this by adding another pivot table.

CREATE TABLE `order_state_history` (
 `id` INT NOT NULL AUTO_INCREMENT,
 `order_id` INT unsigned NOT NULL,
 `order_state_id` INT NOT NULL,
 PRIMARY KEY (`id`)
);

In addition to creating this table we will remove the state_id column from the Orders table.

This pivot table works as follows: Any time you wish to change the state of an order you INSERT a new record into this table. You never need to UPDATE any rows here. You will also likely never need to DELETE any rows (unless you are purging records).

The current state of an order is the most recent record in this table (associated with the order_id)

You can get the order with its latest state by using the query below:

SELECT orders.*, order_state_history.*
FROM orders O
JOIN order_state_history OSH1 ON (O.id = OSH1.order_id)
LEFT OUTER JOIN order_state_history OSH2 ON (O.id = OSH2.order_id AND (OSH1.created_at < OSH2.created_at OR (OSH1.created_at = OSH2.created_at AND OSH1.id < OSH2.id)))
WHERE OSH2.id IS NULL;

Drawbacks
It would not be objective if I did not at least point out some of the drawbacks of the above solution.

Additional pivot table required linking Order to State
Selecting data from Orders requires a more complex query (including JOINs)
You will end up having to store additional rows for each order. Worst case is Orders * count(possible_state_values)

Summary

The last solution discussed is more robust because it is INSERT only. We do not need to worry about race conditions.

If we receive a stream of updates, we simply write each record in the order we receive it. It also ensures we have a complete historical record of an order and the states it held.

This makes it trivial to look at the data set and ask questions like:

How many orders did I dispatch between 1/12/2021 and the 31/12/2022
What was the total value of orders refunded during August

These questions would have been impossible to answer with the other solutions because we did not retain the history of state movements - only the current one.

The additional code to be written is minimal, and any performance hit from using JOINs is trivial when using proper indexing.

Any concerns about the additional data stored can easily be rectified by running a simple query to remove unneeded state history items. In pseudocode terms -

Delete all order_state_history items IF created_at > 1 year ago AND
order_state_history_id < MAX order_state_history_id for any given record

Running a query like this would remove historical items (more than one-year-old) but retain the latest state for each Order (as this will remain important)

In conclusion, it is a much better and more complete way of managing the state in your application and should be utilised where plausible.

Top comments (4)

Akash • Apr 21 '22

This was very interesting and much needed. I was looking for an article like this. You should do more of these. It would nice if you can even go into how indexing can be done, in a similar manner this article is written. Is there a robust framework or like a series of specific questions that one can ask himself when designing a db schema? For eg - the questions you mentioned that would be impossible to answer if we hadn't stored history of state movements and perhaps used pivot tables.

Daniel • Apr 24 '22

Thanks Akash ! I will definitely post more on similar topics in the future.

I don’t follow any particular framework when designing tables. I’m sure there are some out there though !

For me I just ensure the data is normalised - and I try to imagine how each data might be accessed in the future.

One general rule (which is applicable above) is to try not to store a calculated value - rather store the events that make up the calculation. For example if I were to make a bank system. If I wanted to get the balance it might be tempting to have a balance column. What is even better is to calculate this based on the sum of transactions.

Akash • Apr 24 '22

but what if that calculated value is being required in few or many places, wont calculating it each time add to the response time? is it still viable to not store a calculated value and rather store a db event? is calculated value being accessed multiple times good enough reason to store it in db? like in your case, calculating the account balance by summing up the transactions that took place, wont that be too intensive cause practically there is no upper limit to the number of transactions that can happen.

Daniel • Apr 24 '22

Absolutely correct. Calculating the stored value takes computing resource. That’s why when searching/filtering you used a stored value cache.

In my specific example you can use something like a statement period. When a new statement is generated you persist the balance. When calculating the current balance you take the stored and sum any transactions since. This prevents any editing of previous financial data. It also means in the future you can even archive historic data.

The benefits are similar to the ones discussed in the article. Write only transactions. Race conditions become impossible. Full audit of an account and it’s balances at any one time. This is another article in itself :)