Some thoughts on what event sourcing might be, and what it brings to the table.
To be able to write this blog without getting stranded too long on what event sourcing 'really' is, I will rely on a trick with words. Instead of talking about 'real' event sourcing, I'll quickly go through what event sourcing in the 'weak' sense could mean, and what event sourcing in the 'strong' sense could be. As is often the case with this trick, most systems in the wild are probably somewhere in between.
It's unlikely event sourcing in the 'weak' sense will be labeled as such by the people using it. Likewise, event sourcing in the 'strong' sense is so strict that it's unlikely a real system hasn't used a shortcut somewhere. The distinction still helps for setting up the story, and for highlighting the differences once we get to the advantages of event sourcing in the 'strong' sense.
Event sourcing in the weak sense takes a very literal approach. Basically, all we need are some events that are stored somewhere for other systems to use. The system producing this stream might not even consume it itself, which means inconsistencies may occur. The stream might not even contain all the events since the service was started. An event also doesn't need to be a business event; it might be something technical, like the update of some row in a SQL database. The events might also be incomplete: there might be related events that are not available in the same way.
One component that makes it easy to implement such a thing is Debezium. With it we can configure one or multiple tables or collections, depending on the database used, and have all the changes to those available as messages in Kafka.
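To make the weak sense concrete, here is a small sketch of what a consumer of such a change message sees. The "before"/"after"/"op" envelope loosely follows Debezium's general message shape, but the exact fields depend on the connector configuration, and the customer row itself is made up for illustration:

```python
import json

# Illustrative Debezium-style change message for a customer row.
# The envelope fields ("op", "before", "after") mirror Debezium's
# general shape; the row contents are invented for this example.
message = json.dumps({
    "op": "u",  # update
    "before": {"id": 42, "name": "Jane Doe", "blocked": False},
    "after":  {"id": 42, "name": "Jane Doe", "blocked": True},
})

change = json.loads(message)
# All a consumer can tell is which columns changed; the business
# reason ("why was this customer blocked?") is not part of the event.
changed = {
    key: (change["before"][key], change["after"][key])
    for key in change["after"]
    if change["before"][key] != change["after"][key]
}
print(changed)  # {'blocked': (False, True)}
```

Note that downstream systems only get the row-level diff, which becomes relevant in the cases below.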
The most important difference between event sourcing in the strong sense and event sourcing in the weak sense is that the events should be the single source of truth. To generate a new event, a 'command' is issued to the system, and depending solely on the past events, this might generate new events.
One way to work in this manner in practice is by using the Axon Framework. For this to work the events need to be immutable, and we need some way to quickly retrieve all the events related to a certain entity. Based on the past events we can then determine whether the command is valid or not. There are some complexities with such a system, for example when we need to coordinate between different entities, but I won't go into them in this blog.
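The command-to-event flow can be sketched without any framework. This is a minimal illustration of the idea, not Axon Framework API; all class and function names here are invented for the example:

```python
from dataclasses import dataclass

# Strong-sense sketch: state is rebuilt purely from past events,
# and a command is validated against that state before any new
# events are emitted.

@dataclass(frozen=True)  # events are immutable
class CustomerBlockedEvent:
    customer_id: int
    reason: str

@dataclass(frozen=True)
class BlockCustomerCommand:
    customer_id: int
    reason: str

def is_blocked(events):
    """Rebuild just enough state from the past events."""
    return any(isinstance(e, CustomerBlockedEvent) for e in events)

def handle(command, past_events):
    """Decide on new events based solely on the past events."""
    if is_blocked(past_events):
        raise ValueError("customer is already blocked")
    return [CustomerBlockedEvent(command.customer_id, command.reason)]

new_events = handle(BlockCustomerCommand(42, "inactive"), past_events=[])
print(new_events[0].reason)  # inactive
```

The key property is that `handle` takes no other input than the command and the past events, which is what makes the events the single source of truth.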
Let's say we have a payment provider. They wanted to be quick to market, so they did not spend a lot of thought on designing the architecture. Let's call it 'PEF', for 'pay easy and fast'. In PEF most services are simple REST-based services, secured, and allowing all the well-known CRUD operations when the caller has sufficient rights. As always with these things, any resemblance to a real company is coincidental.
PEF uses a microservice setup. A lot of these services need access to PEF's customer information, for example to know the bank account number of a customer. Because the customer service was swamped with REST calls as PEF gained more customers, they decided to start using event sourcing in the weak sense for the customer information. By using Debezium, some customer information became available in an asynchronous manner. This way another service that needs the bank account number for a customer id can store and update this information in its own database, preventing a lot of REST calls to the customer service.
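Such a consuming service essentially maintains a small projection of the customer data. Here is a sketch, with an invented message shape and no real Kafka client, of how the local lookup table could be kept up to date:

```python
# Sketch of a projection: the consuming service keeps its own lookup
# of bank account numbers by applying customer update messages,
# instead of calling the customer service over REST each time.
# The message shape is illustrative.

local_store = {}  # customer_id -> bank account number

def on_customer_update(message):
    after = message["after"]
    local_store[after["id"]] = after["iban"]

# Simulate two messages arriving from the customer topic;
# the later update overwrites the earlier value.
on_customer_update({"after": {"id": 42, "iban": "NL91ABNA0417164300"}})
on_customer_update({"after": {"id": 42, "iban": "NL69INGB0123456789"}})

print(local_store[42])  # NL69INGB0123456789
```

Since the stream is asynchronous, this local copy is eventually consistent with the customer service's database.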
Note that what is available is just customer updates, as the changes are made to the database. For example, there is no 'CustomerBlocked' event with some context. If a customer is blocked, it might just be an update where the property 'blocked' is set to true.
I will now go through a few cases where the differences between event sourcing in the weak sense and event sourcing in the strong sense become more evident. As PEF continues to grow, data is needed in ways that were not previously accounted for. Each of these cases presents a challenge to PEF, and depending on how things are set up, these challenges might be easy or impossible to solve.
After some time one of the customers reaches out asking why his account is blocked. The person researching this issue doesn't have direct access to the customer service database. From the messages available via Kafka he eventually finds out the customer was blocked three weeks ago. Unfortunately, the updated record just says he was blocked from that time, but doesn't include any information regarding the reason. After reaching out to the customer service team, it turns out the reason is not stored in the database either, but it is logged at info level. Logging is searchable, but only for the past two weeks. To get logs from further back, a request needs to be made to the SRE team to retrieve the stored logs. After some back and forth they manage to get the relevant logging, and it turns out the customer was blocked for being inactive. Happy to have finally found the answer, this can be fed back to the customer, with some instructions on how to reactivate his account and what he can do to prevent the same thing from happening again.
If we had used event sourcing in the strong sense, it would probably have been much easier to find out the cause. By the nature of event sourcing it would be easy to retrieve all the events concerning one customer. Instead of having to search for an update where a property was set from 'false' to 'true', we would likely be able to see something like a 'CustomerBlockedEvent'. This event would contain the relevant additional data, like the reason, and when it happened.
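The investigation with a strong-sense event store could then look like the sketch below: filter the events for one customer and read the reason straight from the event. The event shapes and dates are invented for illustration:

```python
# With proper business events, the support question "why was customer 42
# blocked?" becomes a simple filter over the event store, instead of a
# hunt through database diffs and archived logs.

event_store = [
    {"type": "CustomerCreatedEvent", "customer_id": 42, "at": "2023-01-05"},
    {"type": "CustomerBlockedEvent", "customer_id": 42,
     "reason": "inactive for 12 months", "at": "2023-06-01"},
    {"type": "CustomerCreatedEvent", "customer_id": 7, "at": "2023-02-11"},
]

customer_events = [e for e in event_store if e["customer_id"] == 42]
blocked = next(e for e in customer_events
               if e["type"] == "CustomerBlockedEvent")
print(blocked["reason"], blocked["at"])  # inactive for 12 months 2023-06-01
```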
Via the marketing department, PEF wants to build some models of the typical behavior of customers. They are aware that, for the period they want to run their analysis on, all the customer updates are available via Kafka. They want to correlate when customers are onboarded with which actions are executed in the web UI. Unfortunately, it turns out that to know what a customer did in the web UI they can only use the access logging. This is quite a struggle, as the customer ids are not readable with each request made, because JWT is used. So in order to correlate a certain call with a certain customer, they have to decode the JWT and extract the customer id. Because this is a different stream of information, a lot of the work done for getting the customer information itself can't be reused.
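The decoding step itself is not hard, just tedious to bolt onto every analysis. A JWT payload is base64url-encoded JSON, so the id can be pulled out of the middle segment. The claim name `customer_id` is an assumption for this sketch, and signature verification is deliberately skipped, which is fine for offline log analysis but not for authentication:

```python
import base64
import json

def customer_id_from_jwt(token):
    # A JWT is "header.payload.signature"; the payload is
    # base64url-encoded JSON. No signature check is done here.
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["customer_id"]  # claim name assumed for this sketch

# Build an unsigned token just to demonstrate the decoding step.
claims = base64.urlsafe_b64encode(
    json.dumps({"customer_id": 42}).encode()).rstrip(b"=").decode()
token = f"header.{claims}.signature"
print(customer_id_from_jwt(token))  # 42
```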
With event sourcing we would have more uniform information to work with, and it would likely be a lot easier for the data scientists to build the model. Also, since more information is available as proper events, it might be a lot easier to research similar things in the future.
Since PEF is a financial institution, some strict laws apply. One of these is that if customer data is changed, it should always be clear who made the change and why. For example, an address might be changed because the customer has moved. This change might be made from the web UI by the customer directly, or indirectly by calling customer support. As it turned out, the current customer service wasn't compliant with these rules, and had to be updated soon at the risk of losing the license.
This meant the customer service team had to go through the code base and add logging with the relevant details to become compliant. The team also added an item to the 'Definition of Done', so that whenever anything is added in the future that updates customer information, the relevant logging will also be there.
With event sourcing such nonfunctional requirements are easier to handle in a generic way. For example, we could have something like a 'CustomerContext' which is part of all commands related to the customer. We then include this context in the resulting events, to make sure it is stored with them. It should not be that hard to add this additional information later on.
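A sketch of that idea: every command carries the context (who made the change and why), and the command handler copies it onto each resulting event in one generic spot. The 'CustomerContext' name comes from the text above; everything else is invented for the example:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CustomerContext:
    changed_by: str   # e.g. "customer", or a support-agent id
    reason: str

@dataclass(frozen=True)
class ChangeAddressCommand:
    customer_id: int
    new_address: str
    context: CustomerContext

def handle(command):
    # The one generic spot where the audit requirement is enforced:
    # every resulting event gets the command's context attached.
    return [{"type": "AddressChangedEvent",
             "customer_id": command.customer_id,
             "address": command.new_address,
             "context": command.context}]

cmd = ChangeAddressCommand(42, "1 Main St",
                           CustomerContext("customer", "moved house"))
event = handle(cmd)[0]
print(event["context"].changed_by, event["context"].reason)
```

Because the events are the source of truth and are stored anyway, the audit trail comes along for free once the context travels with them.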
While event sourcing in the strong sense might seem too complex at first glance, it might actually make things easier later on. By using certain libraries or frameworks it might not actually be much more complex than event sourcing in the weak sense. As such, when designing a new architecture or service, I think it's something that should at least be considered.
I know some of these things might be controversial, so please feel free to discuss.