Ark Shraier

What is Event Sourcing?

Top comments (7)

Dian Fay

Think about states. A traditionally designed database represents the last known state of a system: Jane was admitted to hospital on this day, for these reasons, is staying in room 119, was seen by doctor so-and-so at 11am.

Certain states are consistent: for example, one visit with a doctor is much like another visit with a doctor. In a traditional database, you'd represent this with a junction table relating this doctor to that patient at this time. Retrieving junction and related records gives you a timeline of visits. What it doesn't give you is context: all you know from querying the visits table is that Jane was (presumably) in the hospital some time before her first visit and some time after her last. Admittance and discharge are external; prescription updates are external; all kinds of things that would be very useful to know about in the context of visit history are somewhere else, and difficult or impossible to aggregate.

Besides Jane herself, there's one important aspect of each data point I mentioned: it happened at a certain time. Jane was admitted; Jane saw a doctor; Jane was allocated a room; Jane received a new prescription; Jane saw another or the same doctor; Jane was discharged. Event sourcing correlates the identifier (here, the number printed on Jane's wristband) with the time something happened. Each thing that happened during Jane's stay is an event: there's an admittance event, a visit event, a room allocation event, a prescription event, another visit event, a discharge event.

Each event includes some relevant information, with a visit being tied to the doctor, the prescription to the medication info, and so on. As an event is processed, that information may be materialized into tables which reflect the current state (eg a rooms table which now shows that #119 is occupied). But the events table is the source of truth, and the state tables can be completely deleted and reconstituted simply by processing all the events in order again. That's the "sourcing" part. Also, it can be useful to simply query all events relating to Jane to get a complete history of her stay in one place.
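
Roughly, the replay looks like this (a minimal Python sketch; the event and field names are invented for illustration):

```python
# Sketch: the events list is the source of truth; "current state" tables
# (plain dicts here) are derived by replaying the events in order.
events = [
    {"type": "PatientAdmitted",   "patient": "wristband-42", "at": "2018-03-01T09:00"},
    {"type": "RoomAllocated",     "patient": "wristband-42", "room": "119", "at": "2018-03-01T09:30"},
    {"type": "DoctorVisited",     "patient": "wristband-42", "doctor": "Dr. So-and-so", "at": "2018-03-01T11:00"},
    {"type": "PatientDischarged", "patient": "wristband-42", "at": "2018-03-03T10:00"},
]

def rebuild_rooms(events):
    """Replay every event to reconstitute the current room occupancy."""
    rooms = {}
    for e in events:
        if e["type"] == "RoomAllocated":
            rooms[e["room"]] = e["patient"]
        elif e["type"] == "PatientDischarged":
            rooms = {room: p for room, p in rooms.items() if p != e["patient"]}
    return rooms

print(rebuild_rooms(events))   # {} -- room 119 was occupied, then freed on discharge

# Jane's complete history is just a filter over the same log:
history = [e for e in events if e["patient"] == "wristband-42"]
```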

Example may or may not have been shamelessly stolen from Mat McLoughlin's talk at NDC Oslo 2017 :)

Ark Shraier

Thanks a lot for the answer and the link.
Recently I found a good video on this topic and Domain-Driven Design in Ruby: RubyC-2018 / Andrzej Krzywda, "From Post.create to PostPublished.new".

Kasey Speakman • Edited

Your bank account

The most common example that you probably already know well is a bank account. It is "event sourced". If your bank account were written like most software -- with destructive updates -- you could see the current balance but have no idea how it got to be that number. Instead, your account is literally a list of transactions (events). The current balance can be calculated at any time by starting from zero and applying all transactions.
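
In code, that is just a fold over the transaction list (a sketch with made-up event names and amounts):

```python
# Sketch: the account *is* its list of transactions; the balance is derived
# by starting from zero and applying each one in order.
transactions = [
    {"type": "Deposited", "amount": 100_00},   # amounts in cents
    {"type": "Withdrew",  "amount":  25_00},
    {"type": "Deposited", "amount":  60_50},
]

def current_balance(transactions):
    balance = 0
    for tx in transactions:
        if tx["type"] == "Deposited":
            balance += tx["amount"]
        elif tx["type"] == "Withdrew":
            balance -= tx["amount"]
    return balance

print(current_balance(transactions))   # 13550, i.e. $135.50
```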

Databases use them

Yet another event sourcing implementation you might already know about is found in many databases. It is called a "journal" or a "write-ahead log" or "transaction log". The database uses it internally to recover from crashes or replicate data. Many kinds of replication are based on transmitting this event log to replicas and having them apply it.

Event Sourcing, recently

So the concept has been around a while and is sometimes called log-based storage. But the idea behind Event Sourcing is to record any kind of significant business event. Then use these as the source of truth for application data. For example, we do computer-based training. So significant events for us are when a trainee registers for a course, starts the course material, submits a test for grading, successfully completes course requirements, etc. So that way we have a complete audit trail of how it came to be that the student's course was completed. We can also create new (but historical) reports to answer questions we didn't think about until later. Like "How long are students taking on each course?"
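
For instance, a question like that can be answered straight from the log, even if the log predates the question (sketch; event names and timestamps are invented):

```python
# Sketch: a "new but historical" report computed directly from the event log.
from datetime import datetime

events = [
    {"type": "CourseStarted",   "trainee": "t1", "course": "c101", "at": "2018-04-02T09:00"},
    {"type": "TestSubmitted",   "trainee": "t1", "course": "c101", "at": "2018-04-02T10:30"},
    {"type": "CourseCompleted", "trainee": "t1", "course": "c101", "at": "2018-04-02T11:15"},
]

def time_on_course(events, trainee, course):
    """How long did this trainee take on this course?"""
    times = {e["type"]: datetime.fromisoformat(e["at"])
             for e in events
             if e["trainee"] == trainee and e["course"] == course}
    return times["CourseCompleted"] - times["CourseStarted"]

print(time_on_course(events, "t1", "c101"))   # 2:15:00
```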

Event Sourcing typically does not exist in a vacuum. Certain kinds of common queries are really expensive for log-based storage. For example, uniqueness checks. So you typically need two kinds of data models. Any changes to the system (writes) are recorded in the event log. Then the event is applied against relational tables, key-value store, etc. which are used for reads. Both write and read models could be in the same database. For example, we use Postgres for both currently.
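
A rough sketch of that split (invented names; in our case both sides live in Postgres, but plain dicts stand in for the tables here):

```python
# Sketch: writes go to the append-only event log (the source of truth),
# then each event is projected into a read model optimized for queries.
event_log = []        # write model (append-only)
course_status = {}    # read model: current status per (trainee, course)

def project(event):
    """Fold one event into the read-side tables."""
    key = (event["trainee"], event["course"])
    if event["type"] == "TraineeRegistered":
        course_status[key] = "registered"
    elif event["type"] == "CourseCompleted":
        course_status[key] = "completed"

def record(event):
    """Write side: append the fact, then update the read side."""
    event_log.append(event)
    project(event)

record({"type": "TraineeRegistered", "trainee": "t1", "course": "c101"})
record({"type": "CourseCompleted",   "trainee": "t1", "course": "c101"})

# The read model is disposable: it can be rebuilt at any time by
# re-running project() over the whole event_log.
print(course_status)   # {('t1', 'c101'): 'completed'}
```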

Having two models of data may seem like a lot of overhead. However, I have found that the idea of having a single model of data rarely matches the reality of how it is used. You need to know certain things for reads that are not important for writes and vice versa. For example, a full-text search index may be important for reads, but the business logic has no need of it. Even when you try to serve two masters with a single model, success means you usually end up with a second database for reporting. Because reports can be really expensive to generate against normalized operational data. Also the reporting database is usually a bit behind the operational data by hours or days. Event Sourcing forces you to address these issues up front. Plus I have found that having an event log as your source of truth gives a lot of architectural flexibility.

Ambrose Little

Thanks for calling out that such an architecture makes explicit up front what many folks defer to later (or are simply surprised to find, if not very experienced). Coupled with a data pipeline, this approach is a proactive way to deal with the varied demands on data while maintaining the single source of truth.

I'm also dealing with a hybrid system that, mostly due to legacy, more or less still treats the relational DB as the acting source of truth, while also (now) logging to an event log (the formal source). That perhaps inverts the order of events, but it also helps avoid the problem of eventual consistency from a UX point of view (i.e., the operation is not complete until the RDBMS is updated AND the event is logged; it is then handed off to a pipeline). This is, of course, for those operations that are not expected to be asynchronous from the user's point of view.

Ark Shraier

Very good example with bank account, thanks!

Mihail Malo

What's a good event store on a major public cloud provider?

Kasey Speakman • Edited

Assuming you mean distributed/scalable, I have not found this unicorn as yet. Maybe we need to develop it. :)

For more detail, here is part of an answer I gave on StackOverflow about Kafka as an event store. Note that the EventStore product mentioned below is open source, and they will provide a tuned AMI for AWS if you pay for support.


It seems that most people roll their own event storage implementation on top of an existing database. For non-distributed scenarios, like internal back-ends or stand-alone products, it is well documented how to create a SQL-based event store. And there are libraries available on top of various kinds of databases. There is also EventStore, which is built for this purpose.

In distributed scenarios, I've seen a couple of different implementations. Jet's Panther project uses Azure CosmosDB, with the Change Feed feature to notify listeners. Another similar implementation I've heard about on AWS uses DynamoDB with its Streams feature to notify listeners. The partition key should probably be the stream id for the best data distribution (to lessen the amount of over-provisioning). However, a full replay across streams in Dynamo is expensive (both in reads and in cost), so this implementation was also set up to have Dynamo Streams dump events to S3. When a new listener comes online, or an existing listener wants a full replay, it reads S3 to catch up first.

My current project is a multi-tenant scenario, and I rolled my own on top of Postgres. Something like Citus seems appropriate for scalability, partitioning by tenant+stream.
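
For a sense of what "rolling your own" on a SQL database looks like, here is a rough sketch of the basic table shape and operations (sqlite3 is used only to keep the example self-contained; the same shape works on Postgres, the column names are my own, and a real implementation adds richer concurrency handling, metadata, and change notifications):

```python
# Rough sketch of a roll-your-own SQL event store: one append-only table,
# ordered globally and per stream.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE events (
        position  INTEGER PRIMARY KEY AUTOINCREMENT,  -- global order, for full replays
        stream_id TEXT    NOT NULL,                    -- e.g. tenant + aggregate id
        version   INTEGER NOT NULL,                    -- per-stream order
        type      TEXT    NOT NULL,
        data      TEXT    NOT NULL,                    -- JSON payload
        UNIQUE (stream_id, version)                    -- rejects concurrent writers
    )
""")

def append(stream_id, expected_version, event_type, data):
    """Optimistic concurrency: fails if another writer already took this version."""
    db.execute(
        "INSERT INTO events (stream_id, version, type, data) VALUES (?, ?, ?, ?)",
        (stream_id, expected_version + 1, event_type, json.dumps(data)),
    )

def read_stream(stream_id):
    rows = db.execute(
        "SELECT type, data FROM events WHERE stream_id = ? ORDER BY version",
        (stream_id,),
    )
    return [(t, json.loads(d)) for t, d in rows]

append("account-1", 0, "Deposited", {"amount": 100})
append("account-1", 1, "Withdrew",  {"amount": 40})
print(read_stream("account-1"))
```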