Discussion on: What is Event Sourcing?

Your bank account

The most common example that you probably already know well is a bank account. It is "event sourced". If your bank account were written like most software -- with destructive updates -- you could see the current balance but have no idea how it got to be that number. Instead, your account is literally a list of transactions (events). The current balance can be calculated at any time by starting from zero and applying all transactions.
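To make the "start from zero and apply all transactions" idea concrete, here is a minimal sketch. The `Transaction` type, the `"deposit"`/`"withdrawal"` kinds, and the sample amounts are made up for illustration, not taken from any real banking system.

```python
from dataclasses import dataclass
from decimal import Decimal
from functools import reduce


@dataclass(frozen=True)
class Transaction:
    kind: str        # "deposit" or "withdrawal" (hypothetical event kinds)
    amount: Decimal


def apply(balance: Decimal, tx: Transaction) -> Decimal:
    """Apply a single transaction to a running balance."""
    if tx.kind == "deposit":
        return balance + tx.amount
    if tx.kind == "withdrawal":
        return balance - tx.amount
    raise ValueError(f"unknown transaction kind: {tx.kind}")


def current_balance(transactions) -> Decimal:
    """Fold over the whole history, starting from zero."""
    return reduce(apply, transactions, Decimal("0"))


ledger = [
    Transaction("deposit", Decimal("100.00")),
    Transaction("withdrawal", Decimal("30.50")),
    Transaction("deposit", Decimal("12.25")),
]

print(current_balance(ledger))      # 81.75
print(current_balance(ledger[:2]))  # 69.50, the balance as of the second transaction
```

Because nothing is overwritten, the same fold over a prefix of the list gives the balance at any earlier point in time.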

Databases use them

Another event sourcing implementation you might already know about is found in many databases. It is called a "journal", a "write-ahead log", or a "transaction log". The database uses it internally to recover from crashes or replicate data. Many kinds of replication are based on transmitting this event log to replicas and having them apply it.

Event Sourcing, recently

So the concept has been around a while and is sometimes called log-based storage. But the idea behind Event Sourcing is to record any kind of significant business event, then use those events as the source of truth for application data. For example, we do computer-based training. Significant events for us are when a trainee registers for a course, starts the course material, submits a test for grading, successfully completes course requirements, etc. That way we have a complete audit trail of how the student's course came to be completed. We can also create new (but historical) reports to answer questions we didn't think about until later, like "How long are students taking on each course?"
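As a rough sketch of that kind of after-the-fact report, the function below computes the average time from start to completion per course, purely from stored events. The event names (`CourseStarted`, `CourseCompleted`), the dictionary fields, and the sample data are assumptions for illustration, not our actual schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical event history; in practice this would be read from the event log.
events = [
    {"type": "CourseStarted", "student": "s1", "course": "safety-101",
     "at": datetime(2018, 8, 1, 9, 0)},
    {"type": "CourseCompleted", "student": "s1", "course": "safety-101",
     "at": datetime(2018, 8, 1, 11, 0)},
    {"type": "CourseStarted", "student": "s2", "course": "safety-101",
     "at": datetime(2018, 8, 2, 9, 0)},
    {"type": "CourseCompleted", "student": "s2", "course": "safety-101",
     "at": datetime(2018, 8, 2, 13, 0)},
]


def average_time_per_course(events):
    """Pair each start with its completion and average the durations per course."""
    started = {}                    # (student, course) -> start time
    durations = defaultdict(list)   # course -> list of timedeltas
    for e in events:
        key = (e["student"], e["course"])
        if e["type"] == "CourseStarted":
            started[key] = e["at"]
        elif e["type"] == "CourseCompleted" and key in started:
            durations[e["course"]].append(e["at"] - started.pop(key))
    return {c: sum(ds, timedelta()) / len(ds) for c, ds in durations.items()}


print(average_time_per_course(events))
```

The report did not have to exist when the events were recorded; it only needs the history to still be there.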

Event Sourcing typically does not exist in a vacuum. Certain kinds of common queries, such as uniqueness checks, are really expensive for log-based storage. So you typically need two kinds of data models. Any changes to the system (writes) are recorded in the event log. Then each event is applied against relational tables, a key-value store, etc., which are used for reads. Both write and read models can live in the same database. For example, we currently use Postgres for both.
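Here is a small in-memory sketch of the two-model idea, using the uniqueness check as the example. `EventLog`, `RegisteredEmails`, `register`, and the `TraineeRegistered` event are hypothetical names. In a real system the log append and the read-model update would need to share a transaction (or the read side would have to tolerate lag), which this sketch glosses over.

```python
class EventLog:
    """Write model: an append-only list of events (the source of truth)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def __iter__(self):
        return iter(self._events)


class RegisteredEmails:
    """Read model: a set that makes the uniqueness check a cheap lookup."""

    def __init__(self):
        self.emails = set()

    def apply(self, event):
        if event["type"] == "TraineeRegistered":
            self.emails.add(event["email"].lower())


def register(log, read_model, email):
    """Record a registration unless the read model says the email is taken."""
    if email.lower() in read_model.emails:
        return False
    event = {"type": "TraineeRegistered", "email": email}
    log.append(event)          # 1. write to the event log
    read_model.apply(event)    # 2. apply the event to the read model
    return True


log, emails = EventLog(), RegisteredEmails()
print(register(log, emails, "pat@example.com"))  # True
print(register(log, emails, "PAT@example.com"))  # False, already registered

# The read model is disposable: it can be rebuilt by replaying the log.
rebuilt = RegisteredEmails()
for event in log:
    rebuilt.apply(event)
```

Without the read model, the same check would mean scanning every event in the log on each registration.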

Having two models of data may seem like a lot of overhead. However, I have found that the idea of having a single model of data rarely matches the reality of how it is used. You need to know certain things for reads that are not important for writes and vice versa. For example, a full-text search index may be important for reads, but the business logic has no need of it. Even when you try to serve two masters with a single model, success usually means you end up with a second database for reporting, because reports can be really expensive to generate against normalized operational data. Also, the reporting database is usually behind the operational data by hours or days. Event Sourcing forces you to address these issues up front. Plus I have found that having an event log as your source of truth gives a lot of architectural flexibility.

Mihail Malo • Aug 16 '18

What's a good event store on a major public cloud provider?

Kasey Speakman • Aug 16 '18 • Edited

Assuming you mean distributed/scalable, I have not found this unicorn as yet. Maybe we need to develop it. :)

For further information, here is an answer I gave on Stack Overflow about Kafka as an event store. Note that the EventStore product mentioned below is open-source and they will provide a tuned AMI for AWS if you pay for support.

It seems that most people roll their own event storage implementation on top of an existing database. For non-distributed scenarios, like internal back-ends or stand-alone products, it is well-documented how to create a SQL-based event store. And there are libraries available on top of various kinds of databases. There is also EventStore, which is built for this purpose.

In distributed scenarios, I've seen a couple of different implementations. Jet's Panther project uses Azure CosmosDB, with the Change Feed feature to notify listeners. Another similar implementation I've heard about on AWS uses DynamoDB with its Streams feature to notify listeners. The partition key should probably be the stream id for best data distribution (to lessen the amount of over-provisioning). However, a full replay across streams in Dynamo is expensive (both in reads and in cost). So this implementation was also set up to have Dynamo Streams dump events to S3. When a new listener comes online, or an existing listener wants a full replay, it reads S3 to catch up first.
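The catch-up flow can be sketched without any cloud SDK. Everything here is hypothetical: `archive` stands in for the events dumped to S3, `live_feed` for the stream notifications, and the global `seq` number on each event is an assumption I am adding so the listener can skip events it has already seen (Dynamo does not hand you one for free).

```python
class Listener:
    """Tracks the last sequence number processed so replays are idempotent."""

    def __init__(self):
        self.position = 0   # last `seq` handled (assumed global ordering)
        self.seen = []      # sequence numbers processed, in order

    def handle(self, event):
        if event["seq"] <= self.position:
            return          # already processed; ignore the duplicate
        self.seen.append(event["seq"])
        self.position = event["seq"]


def catch_up(listener, archive, live_feed):
    """Replay the cheap archive first, then switch over to live events."""
    for event in archive:
        listener.handle(event)
    for event in live_feed:
        listener.handle(event)


archive = [{"seq": 1}, {"seq": 2}, {"seq": 3}]
live_feed = [{"seq": 3}, {"seq": 4}]   # overlaps the archive at seq 3

listener = Listener()
catch_up(listener, archive, live_feed)
print(listener.seen)  # [1, 2, 3, 4]
```

The overlap between archive and live feed is deliberate: handing over between the two sources is much easier if the listener tolerates duplicates than if the cut-over has to be exact.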

My current project is a multi-tenant scenario, and I rolled my own on top of Postgres. Something like Citus seems appropriate for scalability, partitioning by tenant+stream.

Ark Shraier • Aug 14 '18

Very good example with bank account, thanks!

Ambrose Little • Aug 14 '18

Thanks for calling out that such an architecture makes explicit up front what many folks defer to later (or are simply surprised to find, if not very experienced). Coupled with a data pipeline, this approach is a proactive way to deal with the varied demands on data while maintaining the single source of truth.

I'm also dealing with a hybrid system that, mostly due to legacy, more or less still treats the relational DB as the acting source of truth, while also (now) logging to an event log (the formal source). That sort of inverts the order of events, perhaps, but it also helps avoid the eventual-consistency problem from a UX point of view (i.e., the operation is not complete until the RDBMS is updated AND the event is logged; then it is handed off to a pipeline). This is of course for those things that are not anticipated to be asynchronous from a user point of view.
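One way to get "not complete until the RDBMS is updated AND the event is logged" is to put both writes in a single transaction. This is only a sketch under a big assumption: that the event log lives in the same database as the relational tables. It uses SQLite from the Python standard library as a stand-in, and the `enrollment` and `event_log` tables and the `complete_course` function are made-up names. If the log is a separate system, one local transaction cannot cover both writes.

```python
import json
import sqlite3


def open_db():
    """Create an in-memory database with a state table and an event log table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE enrollment ("
        " student TEXT NOT NULL, course TEXT NOT NULL, status TEXT NOT NULL,"
        " PRIMARY KEY (student, course))"
    )
    conn.execute(
        "CREATE TABLE event_log ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " type TEXT NOT NULL, data TEXT NOT NULL)"
    )
    return conn


def complete_course(conn, student, course):
    """Update the relational row and append the event atomically."""
    payload = json.dumps({"student": student, "course": course})
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "INSERT OR REPLACE INTO enrollment (student, course, status)"
            " VALUES (?, ?, ?)",
            (student, course, "completed"),
        )
        conn.execute(
            "INSERT INTO event_log (type, data) VALUES (?, ?)",
            ("CourseCompleted", payload),
        )
    # Only after the commit would the event be handed off to the pipeline.


conn = open_db()
complete_course(conn, "s1", "safety-101")
```

The user-facing operation returns only after both writes have committed, so the UI never shows state that the event log does not yet know about.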