guto

Posted on Apr 15, 2024

Meet Datomic: the immutable and functional database.

#clojure #database #functional #programming

"When you combine two pieces of data you get data. When you combine two machines you get trouble." - Rich Hickey presenting The Functional Database.

Functional programming concepts, especially immutability, are increasingly present in our daily work. It is only fair, then, to get to know a database whose philosophy is the immutability of data and which keeps track of facts in a way quite different from what we are used to.

In this article we will get to know Datomic, which takes its name from the way it models data, in a format slightly different from the conventional one. It seeks to bring immutability down to the database level, with a functional approach designed to work well with distributed systems.

What is Datomic?
Architecture
Data structure
How does a transaction work?
Conclusion

What is Datomic?

At the beginning of 2012, the Relevance team (which later merged with Metadata Partners to form Cognitect), together with Rich Hickey, launched Datomic, which they had been working on since 2010. The main motivation was to move a substantial part of the power held by database servers to application servers, so that programmers would have more power over data within the application logic.

Datomic Cloud was released in early 2018 using Amazon's components:

DynamoDB, EFS, EBS and S3 as storage services;
CloudFormation for deployment;
AWS CloudWatch for logging, monitoring, and metrics.

Cognitect (the company previously responsible for developing Datomic) was acquired by Nubank in 2020, and Nubank announced in April 2023 that the Datomic binaries are publicly available and free to use (this means that its Pro version is now free to use).

Written in Clojure, Datomic works a little differently from the databases we are used to: it manages data but does not store it itself. We will go into more detail in the architecture section, but in short, Datomic can use several other storage services, even other databases, to hold its transactions, which can result in a nice combination.

Concept

The main operating concept of Datomic is that data is immutable as a whole. An analogy that helps to picture how it works:

Imagine a database that you are used to working with, such as PostgreSQL or MySQL;
Now imagine that you have two tables: a product table and a log table that stores every modification made to the product table;
When you update an item, its data is modified in the product table, and its previous value is added to the log table together with a note of what was done (in our case, an update of its value);
For Datomic, there would be only the product "table" (in our case, a schema) with an additional column indicating whether that row is currently true. When we update the product value, a new row is added with the new value, and the old row has its check column set to false, since it is no longer true at the current time.

This means that past data continues to exist, together with a marker that indicates whether the value of the product is still valid. These rows are called facts, so whenever the word "fact" is mentioned, remember this analogy.

One point to highlight: no matter how many times the value of the product changes, each old value remains a fact. It is no longer valid at the current time, but it is still something that happened: the product did have that value in the past.
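
To make the analogy concrete, here is a minimal sketch in plain Clojure. It does not use the Datomic API: the names `update-price` and `current-price`, the `:valid?` key and the sample product are assumptions made up for illustration, with `:valid?` playing the role of the "check column".

```clojure
;; A plain-Clojure model of the analogy; this is NOT the Datomic API.
;; Each fact is a map, and :valid? plays the role of the "check column".

(defn update-price
  "Returns a new fact log in which `product` has `new-price`.
   The previously valid price fact (if any) is kept, but marked as no longer valid."
  [facts product new-price]
  (conj (mapv (fn [fact]
                (if (and (= product (:product fact)) (:valid? fact))
                  (assoc fact :valid? false)
                  fact))
              facts)
        {:product product :price new-price :valid? true}))

(defn current-price
  "Looks up the price that is valid right now for `product`."
  [facts product]
  (->> facts
       (filter #(and (= product (:product %)) (:valid? %)))
       first
       :price))

;; Two updates in a row: the first price stays in the log as a past fact.
(def facts
  (-> []
      (update-price "notebook" 10)
      (update-price "notebook" 12)))

(println facts)
(println (current-price facts "notebook"))
```

Nothing is ever removed from `facts`: the old price is still there, only marked as no longer valid. Datomic itself goes one step further and records the retraction as a separate fact instead of flagging the old row, as the Datoms section shows.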

Architecture

To better understand how everything works within Datomic, we first need to better understand how its architecture works.

How can I store data?

As previously mentioned, Datomic does not store data the way the databases we are used to do; it is used mainly to "transact and manage data". You can combine Datomic with different storage services, such as:

SQL databases (such as PostgreSQL and MySQL);
DynamoDB (if you choose Datomic Cloud, the indexes are stored in S3, the transaction log in DynamoDB, and the cache in EFS);
Cassandra and Cassandra2;
Dev mode for local development (the separate mem mode keeps data in memory only).

How it works is very simple: depending on your choice, a table is created within your storage service to hold all the data, which Datomic then manages.

Peers

Every interaction an application has with Datomic goes through a Peer, which is responsible for a small amount of cache control, assembling and executing queries, fetching index data, and sending commits. Peers can be used in two main ways:

Peer Library: a library added to your dependencies that runs inside your application; every action against your database goes through it;
Peer Server: used mainly with Datomic in Cloud format; your application has only a Client library, which communicates with the Peer Server, and the Peer Server performs the direct actions with Datomic just as the Peer Library would.

Peers work with a given database value for an extended period of time without concern. These values are immutable and provide a stable, consistent view of data for as long as a peer needs one, functioning as a kind of "snapshot" of the database that lets data be returned in a practical way without overloading the system with multiple real-time queries. This is a marked difference from relational databases, which require work to be done quickly using a short-lived transaction. You can read more about how Peers work in the official documentation.

An important point to highlight: if you are using the Peer Library, every node and every service of your application has a Peer running alongside it. In a distributed system, this means multiple Peers are sending commits to your database. Controlling data consistency is extremely important in that setting: if multiple services send writes in parallel, how can we guarantee that the data stays correct and that no race condition slips through? To answer that, let's look at what the Transactor is.

Transactor

The Transactor is responsible for transacting the received commits and storing data within our database. The architecture of an application with Datomic is characterized by multiple Peers sending commits (after all, we will have multiple services) but only a single Transactor, which guarantees data consistency. No matter how many commits per second the Transactor receives, they are all queued and processed one at a time.

The main reason for this architecture is that handling concurrent writes, that is, allowing multiple pieces of data to be stored in parallel, could seriously compromise the consistency of the stored data, especially in a distributed system.
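
The "many senders, one writer" idea can be sketched in plain Clojure with an agent, which applies the functions sent to it one at a time. This is only an analogy for the queueing behaviour, not how the Transactor is implemented; `transactor`, `commit` and the `:counter/value` attribute are hypothetical names chosen for the example.

```clojure
;; Plain-Clojure sketch of "many senders, one writer"; not Datomic code.
;; An agent applies the functions sent to it one at a time, in a queue,
;; which mirrors the role the single Transactor plays for commits.

(def transactor (agent {:log [] :next-tx 1}))

(defn commit
  "Runs inside the agent: appends `facts` to the log under the next tx number."
  [state facts]
  (-> state
      (update :log conj {:tx (:next-tx state) :facts facts})
      (update :next-tx inc)))

;; Eight "peers" submit commits from separate threads.
(let [peers (mapv (fn [n] (future (send transactor commit [[:counter/value n]])))
                  (range 8))]
  (run! deref peers)     ; wait until every commit has been submitted
  (await transactor))    ; wait until the queue has been drained

;; The tx numbers come out gapless and in order, whatever the thread timing.
(println (mapv :tx (:log @transactor)))

;; Lets a standalone script exit promptly; skip this line in a REPL.
(shutdown-agents)
```

Which peer gets which tx number depends on timing, but no two commits ever share a number or interleave, because only one function runs against the agent's state at a time.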

Now, we can look in more detail at a diagram that demonstrates how this entire architecture behaves:

Note in the diagram above that the Transactor is responsible for all activities that require direct communication with the storage service, in addition to controlling data indexing, working with a memcached cluster, and responding to commits. We can therefore say that Datomic provides ACID transactions, an acronym for the four key properties that define a transaction: Atomicity, Consistency, Isolation, and Durability.

Storage Services

"Peers read facts from the Storage Services. The facts the Storage Service returns never change, so Peers do extensive caching. Each Peer's cache represents a partial copy of all the facts in the database. The Peer cache implements a least-recently used policy for discarding data, making it possible to work with databases that won't fit entirely in memory. Once a Peer's working set is cached, there is little or no network traffic for reads." - from Datomic Pro Documentation.

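The least-recently-used policy mentioned in the quote can be modeled in a few lines of plain Clojure. This is a sketch of the general technique, not Datomic's implementation: `lru-cache`, `lookup` and `fetch-segment` (a stand-in for a read from the storage service) are hypothetical names.

```clojure
;; Plain-Clojure model of a least-recently-used cache; not Datomic's implementation.
;; `fetch-segment` is a hypothetical stand-in for a read from the storage service.

(defn lru-cache [capacity]
  {:capacity capacity :order [] :data {}})

(defn lookup
  "Returns [updated-cache value]. On a miss, calls `fetch-segment`; when the
   cache is over capacity, the least recently used key is discarded."
  [{:keys [capacity order data] :as cache} k fetch-segment]
  (let [value (if (contains? data k) (get data k) (fetch-segment k))
        order (conj (vec (remove #{k} order)) k)   ; k becomes the most recent key
        evict (when (> (count order) capacity) (first order))
        order (if evict (subvec order 1) order)
        data  (cond-> (assoc data k value)
                evict (dissoc evict))]
    [(assoc cache :order order :data data) value]))

;; Record every trip to "storage" so the effect of caching is visible.
(def fetches (atom []))

(defn fetch-segment [k]
  (swap! fetches conj k)
  (str "segment-" (name k)))

(def cache
  (reduce (fn [c k] (first (lookup c k fetch-segment)))
          (lru-cache 2)
          [:a :b :a :c]))

(println (:order cache))
(println @fetches)
```

The second read of `:a` is served from the cache, and `:b` is the key that gets discarded when `:c` arrives, because it was used least recently. Caching like this is only safe because the facts being cached never change.
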
Storage Services can be configured however you like; all you need to do is create a properties file (called transactor.properties) describing how your Transactor will be created and managed.

So, if you use PostgreSQL, for example, you will have to configure the driver that will be used, the connection url, username and password. You can also configure values for memory-index-threshold, memory-index-max, object-cache-max, read-concurrency and even write-concurrency. In the end, a table named datomic_kvs is created within your PostgreSQL:



CREATE TABLE datomic_kvs (
  id text NOT NULL,
  rev integer,
  map text,
  val bytea,
  CONSTRAINT pk_id PRIMARY KEY (id)
) WITH (OIDS=FALSE);
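
For the PostgreSQL setup described above, a transactor.properties file might look roughly like the sketch below. The property names are meant to mirror the SQL sample template that ships with the Datomic distribution, so treat them as assumptions and check them against the sample file in your version. The host, port, database name, credentials and memory sizes are all placeholders.

```properties
# Sketch of transactor.properties for PostgreSQL storage (all values are placeholders)
protocol=sql
host=localhost
port=4334

# Connection to the PostgreSQL database that holds the datomic_kvs table
sql-url=jdbc:postgresql://localhost:5432/datomic
sql-user=datomic
sql-password=datomic
sql-driver-class=org.postgresql.Driver

# Memory settings mentioned in the text
memory-index-threshold=32m
memory-index-max=256m
object-cache-max=128m
```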

Data structure

Well, we've already talked about the architecture of how Datomic as a whole works, so now let's better visualize what the basis of Datomic's data structure is like, starting with Datoms.

Datoms

"A datom is an immutable atomic fact that represents the addition or retraction of a relation between an entity, an attribute, a value, and a transaction."

So basically a datom is a simple fact in the log, representing a change to a relation. We can express a datom as a five-tuple:

an entity id (E)
an attribute (A)
a value for the attribute (V)
a transaction id (Tx)
a boolean (Op) indicating whether the datom is being added or retracted

All of these definitions come from the Datomic Cloud documentation. Look at the example below:

E	42
A	:user/favorite-color
V	:blue
Tx	1234
Op	true

Entities

"A Datomic entity provides a lazy, associative view of all the information that can be reached from a Datomic entity id."

Looking into an entity, we can visualize it as a table:

E	A	V	Tx	Op
42	:user/favorite-color	:blue	1234	true
42	:user/first-name	"John"	1234	true
42	:user/last-name	"Doe"	1234	true
42	:user/favorite-color	:green	4567	true
42	:user/favorite-color	:blue	4567	false

The transaction id can be seen as the point in time at which the data was recorded. In the example above we have 1234 and 4567. Look at 1234: in the first row, the :user/favorite-color attribute has the value :blue, with Op set to true. Later, at 4567, the datom with the value :blue has Op set to false, and :green is now the one set to true.

We did not change Op manually. Datomic did it automatically when we updated the value of :user/favorite-color. In other words, Datomic manages these additions and retractions for us, and we know the exact point in time at which :user/favorite-color changed.
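
The table above can be replayed in plain Clojure to see how the same datoms give different answers at different points in time. This is a model only, not the Datomic API: `entity-as-of` is a hypothetical helper, and it assumes every attribute holds a single value.

```clojure
;; Plain-Clojure model; Datomic's own API is not used here.
;; Each datom is a vector [e a v tx op], matching the table above.

(def datoms
  [[42 :user/favorite-color :blue  1234 true]
   [42 :user/first-name     "John" 1234 true]
   [42 :user/last-name      "Doe"  1234 true]
   [42 :user/favorite-color :green 4567 true]
   [42 :user/favorite-color :blue  4567 false]])

(defn entity-as-of
  "Builds the attribute map of entity `eid` from all datoms with tx <= `max-tx`.
   Assumes every attribute holds a single value (cardinality one)."
  [datoms eid max-tx]
  (reduce (fn [entity [e a v tx op]]
            (cond
              (or (not= e eid) (> tx max-tx)) entity           ; not ours, or in the future
              op                              (assoc entity a v)   ; addition
              (= v (get entity a))            (dissoc entity a)    ; retraction of current value
              :else                           entity))
          {}
          ;; Replay in tx order; within a tx, retractions (false) sort before additions.
          (sort-by (fn [[_ _ _ tx op]] [tx op]) datoms)))

(println (entity-as-of datoms 42 1234))
(println (entity-as-of datoms 42 4567))
```

As of 1234 the favorite color is :blue, and as of 4567 it is :green, yet no datom was ever modified or deleted. The Peer API offers this kind of time travel through database filters such as as-of, so in practice you would not write the replay yourself.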

Schemas

As the documentation says: Attributes are defined using the same data model used for application data. That is, attributes are themselves defined by entities with associated attributes.

To define a new attribute, we need to specify:

:db/ident, a name that is unique within the database
:db/cardinality, specifying whether entities can have one or a set of values for the attribute
:db/valueType, the type allowed for an attribute's value
:db/doc (optional), the attribute's description/documentation

Note that :db/ident, :db/cardinality, and the rest are themselves just entities that point to each other. Datomic generates them automatically when the database is created, which means they have default entity ids.
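
Because a schema is just data, an attribute definition can be written as an ordinary Clojure map. In the sketch below the attribute name and doc string are illustrative, and `required-keys` and `missing-keys` are hypothetical helpers for checking the three mandatory keys listed above; they are not part of Datomic.

```clojure
;; Attribute definitions are ordinary data. The attribute name and doc string
;; are illustrative; the :db/* keys are the ones listed above.

(def favorite-color-attr
  {:db/ident       :user/favorite-color
   :db/valueType   :db.type/keyword
   :db/cardinality :db.cardinality/one
   :db/doc         "The user's favorite color"})

;; The three keys every attribute definition needs (:db/doc is optional).
(def required-keys #{:db/ident :db/valueType :db/cardinality})

(defn missing-keys
  "Hypothetical helper: returns the required keys absent from an attribute map."
  [attr]
  (set (remove #(contains? attr %) required-keys)))

(println (missing-keys favorite-color-attr))
(println (missing-keys {:db/ident :user/age}))
```

A map like `favorite-color-attr` would be sent to the database inside a normal transaction, the same way application data is.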

How does a transaction work?

"Every transaction in Datomic is its own entity, making it easy to add facts about why a transaction was added (or who added it, or from where, etc.)"

We have "two options" for transactions: addition or retraction. Every transaction returns the transaction id and the database state before and after the transaction. The forms are:



[:db/add entity-id attribute value]
[:db/retract entity-id attribute value]

As we saw before, transactions are processed in a queue, one at a time.

If a transaction completes successfully, data is committed to the database and we have a transaction report returned as a map with the following keys:

key	usage
:db-before	database value before the transaction
:db-after	database value after the transaction
:tx-data	datoms produced by the transaction
:tempids	map from temporary ids to assigned ids

The database value is like a "snapshot" from the database, as we saw before.

Let's see an example of how :db/add works. Look at the example below:



;; We have this entity
{:internal/id ...
 :internal/value 0
 :internal/key " "}

;; Making a simple transaction
[[:db/add id :internal/value 1]]
;; This will update the value...

;; We can also put several assertions
;; into a single transaction, look:
[[:db/add id :internal/value 1]
 [:db/add id :internal/key "another"]]
;; It will work fine.

;; But when we assert two different values
;; for the same entity and attribute:
[[:db/add id :internal/value 1]
 [:db/add id :internal/value 2]]
;; We will have a conflict

The conflict occurs when a single transaction asserts different values for the same (cardinality-one) attribute of the same entity. That makes sense: all datoms in a transaction share the same point in time, so a fact cannot take two values at once.
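
The conflict rule and the shape of the transaction report can be modeled in plain Clojure. This is a sketch under assumptions, not the Datomic API: the database value is a simple map of entity id to attributes, `conflicts` and `transact` are hypothetical functions, every attribute is treated as cardinality one, and the report leaves out `:tempids` and returns the input forms as `:tx-data` instead of real datoms.

```clojure
;; Plain-Clojure model of a transaction; not the Datomic API.
;; A database value is a map {entity-id {attribute value}} and every
;; attribute is assumed to be cardinality one.

(defn conflicts
  "Returns the [entity attribute] pairs that `tx-data` asserts with more than one value."
  [tx-data]
  (->> tx-data
       (filter (fn [[op]] (= :db/add op)))
       (group-by (fn [[_ e a]] [e a]))
       (keep (fn [[ea adds]]
               (when (> (count (distinct (map #(nth % 3) adds))) 1) ea)))))

(defn transact
  "Applies `tx-data` to `db` as one unit and returns a report that reuses the
   key names of Datomic's transaction report, or throws on a conflict."
  [db tx-data]
  (if-let [clashes (seq (conflicts tx-data))]
    (throw (ex-info "Two datoms in the same transaction conflict"
                    {:conflicts (vec clashes)}))
    {:db-before db
     :db-after  (reduce (fn [acc [op e a v]]
                          (case op
                            :db/add     (assoc-in acc [e a] v)
                            :db/retract (if (= v (get-in acc [e a]))
                                          (update acc e dissoc a)
                                          acc)))
                        db
                        tx-data)
     :tx-data   tx-data}))

(def db {1 {:internal/value 0 :internal/key " "}})

;; Different attributes in one transaction: fine.
(def report
  (transact db [[:db/add 1 :internal/value 1]
                [:db/add 1 :internal/key "another"]]))

(println (:db-after report))

;; Two values for the same entity and attribute: rejected as a whole.
(def outcome
  (try
    (transact db [[:db/add 1 :internal/value 1]
                  [:db/add 1 :internal/value 2]])
    (catch clojure.lang.ExceptionInfo e
      (ex-data e))))

(println outcome)
```

Note that `db` itself is never changed: `:db-before` and `:db-after` are two separate immutable values, which is what makes the "snapshot" idea work.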

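A cool fact: the assertions inside a single transaction are not applied one after another. They take effect together, as one atomic unit, so their order within the transaction does not matter. This is safe because, as we saw before, the same attribute cannot receive two different values in the same transaction.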

Conclusion

This article, the first in a series, aims to introduce Datomic and present its possibilities and advantages for general use. The official Datomic documentation is excellent, so be sure to use it for deeper research! If you want to take your first steps with Datomic, feel free to follow Getting Started in the official documentation. If you want a repository with the code used here, I've made one available on my GitHub (don't forget to give it a star)!

Top comments (12)

Mike Talbot ⭐ • Apr 16 '24 • Edited

I'm hoping there is a way that you can actually delete data somewhere though? Otherwise this would end up being very, very hard to use for a range of use cases.

People frequently add data that they shouldn't to databases, information that it is dangerous or illegal to store because it exposes the party holding the data to significant legal and financial penalties. For example, in a system I used to run, people uploaded pictures of their passports by accident. If deleting that picture didn't actually remove it from the database entirely, this would be a significant issue. Any business that is ISO 27001 certified would have to demonstrate at audit that full removal was possible and reasonably efficient. I know this because we had an event (immutable) database that the information was in, and it took whole reimports to remove this stuff.

In the EU we have GDPR, which includes the right to be forgotten and the right to see all information held about us. This isn't just the most recent information in the database, it is all of the information.

Event based databases have some very interesting characteristics, it would be great to find one that actually could handle real deletion without hoop jumping.

David Sugar • Apr 16 '24

I recall when svn tried promoting immutability as a feature ;).

But there are plenty of actual potential applications for immutable data systems, such as event logs, call detail records, etc, that need the ability to be queried and are often not public facing, but may also face legal mandates to preserve data. The solution to something security related somehow getting into such a system is to have the query / viewer scrub it, rather than to modify the stored data. Many of these kinds of systems historically used encoded binary data, too.

guto • Apr 16 '24 • Edited

Exactly! By default all data managed by Datomic is encrypted and stored in a byte array format. Every writing transaction needs the configured transactor, in addition to needing a peer to read the data.

Furthermore, for privacy reasons you can delete datoms using an excision (docs.datomic.com/pro/reference/exc...). But, in your opinion, what is the general applicability for using immutable data, beyond the example you said about svn? And what would be the possible problems to be faced in this architecture?

guto • Apr 16 '24 • Edited

Sure! You can completely remove datoms using an excision (docs.datomic.com/pro/reference/exc...). It is completely understandable, after all, in Brazil for example, there is the General Data Protection Law (LGPD), responsible for working with data privacy rights.

In one of the next articles in this series I will demonstrate how to use excisions! I hope you enjoy!

Mike Talbot ⭐ • Apr 16 '24

That's great! Thanks for the info :)

Mykola Androsovych • Feb 9 '25

I wouldn't be so sure about the excision forum.datomic.com/t/excision-is-ve...

Cherry Ramatis • Apr 16 '24

I read about Datomic a long time ago, and it never clicked with me. Now, with your awesome didactics, I'm starting to understand it.

Hope to read further continuation of the series! 🫶

J0sueTM • Apr 15 '24

Definitely a great read!

Suami Medeiros • Apr 16 '24

Best content!!! 🚀

Oziel Perez • Apr 18 '24

Why does everything have to be functional nowadays. Let objects be objects!

guto • Apr 19 '24

It's not about bringing in functional programming in general, but about the possibility of applying functional concepts in OOP. I recommend this article written by @cherryramatis about exactly this: dev.to/cherryramatis/ending-the-wa...

And of course, if you don't like this idea, you can write about this by yourself and make propositions to defend this! Be free :)

Joseph Lawson • Apr 22 '24

"Developed in 2010" to compliment a lisp based language released in 2007.

So kinda not just something from nowadays but instead a mature technology getting some coverage.

Functional programming typically exhibits a preference of immutable data operation vs the encapsulation of complexity in oop.

The approaches are different. OOP shields from complexity with object contracts. Functional programming applies a core set of methods yielding programs wrapped around the data and business need.

Whether one adheres to the inherent qualities of the languages is left up to the user.