Dan Lebrero

Posted on Apr 16, 2018 • Originally published at danlebrero.com

Kafka, GDPR and Event Sourcing

#architecture #kafka #gdpr

Image attribution: He Wasn't This Angr by Allison Mickel.

You probably already know that the EU has approved this nice piece of legislation called GDPR (General Data Protection Regulation) that gives us back some control over our personal data.

From a technical point of view, if you have bought into Event Sourcing and Kafka, GDPR’s "right to erasure" (aka "forget everything that you know about me") is of special interest, as it is at odds with the idea of an immutable event log that does not forget anything.

To handle GDPR in an event sourced architecture, here are the most interesting options:

Removing data from projections might be good enough. A suggestion from Michiel Rook's blog is that it may be enough to remove the data from the projections/read models, with no need to touch the data in the event store. If this option is within the law, the "right to erasure" becomes just another event that projections need to handle. A perfect fit for Event Sourcing.
Deleting/updating Kafka messages: Ben Stopford reminds us that in Kafka you can "delete" and "update" messages if you are using a compacted topic. To comply with the "right to erasure", we then need to find all the events for a user and, for each one, send a new message with the same key (the event id) and a null (or updated) payload.

The main concern with this approach is that the event store is no longer immutable, so it will be very tempting to use the same loophole in other non-GDPR situations.
Encryption: Another suggestion from Michiel’s blog is to encrypt all the messages for a particular user with a key, so that when the user wants to exercise their "right to erasure", we just need to forget the encryption key.

The issue with this approach is the key management. In Michiel’s words: "storing, finding and retrieving the right encryption key ... becomes especially interesting at scale". And because it is interesting, let's dive into a possible solution.
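Before getting to key management, the idea itself fits in a few lines. The Python below is a minimal sketch, not a recommendation: the `keys` dict stands in for whatever key store you end up with, all function names are made up for the example, and `toy_cipher` is a standard-library stand-in (an HMAC-SHA256 keystream XORed with the data) for a real authenticated cipher such as AES-GCM. Do not use it to protect real data.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA256 keystream. Illustration only, NOT real crypto.

    Applying it twice with the same key and nonce returns the original data.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        stream += block.digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


keys = {}  # hypothetical key store: user id -> secret key


def encrypt_for(user_id: str, plaintext: bytes) -> Tuple[bytes, bytes]:
    """Encrypt with the user's key, creating the key on first use."""
    key = keys.setdefault(user_id, secrets.token_bytes(32))
    nonce = secrets.token_bytes(16)
    return nonce, toy_cipher(key, nonce, plaintext)


def decrypt_for(user_id: str, nonce: bytes, ciphertext: bytes) -> Optional[bytes]:
    """Return the plaintext, or None when the user's key has been forgotten."""
    key = keys.get(user_id)
    if key is None:
        return None
    return toy_cipher(key, nonce, ciphertext)


def forget(user_id: str) -> None:
    """The whole 'right to erasure' implementation: drop the key."""
    keys.pop(user_id, None)


nonce, event = encrypt_for("user-42", b'{"email": "jane@example.com"}')
assert decrypt_for("user-42", nonce, event) == b'{"email": "jane@example.com"}'
forget("user-42")
assert decrypt_for("user-42", nonce, event) is None  # the stored event is now unreadable
```

The stored event never changes; only the key disappears. Everything hard about the approach is hidden in that innocent-looking `keys` dict.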

Highly available, highly scalable RESTful KeyManagement service

Synchronous HTTPS? Seriously?

The Kafka way

Assuming that you are already storing your data in Kafka, and given that Kafka is able to handle data at scale, why not use Kafka itself to store and retrieve the encryption keys?

Let’s start with a picture of what our architecture could look like:

Your Event Producer is your regular service that pushes unencrypted data to some To-Encrypt topic.

To comply with GDPR, this topic will have a reasonably short time-based retention policy, so that Kafka deletes the unencrypted data after that time. Remember, though, that the retention period must be longer than the expected downtime of the Encryptor service: if the Encryptor is down for longer, Kafka may delete the data before it is encrypted and safely stored in the Encrypted-Data topic.
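One way to keep those two durations honest is to derive the retention from the outage you are willing to tolerate. The Python below is only an illustration: `retention.ms` and `cleanup.policy` are real Kafka topic settings, but the function name, the 12-hour outage figure and the safety factor of 2 are assumptions, and how long unencrypted data may legally sit in the topic is a question for your lawyers, not for the code.

```python
from datetime import timedelta


def to_encrypt_topic_config(max_encryptor_downtime: timedelta,
                            safety_factor: float = 2.0) -> dict:
    """Build topic settings whose retention outlasts an Encryptor outage."""
    retention = max_encryptor_downtime * safety_factor
    return {
        "cleanup.policy": "delete",  # time-based deletion, no compaction
        "retention.ms": str(int(retention.total_seconds() * 1000)),
    }


# Assumed worst case: the Encryptor is down for 12 hours.
config = to_encrypt_topic_config(timedelta(hours=12))
print(config["retention.ms"])  # 86400000, i.e. 24 hours
```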

The Encryptor service will take care of encrypting any message and generating new encryption keys for new users. It leverages Kafka Streams state management to keep a local copy of the encryption keys for the partitions that each instance owns, so that looking up an encryption key will be at most a disk seek.

This application also has to react to users exercising their right to be forgotten, by deleting the local copy of the encryption key from its state and by deleting the encryption key from the Encryption-Keys topic.
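The Encryptor's responsibilities can be sketched without Kafka at all. In the Python below, a plain dict plays the role of the Kafka Streams state store and two lists play the role of the Encrypted-Data and Encryption-Keys topics. The class, its method names, the record shapes and `toy_cipher` (an HMAC-SHA256 keystream standing in for a real cipher such as AES-GCM) are all assumptions made for illustration, not the linked implementation.

```python
import hashlib
import hmac
import secrets


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA256 keystream. Illustration only, NOT real crypto."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        stream += block.digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


class Encryptor:
    def __init__(self):
        self.key_store = {}        # stand-in for the local Kafka Streams state
        self.keys_topic = []       # stand-in for Encryption-Keys: (user_id, key or None)
        self.encrypted_topic = []  # stand-in for Encrypted-Data: (user_id, nonce, ciphertext)

    def on_event(self, user_id: str, payload: bytes) -> None:
        """Encrypt one message from To-Encrypt, creating a key for new users."""
        key = self.key_store.get(user_id)
        if key is None:
            key = secrets.token_bytes(32)
            self.key_store[user_id] = key
            self.keys_topic.append((user_id, key))  # publish the new key
        nonce = secrets.token_bytes(16)
        ciphertext = toy_cipher(key, nonce, payload)
        self.encrypted_topic.append((user_id, nonce, ciphertext))

    def on_forget_me(self, user_id: str) -> None:
        """Drop the local key and publish a tombstone for it."""
        self.key_store.pop(user_id, None)
        self.keys_topic.append((user_id, None))


enc = Encryptor()
enc.on_event("alice", b"first order")
enc.on_event("alice", b"second order")
enc.on_forget_me("alice")
print(enc.keys_topic[-1])  # ('alice', None)
```

Note that this sketch would mint a fresh key if a forgotten user produced new events afterwards. Whether that is the right behaviour is a decision the sketch does not try to make.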

The Encrypted-Data topic will be where the events are stored forever, with no retention policies. This is your event log.

The Encryption-Keys topic will be a compacted topic. When it is time to forget the user, the Encryptor service will just send a tombstone to overwrite the user’s encryption key, so it is lost forever and nobody will be able to decrypt their data again.
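The effect of compaction on the Encryption-Keys topic can be modelled in a few lines. This Python function is a simplified model, not Kafka's log cleaner: real compaction runs in the background, so a superseded key can stay on disk for a while after the tombstone is written, and the tombstones themselves are only dropped later. The `(key, value)` record shape and the name `compact` are assumptions for the example.

```python
from typing import List, Optional, Tuple

# (user id, encryption key), where None as the value marks a tombstone
Record = Tuple[str, Optional[bytes]]


def compact(log: List[Record]) -> List[Record]:
    """Keep the latest record per key; drop keys whose latest value is a tombstone."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    ]
    # Offsets are unique, so sorting restores the original log order.
    return [(key, value) for offset, key, value in sorted(survivors)]


log = [("alice", b"key-a"), ("bob", b"key-b"), ("alice", None)]
print(compact(log))  # [('bob', b'key-b')]
```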

To decrypt the data, the Event Consumer will basically need to join the Encrypted-Data topic with the Encryption-Keys topic. Again, we will rely on Kafka Streams state management to keep a local copy of the encryption keys.

Similar to the Encryptor, the Event Consumer will need to react appropriately when the user requests to be forgotten, by deleting both the local encryption key and any other state associated with that user.
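The join can be pictured as a lookup table that is kept up to date from the Encryption-Keys topic and consulted for every encrypted record. The Python below is a sketch under stated assumptions: dicts stand in for the Kafka Streams state, the class, method names and record shapes are hypothetical, and `toy_cipher` (an HMAC-SHA256 keystream) is a stand-in for a real cipher such as AES-GCM.

```python
import hashlib
import hmac
from typing import Optional


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA256 keystream. Illustration only, NOT real crypto."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        stream += block.digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


class EventConsumer:
    def __init__(self):
        self.key_table = {}   # stand-in for the local key state: user_id -> key
        self.user_state = {}  # anything else derived per user (here: decrypted events)

    def on_key_record(self, user_id: str, key: Optional[bytes]) -> None:
        """Apply one record from Encryption-Keys; None is a tombstone."""
        if key is None:
            self.key_table.pop(user_id, None)
            self.user_state.pop(user_id, None)  # forget derived state too
        else:
            self.key_table[user_id] = key

    def on_encrypted_record(self, user_id: str, nonce: bytes,
                            ciphertext: bytes) -> Optional[bytes]:
        """Join one record from Encrypted-Data against the key table."""
        key = self.key_table.get(user_id)
        if key is None:
            return None  # no key available: the record cannot be read
        plaintext = toy_cipher(key, nonce, ciphertext)
        self.user_state.setdefault(user_id, []).append(plaintext)
        return plaintext


key, nonce = b"k" * 32, b"n" * 16
record = ("alice", nonce, toy_cipher(key, nonce, b"order placed"))
consumer = EventConsumer()
consumer.on_key_record("alice", key)
print(consumer.on_encrypted_record(*record))  # b'order placed'
consumer.on_key_record("alice", None)
print(consumer.on_encrypted_record(*record))  # None
```

The sketch treats a missing key the same way whether the user was forgotten or the key simply has not arrived yet. A real consumer would have to tell those two cases apart.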

This architecture looks fabulous from this ivory tower.

Image attribution: The Ivory Tower by Peter Bartels.

Implementation details

If you want to get your hands dirty, the implementation details are here.

Conclusions

In summary, we comply with GDPR because our To-Encrypt topic has a short time-based retention policy, our encryption keys are in a compacted topic, and our event log is encrypted with a per-user encryption key.

Also, our applications have to handle a new "forget me" event type and erase any PII data that they may store.

As we saw, the implementation is not rocket science, but it raises some more challenges:

Do we encrypt the whole message or just a subset? If it is just a subset, how do we handle schemas? If not a subset, we lose all the data, even the non-PII one.
Can we reuse the same encryptor for multiple topics? If so, the topics must be co-partitioned. If not, we will need to separate the key generation from the encryptors, so that the encryption keys can be repartitioned.
Even if the decryption is transparent to the consumer, it still needs to handle the "forget me" special case.
You will need to choose an encryption algorithm that is fast enough and secure enough. Can you afford an additional 1 or 10 milliseconds of processing time for each message? In theory, if the consumer is up to date, it can always consume directly from the To-Encrypt topic.
A comment in Michiel's blog points out that forgetting the key is not enough. Every few years, we also need to update the encryption algorithms, which means re-encrypting everything.
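The first question in the list is easier to weigh with a concrete shape in mind. As one possible answer, this Python sketch encrypts only the fields listed as PII and leaves the rest readable, so the non-PII part survives once the key is gone. The field names, the envelope layout, the function names and `toy_cipher` (an HMAC-SHA256 keystream standing in for a real cipher such as AES-GCM) are all assumptions for illustration.

```python
import hashlib
import hmac
import json
from typing import Optional


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with an HMAC-SHA256 keystream. Illustration only, NOT real crypto."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256)
        stream += block.digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


def protect(event: dict, pii_fields: set, key: bytes, nonce: bytes) -> dict:
    """Split an event into a readable part and an encrypted PII part."""
    public = {k: v for k, v in event.items() if k not in pii_fields}
    private = {k: v for k, v in event.items() if k in pii_fields}
    blob = toy_cipher(key, nonce, json.dumps(private, sort_keys=True).encode("utf-8"))
    return {"public": public, "nonce": nonce.hex(), "pii": blob.hex()}


def reveal(envelope: dict, key: Optional[bytes]) -> dict:
    """Rebuild the event; without a key only the non-PII fields come back."""
    event = dict(envelope["public"])
    if key is not None:
        nonce = bytes.fromhex(envelope["nonce"])
        blob = bytes.fromhex(envelope["pii"])
        event.update(json.loads(toy_cipher(key, nonce, blob)))
    return event


key, nonce = b"k" * 32, b"n" * 16
stored = protect({"order_id": 7, "email": "jane@example.com"}, {"email"}, key, nonce)
print(reveal(stored, key))   # {'order_id': 7, 'email': 'jane@example.com'}
print(reveal(stored, None))  # {'order_id': 7}
```

The price is visible in the envelope: producers and consumers now have to agree on which fields are PII, which is exactly the schema question raised above.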

So it seems possible to use encryption to handle event sourcing data in Kafka, but is it better than the other options? For sure it is worse than removing data from projections, if this is an option at all. But, is it better than just using a compacted topic to store the event log as Ben Stopford suggests?

Well, how much do you value immutability? That much?!?! That little?!?!

Top comments (5)

Kasey Speakman • Apr 17 '18 • Edited

I dunno about Kafka, since it does not work for me as an event store (or at least the kind I have needed so far). But in SQL-based stores, you can delete the stream pretty easily with DELETE FROM EventLog where StreamId = ?. And in EventStore you can hard delete a stream and scavenge to remove the events. But in either case, you should probably write an event to the end of the stream signifying that the user requested removal and wait for the read models to process it and remove the data from their storage first. Suddenly deleting or modifying streams does not signal the projections to do likewise.

Dan Lebrero • Apr 17 '18

Thanks a lot for the comment!

I think that using Kafka means that you need to change how you design your architecture. You cannot follow the "read from DB", "update in memory" and "write result to DB" model anymore. You have to embrace event based architectures. I think I cover both of your concerns here.

As with SQL-based stores, it is possible to delete events from Kafka, but the point is that you lose the immutability guarantees, which means that you open the door to "updating" events for other reasons unrelated to GDPR. With SQL-based stores and Kafka compacted topics you have to rely on the team's discipline not to misuse the mutability of the store. With regular Kafka topics you simply cannot touch the events, so you are sure that nobody has manipulated them.

Even if I think that immutability is better, as it gives you strong guarantees, I think the additional complexity caused by GDPR may make it impractical.

Kasey Speakman • Apr 17 '18 • Edited

Thanks for the article.

I do not think it addresses the concerns, as isolation between entities is still a large problem (How do I check state of a single entity in order to validate and fulfill a request? What happens when the state structure of the entity needs to change due to new features?). The deletion problem in Kafka highlights the isolation issue. Since you cannot do topic per entity feasibly, you are forced to mutate the topic with no real audit guarantees that only a single entity is affected. Kafka was designed for a large problem, so it doesn't suit the small granularity of this requirement.

I do embrace event-based architectures, and Kafka has an important role to play there. But I don't think it is the right tool for everything.

This is a bit of a derail from the original topic, and I apologize for that. If you want to discuss it further we can do so in a separate article or you can email me. kasey symbolic-at cornerspeed dee-oh-tee com

Stéphane Bisinger • Apr 17 '18

Although the risk of abuse is quite high, I would personally opt to go around the immutability in this particular instance. Removing it from the projections is probably not enough and the encryption system creates more problems than it solves.

Or, maybe, it's time to see if there could be better options to handle this than Kafka?

Thanks for the interesting read!

Dan Lebrero • Apr 17 '18

For my current context, I am also leaning towards mutability, as we are not a big enough team to handle the additional complexity of encryption, but the potential for abuse scares me.

And yes, maybe Kafka is not the best choice. If you use Java, the Axon framework is going to support something similar out of the box: slideshare.net/Frans_van_Buul/axon...

Cheers!

Dan