This article is an extension of a consulting session I undertook with one of the start-ups participating in the Google Launchpad program. If you have GDPR, cryptography, or data protection questions, please feel free to reach out to me.
Personally Identifiable Information (PII) is afforded extra protection under the General Data Protection Regulation (GDPR), and whilst encryption isn't mandated by the GDPR, it can be a great way to check off a number of GDPR requirements with minimal technical overhead.
This document assumes that we want to comply with at least the following articles, and places a strong focus on how to do it with as few invasive technical changes as possible:
- Art. 17 GDPR "Right to erasure"
- Art. 33 GDPR "Notification of a personal data breach to the supervisory authority"
Article 17 can be summarized as:
- You must delete all data if the person detailed in that data requests it, and
- You must not store data you don't need anymore (e.g. delete order history after 6 months, chat history after some days, etc.)
Article 33 lays out in detail what the data-holding company must do to announce any data breach to the authorities (Article 34 details what companies must do to notify the data subject).
Both these articles are fairly short, and as a private citizen, or "data subject" in GDPR terms, it's reassuring to know that the GDPR mandates much more protection than most web applications and sites usually provide; having worked on back-end systems myself for over a decade, I can attest to that.
Data is not the new gold, data is the new uranium.— Filippo Valsorda (@FiloSottile) August 16, 2019
Sometimes you can make money from it, but it can be radioactive, it's dangerous to store, has military uses, you generally don't want to concentrate it too much, and it's regulated.
Why keep uranium you don't need?
The right to erasure looks simple on the surface: if a data subject requests it, all data related to their account must be removed. At trivial scale this is as simple as it looks, but once other systems start to creep in (3rd-party chat solutions for customer support, help-desk software, email list management software, etc.) it becomes more and more difficult to know to whom a piece of data really relates.
Assuming the 3rd party is GDPR and/or ISO/IEC 27001 compliant, your bases are covered: they in turn must be encrypting and/or protecting their data adequately, and when you request a deletion on behalf of your customer, they must comply with your request, just as you must comply with the "downstream request". Problem solved 🤝.
Art. 17 3.d makes a note about historical archival, or archival in the public interest, qualifying as an exception to the data subject's right to be forgotten, but your archival and backup processes must still comply.
Whilst it may technically be possible to re-open backups in cold storage and rewrite them to exclude user data, I strongly believe that the risks of modifying backups in this way are so high as to render the backups useless if they are modified after storage.
In both cases of Art. 17 1.a and 1.b (stale data, and withdrawal of consent), management of data in cold storage or backups is extremely problematic.
There's no explicit mention I could find in the text of the GDPR, but protecting against accidental copying or malicious theft of data is also critical to ensuring integrity, and this can be summarized as "log who accesses what data, and when" (and act upon suspicious behaviour). This is touched upon in Art. 32 "Security of processing", but the theme recurs throughout the text of the GDPR: if a basic level of logging of which data was accessed is not in place, most other articles are unimplementable.
Given that we will, at some point, need to correlate data with 3rd-party systems, that we have "write-only" storage in our backups (whether enforced physically through read-only media, or through policies), and that we want to avoid ever finding ourselves in an "Article 33" situation, how can we possibly comply? Not only will our production database instance be a goldmine for would-be hackers, but so will our backups, which we can't change!
The obvious answer is cryptography. The GDPR notes that if encrypted subject data is lost, Article 33 doesn't apply, unless the keys to that data have been breached too, in which case it is to be handled as if an unencrypted database had been breached.
Additionally, whilst not yet tested before a judge, it is strongly suspected that destroying keys to encrypted data is akin to destroying the data itself.
All of the advice I am about to give is predicated on one golden rule: never implement your own cryptographic code. It is so easy to make a tiny, seemingly insignificant mistake which renders all of your data worthless, so stick to standards such as those laid out by NIST or ENISA, which mostly means:
- Use a slow hash for passwords (e.g. bcrypt over MD5/SHA). Checksums such as MD5 and SHA-1 are designed to be fast, which means that if you experience a breach, an attacker can provision lots of GPU resources on a cloud platform and probably crack most of your user passwords within hours. bcrypt (and others, such as scrypt and Argon2) are designed to be deliberately slow, and have a concept of both salting and a configurable work factor, which raises the security of the password by many, many thousands of times. It may be possible to check billions of SHA-1/MD5 hashed passwords per second on modern GPU hardware, but only a handful per second with a correctly configured bcrypt implementation.
- Use sufficiently large keys, whether using symmetric or asymmetric cryptography. As computers get more powerful, brute-force attacks on encrypted data become more feasible. Minimum key size recommendations differ by orders of magnitude depending on whether the cryptography in use is symmetric or asymmetric (a 15360-bit asymmetric key is roughly equivalent to a 256-bit symmetric key). For RSA (asymmetric) a key size of 2048 bits is recommended as an absolute minimum, with 3072 bits recommended for data which should remain secure beyond the year 2030. For all practical purposes 4096-bit keys are as easily used as 2048- or 3072-bit keys, and offer protection that should stand for the foreseeable future.
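As an aside on the slow-hash recommendation: bcrypt isn't in Python's standard library, but `hashlib.scrypt` is, and it is a comparable deliberately slow, salted, memory-hard function. A minimal sketch (the cost parameters `n`, `r`, `p` here are illustrative, not a tuned production configuration):

```python
import hashlib
import hmac
import os

def hash_password(password: str, salt: bytes = None) -> tuple:
    """Derive a deliberately slow, salted hash; cost parameters are illustrative."""
    salt = salt or os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt,
                            n=2**14, r=8, p=1, maxmem=2**26)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)  # constant-time comparison

salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("letmein", salt, stored))                       # False
```

Note that only the salt and digest are stored; the password itself never needs to be persisted, and `hmac.compare_digest` avoids leaking information through comparison timing.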
Practice good key hygiene, and don't write your own cryptography code. Rely on libraries from trusted vendors, such as your language's standard library or NaCl (which has bindings for many languages). Treat your keys exceptionally carefully, and never store them outside of secure storage. That includes, for example, not writing them to log files like:
Could not decrypt '01010101010101010101010101010101' with key '0xFFFFFFFF'.
Let's walk through how to apply a key management system (KMS) such as AWS Key Management Service, Google Cloud Key Management Service, Microsoft Azure Key Vault or HashiCorp's Vault to provide the following:
- Secure data in the "production database" as well as in "at rest" places such as backups to protect our customers and subjects from an Art. 33 GDPR "data breach" case.
- Access logging, to know who had access to what data, via what APIs, and when (this gives us a strong basis to assist the authorities, and to indemnify ourselves in case of a breach which needs to be investigated).
- Near automatic implementation of deletion of older, or consent-withdrawn data.
We'll assume this schema, and try and categorize the data inside:
The first step is easy: let's identify all the "relational" IDs/references; we can't encrypt those, or else our software won't work. The remaining anonymous data may still be enough to leak private information, but there's not much we can do about that: even best-in-class anonymization techniques can, given enough correlated data, nearly always uniquely identify people.
We won't touch these IDs, and we won't touch timestamps. Helpfully all the references in the tool I used to make the schema graphic are coloured in pale blue, and all the timestamps end with
We then need to define a scheme for which keys to set up. As a rule you want to use keys sparingly, but in a way that groups related data. One key per "entity" is probably fine; it doesn't make security or economic sense to have one key for each field in a record.
In our schema I would recommend one key for each user, order and merchant.
The products table here is common and shared, and there's no need to encrypt it; one might expect that this is the public listing of products for sale, so it's reasonable to leave it "in the clear" in the database, too.
This might raise the question of whether order_items.product_id should perhaps be encrypted, contrary to our earlier broad idea of leaving references in the clear. Encrypting it would make it difficult to implement a "search for orders by product" feature in your application, but would protect the data subject more in case of a breach.
Merchants are also data subjects in your platform; they may require the same protection as your users.
This is all in line with the GDPR principle of "data protection by design and by default" (Art. 25 GDPR).
Three distinct kinds of keys, then, in the first iteration:
- One key for each user
- One key for each merchant
- One key for each order (or one key per group of orders)
So in the simplest system, where one user has placed two orders from one merchant, we have four keys to manage.
Something I like to call a "key hierarchy" can be really powerful if designed in at this point.
Let's say a user wants to be deleted under Article 17: we can delete their key from the KMS, rendering their data unusable. If we encrypt the order keys with the user key, then deleting the user key is sufficient to destroy the user-order data too.
That would simplify the book-keeping of keys, by having data "belonging to" a data subject encrypted with an encrypted key.
This might sound like back-flips, but when processing an order we have the order ID and the user ID in clear text, and can easily pull two keys out of the KMS, using one key to decrypt the other; if either key is removed from the KMS (the order key, because the order is n months old and stale, or the user key under Art. 17) then our GDPR obligations are fulfilled, for current and historical data.
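To make the key hierarchy concrete, here is a minimal sketch in Python. The `xor32` "cipher" is a toy stand-in for real encryption (never use XOR like this in production), and the `kms_keys` dict stands in for key material held inside a real KMS; the point is purely the book-keeping: once the user's KMS key is destroyed, the wrapped order key, and every piece of data it protected, is unrecoverable.

```python
import secrets

def xor32(key: bytes, data: bytes) -> bytes:
    # Toy stand-in for real encryption (AES); for illustration only.
    return bytes(k ^ d for k, d in zip(key, data))

kms_keys = {"user-123": secrets.token_bytes(32)}  # key material held inside the KMS

order_key = secrets.token_bytes(32)               # per-order data key
wrapped_order_key = xor32(kms_keys["user-123"], order_key)  # safe to store in our DB

# Normal operation: unwrap the order key via the user key.
assert xor32(kms_keys["user-123"], wrapped_order_key) == order_key

# Art. 17 request: destroy the user's key inside the KMS...
del kms_keys["user-123"]
# ...and the order key (and the data it encrypted) is cryptographically erased.
```

Notice that the application database only ever sees `wrapped_order_key`; deleting one entry in the KMS cascades to every order key wrapped beneath it.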
A KMS can only encrypt a limited amount of data, so if you plan to encrypt more than 4 KiB (4096 bytes) it may be necessary to generate another key. This should be familiar to anyone who knows hybrid encryption, where RSA (asymmetric), which is slow and extremely robust but can't encrypt much data, is used to encrypt an AES (symmetric) key, which is much faster and can encrypt data of arbitrary length with virtually no loss of security.
For ease, I would suggest using the KMS to encrypt/decrypt an AES key which you store in your own database; this is simpler, more flexible, and helps keep costs under control, because you can use one key to encrypt a larger volume of data. AES keys are more like pass-phrases than key pairs: the same key is used to encrypt and decrypt, that's the "symmetric" part.
If using a cloud-based KMS, you can create these keys at the same time as orders and user or merchant registrations are coming in. For a combined signup-and-order request, your steps might look like this:
Reach out to the KMS and create two keys. Remember the key IDs / ARNs which come back (e.g. 1234abcd-12ab-34cd-56ef-1234567890ab). For each key we'll need to select "who" can administer and use the key (see below), but the "whos" in this case should be individual pieces of software in your infrastructure. This approach is predicated on using separate IAM accounts for backup processes, nightly scheduled jobs, the daily running of the platform, and any administrative tasks.
a. Choose which identity and access management (IAM) roles may access this key. We might set up individual roles for the admin back-office application, various other internal-only systems that are distinct from the daily-business tools, and possibly even an IAM role for the user themselves (i.e. the data subject). Creating IAM accounts for your customers may seem unusual, but they also have a right to access their own data, and it makes for cleaner logs if, when checking their own order history, they leave a trace in the audit trail as themselves rather than as the backend application. Anyone (any IAM role) who can access the key will be able to ask the KMS to decrypt the user key and then decrypt the data. Do this programmatically from your application code.
b. Choose who can administer the key. Creating IAM roles for your data subjects makes more sense when you learn that you can make them an administrator of their own key. You never have to tell them, but you could wire the "delete your account" flow directly to the IAM and KMS, ensuring that when a user requests it, their data (via the KMS) is deleted without any human interaction, and without your company having to dedicate any staff resources to this niche of GDPR compliance. Do this programmatically from your application code.
Generate AES keys, one for each of the key IDs that we received from the KMS. We use these keys, the AES keys, to encrypt our data; the KMS itself can only encrypt a small amount of data (up to 4 KiB). AES has no such limitation. An AES-256 key is just 32 random bytes, which we can generate by reading from our system's secure random source.
Replace sensitive fields in the incoming user or order data with their encrypted counterparts (encrypted with our AES key).
Ask the KMS to encrypt our AES keys. This is how we use the KMS securely: we won't store our AES keys in plain text, so anyone stealing our database will have encrypted data (encrypted with our AES key) and an encrypted copy of our AES key, but they won't have the means to decrypt the AES key, or the user data protected with that key, without access to the KMS.
a. Optional: use the user AES key to encrypt the order AES key, and store that double-encrypted key. This doesn't buy any extra security, but means we can destroy our own access to order and user data by destroying the user key alone.
Store the ARN (Amazon Resource Name) and the encrypted AES key data from the KMS alongside the
Laid out in excruciating detail, this looks like a huge burden, but in code it's much, much friendlier.
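The steps above can be sketched end-to-end. `FakeKMS` here is a hypothetical in-memory stand-in for a real KMS (its ARN format and the `toy_cipher` XOR "encryption" are purely illustrative, not real AWS behaviour or real AES); with a real provider you would call the KMS API instead:

```python
import secrets

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy XOR stand-in for AES; illustrative only, never use in production."""
    return bytes(k ^ d for k, d in zip(key, data))

class FakeKMS:
    """In-memory stand-in for a real KMS (AWS KMS, Google Cloud KMS, ...)."""
    def __init__(self):
        self._keys = {}

    def create_key(self) -> str:
        arn = f"arn:fake:kms:key/{secrets.token_hex(8)}"  # hypothetical ARN format
        self._keys[arn] = secrets.token_bytes(32)
        return arn

    def encrypt(self, arn: str, plaintext: bytes) -> bytes:
        return toy_cipher(self._keys[arn], plaintext)

    def decrypt(self, arn: str, ciphertext: bytes) -> bytes:
        return toy_cipher(self._keys[arn], ciphertext)

kms = FakeKMS()

# Step 1: create a KMS key for the new user, remember the ARN.
user_arn = kms.create_key()

# Step 2: generate an AES data key locally (32 random bytes for AES-256).
aes_key = secrets.token_bytes(32)

# Step 3: encrypt the sensitive fields with the AES key.
email_ct = toy_cipher(aes_key, b"alice@example.com")

# Step 4: ask the KMS to encrypt (wrap) the AES key; store only the wrapped copy.
wrapped_key = kms.encrypt(user_arn, aes_key)
user_row = {"id": 123, "email": email_ct, "key_arn": user_arn, "aes_key": wrapped_key}

# Read path: unwrap via the KMS, decrypt the field, never persist the plain AES key.
plain_key = kms.decrypt(user_row["key_arn"], user_row["aes_key"])
print(toy_cipher(plain_key, user_row["email"]))  # b'alice@example.com'
```

The database row holds only ciphertext plus an ARN pointing into the KMS; without KMS access, neither the field nor the wrapped AES key can be recovered.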
This means we'll have a database that looks something like this in the end:
Not super valuable to hackers!
The set-up case is convoluted, but the reverse case is actually much, much simpler. Assume that your software is running with access to an IAM identity which can read the keys, e.g. this is the
- Recognise that we have a legitimate need to display User 123's order number 456
- Load the data from the database, just as we do now.
- Display the page with encrypted data place-holders.
This case might be an instance of someone in customer service wanting to check whether a customer's order has shipped yet whilst the customer is on the support hotline; we don't need to decrypt all their personal data for that!
If there's no need to decrypt the data, let's not do it; then we don't leave a trace in the audit logs! If we do need to access the data, however, it's easy to unpick.
Our fields are encrypted with our AES key, and that AES key is encrypted with the corresponding KMS key identified by the ARN we stored.
- Ask the KMS to decrypt aes_key, using the KMS key identified by the stored ARN.
- Get back the decrypted aes_key(s) (do not store them), and use the aes_key(s) to decrypt and display user and/or order data.
- Access from our IAM account (e.g. the admin application) is logged in the audit log for the key(s) used.
At the time of key creation, keys can be set to auto-destruct after a period of time, and then be "kept alive" by extending that time before it expires. This allows a "dead man's switch" to be used, rather than requiring clean-up scripts or processes to be put in place.
A KMS isn't expensive to use, but it isn't "free" either. If a user is likely to place a high number of orders, complete a large number of quizzes, start a large number of conversations, or create any large number of things that makes key book-keeping sound daunting, consider another approach, such as using one common key for user orders placed within a calendar week or month: any easily derived static name which can be used to look up the data. (Just think about how your bank doesn't let you access on-line transaction data older than 90 or 180 days...)
A CMK (the kind of key we've been talking about creating) costs $1/key/month to store in Amazon KMS, plus ongoing access costs. If this number is prohibitively high, then finding a less fine-grained key scheme might be advisable. Google KMS keys, by comparison, cost $0.06/key/month, which is much more suitable for this use-case. I wrote this guide from the perspective of AWS, which I know better, although the start-up in question will be hosting on Google Cloud, which makes this level of compliance and security much, much more affordable.
ℹ AWS KMS costs $0.03 per 10,000 requests, which is significantly less expensive than the safe handling of unencrypted data!
Don't worry if this seems daunting; many of the principles here are familiar to professional developers, but prior to the GDPR there wasn't much incentive to implement them quite like this. As a consumer and "data subject", as well as a data security professional, I am thrilled to see this level of security applied to user data. It took our industry about 10 years to get most developers to use real password hashing algorithms rather than fast checksumming algorithms, and some environmental pressure imposed by the GDPR should really help accelerate the adoption of good data practices in general.
My company offers consulting services in this area, and we are always available for a chat in case you need more information, or for contracts evaluating data migration and mitigation of risks under the GDPR.