DEV Community

Sacha Greif

Disclosing a State of JavaScript/State of CSS Data Leak

Every year tens of thousands of respondents trust the State of JavaScript and State of CSS surveys with their data, some of it quite personal and sensitive, and I'm fully conscious of the responsibility this represents.

So ever since starting to run the surveys, I've hoped that I would never have to write the dreaded "data leak" post. But sadly today is the day I need to address this issue.

TL;DR

An encryption key that makes it possible to decrypt publicly-available encrypted email addresses and link them to survey responses was mistakenly committed to a public GitHub repo.

Key Points

  • This is a human error, not a malicious attack.
  • The leak is now closed.
  • You are affected if you answered a State of JS or State of CSS survey up to and including 2020 (the 2021 JS and CSS surveys are not affected).
  • So far there is no evidence that the mistake was actually exploited, but I'll keep monitoring the situation.
  • Passwords were not affected as they use a completely separate hashing mechanism.

What Happened

This situation resulted from three separate mistakes:

  1. Two years ago, I decided to add email hashes (or so I thought) to the publicly available survey response datasets (for surveys up until 2020; the 2021 datasets were not published yet), in order to use them as IDs and make it possible to track how a given respondent's answers were evolving over time.
  2. An open-source contributor contributed the function that generates those "hashes" and used a 2-way encryption function. Somehow, over time, I came to assume that it was instead a 1-way hashing function.
  3. About a month ago, another open-source contributor committed private credentials (which included the encryption function's key) to a public repo while working on a separate project. Although the contributor noticed the issue and scrubbed the history right away, the faulty commit apparently stayed accessible by itself as a "ghost commit" outside of a branch.
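
To make the distinction behind the second mistake concrete, here is a minimal Python sketch. Everything in it is hypothetical: the toy cipher, the key, and the email address are for illustration only, the cipher is not secure, and this is not the function the survey app actually used. It only shows that a 2-way function can be reversed by anyone holding the key, while a 1-way hash has no decrypt step at all.

```python
import hashlib


def toy_keystream(key: bytes, length: int) -> bytes:
    """Derive a byte stream from the key (illustration only, NOT secure)."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return stream[:length]


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """2-way: XOR the plaintext with a stream derived from the key."""
    keystream = toy_keystream(key, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


def toy_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """XOR is its own inverse, so decrypting repeats the same operation."""
    return toy_encrypt(key, ciphertext)


def one_way_hash(value: bytes) -> str:
    """1-way: there is no key and no function that maps the digest back."""
    return hashlib.sha256(value).hexdigest()


email = b"respondent@example.com"   # hypothetical address
key = b"hypothetical-secret-key"    # hypothetical key

ciphertext = toy_encrypt(key, email)
recovered = toy_decrypt(key, ciphertext)  # anyone with the key gets the email back
digest = one_way_hash(email)              # cannot be "decrypted", only guessed at
```

With a 2-way function, the published values are only as private as the key, which is why the leaked key mattered here.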

Both because of the holidays, and because I didn't realize the consequences of the leak right away, the encryption key stayed accessible in theory for about a month.

What This Means For You

The risks to survey respondents are two-fold:

  1. Someone could use the dataset to generate an email list used for spamming purposes.
  2. Someone could link personal data (salary, etc.) to the email address you used.

Was the Leak Exploited?

The "good" news is that the repo the key was committed to is very low traffic and had no forks, watchers, or stars, making it less likely that ill-intentioned people randomly stumbled on the encryption key.

Moreover, even with the key in hand an attacker would've had to then figure out where the key was being used (which happens in a separate repo); what it was being used for; and where the relevant encrypted emails were made available; none of which is obvious unless one is already familiar with the project.

So while I don't have any way to tell with certainty if anybody actually went through the process of decrypting the encrypted emails and correlating responses with them, I personally think the probability of this happening is fairly low. But I apologize for not being able to give you more certainty.

Steps Taken

I've taken the following steps:

  1. Stopped using the leaked encryption key.
  2. Made the repo private so that the encryption key is not accessible anymore.
  3. Took down the public datasets containing the encrypted emails until I can re-upload versions without them.

Note: if you happen to have a copy of the datasets or are hosting a mirror, please get in touch or delete your copies if you can!

In the future, I will also focus on making it possible to complete the survey without having to provide an email, which is something that survey respondents have often asked for.

Ironically enough, the leak happened in the process of migrating the survey app to a newer, more robust codebase in order to make it easier to change the way accounts work.

Going Forward

The surveys are an open-source project, created in the open by a mostly-volunteer group of contributors from around the world. And while this can sometimes make it tougher to properly coordinate and avoid situations like this one, I also think being community-driven is one of the project's major strengths.

So while it's totally understandable if a leak like this one makes you question sharing any data with us in the future, I hope you'll be able to give the project another chance.

And if you're not fully comfortable sharing personal information just yet, here's a reminder that you can always skip any question in any survey. Another thing that might put you more at ease might be to use an email alias that can't easily be tied back to you.

I deeply apologize again, and if you have any questions about this whole thing, just leave a comment here and I'll do my best to answer.

Note: I am very grateful to Troy Hunt for pointing me to this great article about the proper way to handle such matters. I recommend it if you ever end up in the same situation!

Discussion (34)

Jesper van den Ende

Thanks for being so transparent about this! I reckon most companies don’t even bother disclosing anything until they know for certain data was actually decrypted by someone. Hell, I’ve seen companies actively downplay the severity of a situation even when they know for sure passwords have been leaked.

Sacha Greif Author

Well, without people’s trust the surveys can’t really work. So I’ve always tried to do everything in the open from the start. Thanks for the kind words!

John van der Loo

Thanks for the transparency and clear communication. I would imagine it's a tough and nerve-wracking experience to post this article, so thank you also for your courage to show the (IMHO) right way to handle this.

A+++ would answer survey again.

td540

Honest mistake, commendable recovery. (Who’s ever gonna misuse that data anyway. Let's hope only people who still use too many float:left's and too many !important's get spammed with beginner CSS tutorials! Sorry stupid joke.)

Piotr "MoroS" Mrożek

An e-mail address in combination with your development preferences could be used to target customized phishing attacks against devs. We can be an attractive target, given IT is one of the best paying industries out there. That being said, we're also one of the most aware, and thanks to Sacha's quick and honest reaction, we're now aware that things like that can take place.

Matronator

"given IT is one of the best paying industries out there."

Lol, not where I work at... 😅🥲😣

Fernanda Vilela • Edited

Human errors happen all the time, unfortunately. On the other hand, transparency is a rare value, thank you very much for being worthy of trust because of your honesty!

SteveALee

+1000

cherryblossom000

Thank you for being so transparent and honest about this! Everyone makes mistakes and I appreciate the effort that goes into these surveys every year.

Sacha Greif Author

Thanks for your kind words!

Lachlan Hunt

Please ensure you consult experts on security and privacy before choosing a new approach, and also seek community feedback once you come up with a new plan.

For example, it’s not enough to simply use an ordinary one way hash of email addresses, because nothing stops an adversary simply applying the same function to some publicly known email addresses and looking for matches in your dataset. I suspect this is probably what the original developer had in mind when they chose an encryption function instead.
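
The matching attack described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dataset rows, field names, and addresses are hypothetical, and SHA-256 stands in for whatever "ordinary" unkeyed hash might be used.

```python
import hashlib


def email_hash(email: str) -> str:
    """Plain, unkeyed hash of a normalized email address."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


# Hypothetical published dataset: a hashed email next to survey answers.
published_rows = [
    {"email_hash": email_hash("alice@example.com"), "salary_range": "50-100k"},
    {"email_hash": email_hash("bob@example.com"), "salary_range": "100-200k"},
]

# Addresses the adversary already knows from some other public source.
known_emails = ["carol@example.com", "alice@example.com"]


def reidentify(rows: list, candidates: list) -> dict:
    """Hash every known address and look for the same value in the dataset."""
    lookup = {email_hash(address): address for address in candidates}
    return {
        lookup[row["email_hash"]]: row
        for row in rows
        if row["email_hash"] in lookup
    }


matches = reidentify(published_rows, known_emails)
# alice@example.com is now linked to her answers; no key was ever needed.
```

Because the hash function is public and takes no secret, the only thing protecting a respondent is whether the adversary thinks to try their address.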

Sacha Greif Author

Yes, we will not publish hashes at all going forward. We do need to store one way email hashes privately for log in purposes, but they won’t be part of any public dataset.

Daniel Lo Nigro

This is a good reminder to not store encryption keys in a repo. Ideally use something like Hashicorp Vault, but at least don't store them in files within the repo.

SteveALee

Hosting systems like Netlify, Azure, etc. let you provide secrets via their UI, which can then be accessed from code through the process environment (process.env in Node).
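
As a rough sketch of that pattern, written in Python where os.environ plays the role of Node's process.env: the variable name ENCRYPTION_KEY and the helper name are hypothetical, chosen only for this example.

```python
import os


def load_encryption_key(env=os.environ) -> bytes:
    """Read the secret from the process environment instead of a repo file.

    ENCRYPTION_KEY is a hypothetical variable name used for illustration.
    """
    key = env.get("ENCRYPTION_KEY")
    if not key:
        # Fail loudly rather than falling back to a default committed in code.
        raise RuntimeError("ENCRYPTION_KEY is not set in the environment")
    return key.encode("utf-8")
```

Passing the environment in as a parameter keeps the function easy to test with a plain dictionary, and the secret itself never appears in a tracked file.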

Sacha Greif Author

I'm not a huge fan of this solution either (it can lead to a lot of unsecure copy/pasting into Slack or Dropbox when you need to share the secrets, multiplying the number of places the secret exists) but it's true it would have avoided the problem in this specific case.

SteveALee

It always comes back to that human error of the Post-it on the monitor with the password. Lol

Magne

What was the original motivation to "track how a given respondent's answers were evolving over time"?

Sacha Greif Author

Let’s imagine that in 2020 Famous React Developer Foo shares the survey and brings in their audience; and then in 2021 Famous Vue Developer Bar shares the survey and in turn brings in their audience. In theory you could have shifts in survey answers just because different people are answering the survey. By adding these ids my idea was that you could isolate a constant cohort of respondents if you wanted to remove the influence of audience shifts.

Magne

Thanks for the quick reply. Two concerns:

People generally have the expectation that such surveys are anonymous and that results are only gathered in aggregate. It is also safer. It looks like you have realized the value of these propositions.

So: Wouldn't you still be able to achieve your intention by survey respondents revealing their previous framework exposure? Like checking "I have mostly experience with..." React / Vue / Angular, etc. Then you could see the influence of audience shifts.

Sacha Greif Author

I don't think that achieves quite the same thing. I think the simplest solution is to have two datasets, one without any kind of identifiers for the general public and one with (secure) identifiers which we would only make available to data researchers who want to specifically do a cohort analysis if they get in touch with us.
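
One way the two-dataset idea could look, as a Python sketch. The field names and helper functions are hypothetical, and using a keyed HMAC as the "secure" identifier is an assumption made for this example, not a description of what the project does. A keyed hash keeps IDs stable across years while making them impossible to recompute without the private key.

```python
import hashlib
import hmac

# Hypothetical names of fields that must never reach the public dataset.
IDENTIFIER_FIELDS = {"email", "respondent_id"}


def public_row(row: dict) -> dict:
    """Public dataset: drop every identifier field."""
    return {k: v for k, v in row.items() if k not in IDENTIFIER_FIELDS}


def research_row(row: dict, secret_key: bytes) -> dict:
    """Restricted dataset: replace the email with a keyed, 1-way identifier."""
    out = public_row(row)
    normalized = row["email"].strip().lower().encode("utf-8")
    out["respondent_id"] = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return out


secret = b"hypothetical-private-key"  # would be stored privately, never published
row_2020 = {"email": "alice@example.com", "year": 2020, "uses_react": True}
row_2021 = {"email": "Alice@example.com", "year": 2021, "uses_react": False}

public_2020 = public_row(row_2020)
research_2020 = research_row(row_2020, secret)
research_2021 = research_row(row_2021, secret)
# Same respondent, same ID across years, so cohort analysis still works.
```

The cohort analysis only needs the IDs to match across years; it never needs the email itself.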

Magne

What would be the limitation? Unless you actually want to model the relationships between the influencers and their audiences, I don't see how you actually need to track personally identifiable information to track trends in demographics...

Sacha Greif Author

If we want to track how cohorts evolve over time then we should just track that in a secure manner; or not track it at all if we can't do it right. It just seems like a simpler approach than finding some other more "fuzzy" metric to use as a proxy.

Magne

The question is if it's really necessary to track 'cohorts' per se? With the security risk and disfavored UX it entails. If you can get a decent enough statistic from other more aggregate means.

Brian Peiris

I appreciate the disclosure and the transparency, and I sympathize with the incident. However, I don't see key steps in this post that would make me trust the survey going forward. To be blunt, the fact that you mistook a 2-way encryption for a hash makes me think that you do not have the security expertise to be responsible for this data.

The "Steps Taken" section still talks about mitigating the encryption mechanism. Is that the same 2-way encryption that caused the issue? Why isn't the first step to remove the encryption mechanism and replace it with a 1-way hash? If you still need to continue using keys, is there a better option for key management than simply making the repo private? The "Going Forward" section doesn't mention security improvements at all.

Before I'd trust the surveys again, I'd like to see you talk about third-party security audits, and how you're going to verify security-related contributions going forward.

Benjamin B

Thank you for the write up.

I don't think I received an email about this but a friend who also took the survey said he got an email about it. Were all participants emailed about the data breach?

Can you explain what a "ghost commit" is?

Sacha Greif Author

Yes, all participants were emailed. Maybe you unsubscribed from the mailing list in the past?

And as I understand it, that "ghost commit" was a commit that was not part of a branch or linked from anywhere on GitHub, but still independently accessible if you had the direct URL.

Milan Jaroš

You know you can use BFG, right?
docs.github.com/en/authentication/...

SteveALee

I've done it too, but luckily on a low-exposure system. I seem to recall finding a way to strip the orphan commit, but I probably had to recreate the GH repo. I also seem to recall GH added some checks for secrets, but I guess they're not foolproof.

We all make mistakes, so it's best to try to mitigate them, even at the expense of DX. E.g. tighten up access permissions so no rm -rf /, and don't use eval(), or otherwise make it hard to parse expressions that may contain unsanitised user input (e.g. JSX dangerouslySetInnerHTML).

Mikael Gramont

Thanks for the healthy handling of this issue.

Mikie O'Sullivan

I know others have said it, but really appreciate the transparency on this. You could have easily decided the risk was small and kept it to yourself. More people need to have your progress mindset :)

Aaron Goldenthal

Transparent disclosures are always appreciated. You may also want to look at tools like gitleaks to prevent secrets from being committed.
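
To give a feel for what such tools check, here is a deliberately naive Python sketch of a secret scan. It is not gitleaks and does not reflect how gitleaks works internally; the pattern, keywords, and function name are assumptions made for illustration, and a real scanner covers far more cases.

```python
import re

# Naive pattern: a name containing a sensitive keyword, then = or :,
# then a value of at least 8 non-space characters.
SECRET_ASSIGNMENT = re.compile(
    r"""(?:secret|token|password|api_?key|encryption_?key)\w*\s*[:=]\s*["']?([^\s"']{8,})""",
    re.IGNORECASE,
)


def find_suspected_secrets(text: str) -> list:
    """Return (line_number, line) pairs that look like hard-coded secrets."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SECRET_ASSIGNMENT.search(line):
            findings.append((number, line.strip()))
    return findings


# Hypothetical dotenv-style content about to be committed.
sample = "PORT=3000\nENCRYPTION_KEY=abcd1234efgh5678\n# just a comment\n"
findings = find_suspected_secrets(sample)
```

A check like this could run before each commit and block it when findings is not empty, at the cost of occasional false positives.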

Eric Burel

Yes we are setting it up: github.com/VulcanJS/vulcan-next/is...
It's not so obvious to set up though, and I still need to test whether it would actually have caught this particular leak, or more probable ones (e.g. leaks in dotenv files), which explains why this tool is not common enough

Pavel Litkin

Everything is fine if the lesson was learned :) Sometimes such things can happen to the best of us.

Eric Burel

Twitter thread if you have any additional questions: twitter.com/ericbureltech/status/1...