loading...

I broke login for 80k users

rhymes profile image rhymes Updated on ・5 min read

Let me start by saying that I'm not happy about this :-D

The following is a post mortem of carelessness, naivety and bad changelogs and upgrade guides.

What happened

A client of mine has a mobile app with a login that uses OAuth2 Password Grant. It's an app both for Android and iOS.

We started receiving complaints about users that couldn't login so I went into "investigative mode" (which lasted the entire day until now).

The app is the "classic" app nobody really cares about made by numerous system integrators that talks to many servers, among them this API I wrote for the login part. I honestly have no idea how the mobile app works internally, I've never seen the code but I wrote the server side API and it has worked for many years.

The Rails server also has been successfully migrated across all 4.x versions and to 5.2, still the tests we did always worked. The Rails server is the backend of various different mobile apps sharing common functionality. Definitely pre Firebase or App Sync :D

Going back to the main issue: some users couldn't login. I started talking to the mobile developer that inherited the code of the app (he's not the original coder) and we discussed the situation.

Note: this is many weeks after the first complaints. August is a dead month in Italy. The mobile team was on holiday, the system integrator was on holiday, my client's client was on holiday. Probably also the poor users were on holiday and trying to use the app from there.

At the same time other parts of the app stopped working. Again, I have no idea how it actually works, I only see the API logs of the traffic they make on one server but I know the code more or less by memory and there was definitely no reason the rest of the app had anything to do with this functionality. Well unless there's poor exception handling and flows I'm not privy about but that's total speculation.

I asked the usual questions: did you recently released a new version? Can we have a log of the actions of the user (which they don't have, a story for another day) and so on.

After receiving an account on the help desk app (yay) we noticed a pattern: only Android devices were failing: potentially 80 thousand users, in reality they received 400 complaints. Still, a lot but it could have been way worse. I guess even users are chill about this app.

Sherlock moment

I then downloaded 40 days of logs and started using ripgrep (written in Rust, best grep tool ever with look ahead and look behind) to find clues about what was going on. I found anomalies in the the pattern of API calls made by the REST clients.

After that I convinced the mobile developer to start debugging the app.

I also noticed that the day before the first emails I released an update for doorkeeper, Rails OAuth2 server (we'll get back to this later). I reverted that but still... it didn't work.

I asked the mobile developer to show me the login code (I haven't read Java code in years :D). After a few minutes I discovered a bug in the client: if the refresh token is expired the code raises an exception but fails to login again.

We decided this could have been the case and ask one of the customers to reinstall the app, confident that a new set of oauth tokens would do the trick. The customer comes back a few minutes later saying it doesn't work. The fact that the mobile team doesn't even use this app tells you a lot, we had to ask a customer through the help desk to reproduce the error...

I start to sweat a little. In the meantime iOS goes on without a breeze. I know for sure that my Rails app doesn't make any distinction between platforms. A REST client is just a REST client, regardless of platform.

I go back to Doorkeeper's GitHub. I start re-reading the changelog, the upgrade guide (even though I reverted the version in the meantime) and the issues and a couple of things starts to feel weird:

doorkeeper 4.4.0

which points to issue 1120 a backport of a security issue about the ability to revoke tokens.

I think "well, what does it have to do with me?", we're not having issues with revoking access tokens anyway. Also, what is this "confidential" thing they are warning about all of a sudden?

In the upgrade I notice another clue:

doorkeeper 4x to 5.0

Doorkeeper::Application now has a new boolean column named confidential that is true by default and has NOT NULL CONSTRAINT. This column is required to allow creating Public & Private Clients as mentioned in Section 8.5 of draft-ietf-oauth-native-apps-12 of OAuth 2 RFC. If you are migrating from Doorkeeper <= 5.0, then you can easily add this column by generating a proper migration file using the following command: rails g doorkeeper:confidential_applications.

Please tell me that is totally unclear to you too.

I end up skimming the OAuth2 RFC mentions because nothing in the changelog and in the upgrapde guide is clear enough.

In the meantime the mobile developer sends me the the API call the Android app is doing, in its curl version.

DING DING DING.

I instantly notice something: they are not sending the client secret. Wait what? They do a git blame on the code and tell me "the client secret part has been commented out for two years".

Then I'm sure the issue is somewhere on the server. That is apparently the only difference between Android and iOS.

DING DING DING.

This is when I get the intuition. The Doorkeeper upgrade guide says a column named confidential that is true by default. I open the production console, flip the flag to false and BOOM. Everything works.

Conclusion

  • The implications of the flag confidential are totally unclear from reading the changelog. Especially because the fix talks about revocation, not about making the client secret mandatory
  • I should have spent way more time reading the diff (I didn't because I trusted the changelog)
  • The flag confidential should probably be false by default
  • Please please please Flutter make all of this nonsense of writing the same app twice with different bugs go away

As I said in the beginning the fact we took so long to figure this out is a combination of carelessness (I applied the security fix withouth reading the diff, just the changelog), naivety (I assumed the security fix about revocation was innocuous and I didn't test it in depth) and the fact that this app is definitely not a priority for anybody.

I have no moral of the story, I opened an issue on Doorkeper's GitHub: https://github.com/doorkeeper-gem/doorkeeper/issues/1142

Update from 12th September 2018: the issue ticket resulted in a clearer explanation in the wiki and changelog, yay :-)

Posted on by:

rhymes profile

rhymes

@rhymes

Software developer @ Forem

Discussion

markdown guide
 

Let me tell you a story about breaking the login page. About 10 years ago, two of my colleagues at that time were supposed to delete around 10 test users from database, we had no clue about soft deletes (nor foreign keys for that matter) at that time so they decided to do a serious delete... I am sure you know where this is going. Both of them stared at the screen and confirmed multiple times that the delete clause is bullet proof, hit execute. And deleted 50k users from database!

Now here comes the fun part, the ONLY backup we had at that time was 3 days old on my laptop. So in the end they still partially saved the day and we lost 3 days worth of new users

 

ahahahah two human beings being sure that the change is safe. That's definitely a sort of an "echo chamber" right there :D

 

And that's exact reason why you run select first with the same condition to see what will be deleted. 🤔

 

Yeah, this is why I stick with firebase!

 

Firebase wasn't even around when we implemented that :-D

At the time there were also a series of issues about ownership and privacy of user's data.

 

sights this youngsters and their services in "the cloud"....

Just remembering an anecdote not unlike yours, when a friend deleted our whole user data. I was talking to some junior developers about it and one of them looked at me and said: "Why didn't you just kept your users in our PaaS service like we do now?"

 

Got me cracking right here:

Note: this is many weeks after the first complaints. August is a dead month in Italy. The mobile team was on holiday, the system integrator was on holiday, my client's client was on holiday. Probably also the poor users were on holiday and trying to use the app from there.

Thanks, what a day maker.

 
 

Hi Rhymes! I just read your post here : github.com/doorkeeper-gem/doorkeep...
I have some followup questions though. You mentioned that the android app is not sending the client secret. In what situations is that possible? Are you using implicit grant type in this case?

 

I'm sorry, I'm not familiar with the app internals anymore. I don't remember how that happened :)