It's two o'clock, and I'm minding my own business, working on some bugs and listening to my favorite YouTube playlist... when I get a call on my phone. It's from the CEO.
"Hey, what's up man??" I say, a little hesitation in my voice, hoping everything is alright.
"Adam, we have a really weird thing going on right now [in the background, I can hear the DEAFENING sound of phones ringing off the hook]."
"OK, what's going on?"
"Well, log into [the app], and look at our incoming calls. We're getting HUNDREDS of calls, and it just started about ten minutes ago. What's going on?"
Well, I did log into the app, and he was right: we were getting HUNDREDS of inbound calls, and more were coming in by the second. This was INCREDIBLY unusual, as we never get more than one or two calls at a time, and NEVER more than one hundred calls a day. Something was definitely wrong.
"I don't know what's going on," I say, as my blood pressure and heart rate rise to levels I didn't know were possible. "I will get back to you as soon as I can with as much information as possible."
"OK, thanks. Please, as soon as possible."
We hang up.
This is the story of how we fought it off, how we survived, and what can be learned from the terrifying experience of someone or something potentially jeopardizing your system.
I run a development team that primarily works on the sales CRM for a pretty established company in the Midwestern United States. This is a sales platform where people fill out a form on our website, our Facebook page, etc. to say "I want to know more," and once they fill that out, they are pushed into our CRM to be called and/or texted. In addition, we run ads on TV and radio saying "Call this phone number," and once they do call it, they are put into our CRM as well. We use Twilio for our VoIP/SMS/MMS backbone.
This system is the backbone of how we sell our product and when it is in peril (as it was) every single person in the entire company is affected.
This problem had multiple angles, but at its core, some actor had somehow gotten enough control of our phone number to place calls from it to thousands of other phones, leaving missed calls on each of them. That was the problem.
And this is such an important part of combating and surviving an attack such as this. Understanding the difference between a problem and a symptom is so crucial.
The main symptom of the problem was that hundreds of people called our phone number back because they believed that we had called them, thereby overloading our front desk people and everyone else in the company who was using our system. While this was bad, it was not the problem itself, and telling the two apart is incredibly important in a moment like that one.
This problem is definitely not an easy one to fix, and time was working against us. We were receiving hundreds of incoming calls every second, ballooning our server logs as the minutes ticked by, and we were getting dangerously close to our app being taken offline, or worse.
I knew in my head that the main problem could not be fixed right then. There were too many steps and too much red tape to go through with Twilio to stop the main problem (ultimately fraud) right then and there, and so I quickly turned my attention to stopping the symptom.
I turned off the phone number.
This had a few ramifications.
- Most importantly, the incoming calls stopped.
- This was (and is) a phone number that runs in ads on radio stations. Somebody could potentially call that number, and we could miss out on a lead or a sale. Luckily, that phone number hadn't gotten a real call in over two months and so there wasn't much of a chance that this phone number was going to receive any real lead calls. But had it happened with a different phone number, we could have lost out on business.
- It didn't fix the problem, but it did fix the largest symptom, which led everyone to think "whew, the problem is fixed." My team and I knew that wasn't the case: the underlying problem hadn't been fixed at all.
My team and I have alerted a few Twilio channels (including their fraud detection line), with no response yet. We have reached out repeatedly and will continue to do so, but we may have to remove that number from our Twilio account altogether.
Sitting there about thirty minutes after it happened, I felt like I had aged ten years. I took it as a good teaching moment, especially for my junior developers, who had never seen something like this happen before, and here were the main takeaways I tried to communicate:
Gather as much information as possible, and then step away to get it done. It serves nobody to stay on the phone with someone who cannot give tangible technical help just so that they can get "up-to-the-nanosecond" updates. You as the engineer need space to reach a fundamental understanding of the problem, so get everyone who cannot help off the phone or out of the room, with the promise (and the implicit trust) that you will provide updates as they arise. This was the first thing I did: I got an understanding of the problem, said "I don't know what's going on, but I will get back to you as soon as I can with as much information as possible," and hung up.
Mitigate immediate threats. While not the main problem, the incoming calls were an immediate threat to the system, one that could have taken us offline and severely hurt the business. I don't care if the mitigation strategy is to turn off the servers. If you have to go nuclear, go nuclear. Luckily, in this case, mitigating the threat was as simple as turning off (or un-routing) the phone number that was being called.
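If turning a number off entirely is not an option, one alternative is to have the voice webhook reject calls to that number while everything else keeps routing. What follows is only a sketch of that idea, not what we actually did: the function name, the `DISABLED_NUMBERS` set, and the phone numbers are all made up for illustration. The only Twilio-specific parts are the TwiML `<Reject>` and `<Dial>` verbs.

```python
from xml.sax.saxutils import escape

# Hypothetical: numbers we have decided to take out of rotation during the incident.
DISABLED_NUMBERS = {"+15555550100"}

XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>'


def twiml_for_incoming_call(to_number: str, forward_to: str) -> str:
    """Build the TwiML a voice webhook would return for one inbound call.

    to_number  -- the number that was dialed (the "To" parameter of the webhook request)
    forward_to -- hypothetical destination for calls we still want to answer
    """
    if to_number in DISABLED_NUMBERS:
        # Reject the call outright so it never reaches the front desk or the CRM.
        return XML_HEADER + '<Response><Reject reason="busy"/></Response>'
    # Every other number keeps working as usual.
    return XML_HEADER + "<Response><Dial>" + escape(forward_to) + "</Dial></Response>"


if __name__ == "__main__":
    print(twiml_for_incoming_call("+15555550100", "+15555550199"))
    print(twiml_for_incoming_call("+15555550123", "+15555550199"))
```

The appeal of something like this is that it is reversible in seconds: remove the number from the set and calls flow again, without touching the rest of the system.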
Work backwards and eliminate places where the system is not failing. Draw out on a piece of paper or a whiteboard every point of action between the bad actor and your system. Then begin to eliminate the places that you can prove are not compromised. This serves two purposes.
One - To help you focus on what truly the problem is.
Two - To help you maintain confidence and trust in what you're working on, so that you don't begin to fall into an anxiety-induced panic, which is a very real thing that can happen in situations like this. Working through things like a checklist (checking off points of action that are not compromised) is therapeutic and really enables you to keep your head clear and focus on the problem, and not just on the fact that there is a problem.
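A whiteboard works fine for this, but the same checklist can live in a few lines of code if that is easier to share with the team. This is a minimal sketch with made-up names (`Checkpoint`, `clear`, `remaining_suspects`); the checkpoints mirror the ones from our incident, and the evidence strings are placeholders, not our real findings.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One point of action between the bad actor and the system."""
    name: str
    cleared: bool = False
    evidence: str = ""


def default_checkpoints() -> list:
    # Hypothetical list; draw your own from your architecture.
    return [
        Checkpoint("Twilio account access"),
        Checkpoint("Sales CRM access"),
        Checkpoint("Caller ID on the phone number"),
    ]


def clear(checkpoints: list, name: str, evidence: str) -> None:
    """Mark a checkpoint as proven not compromised, and record why."""
    for checkpoint in checkpoints:
        if checkpoint.name == name:
            checkpoint.cleared = True
            checkpoint.evidence = evidence
            return
    raise KeyError(name)


def remaining_suspects(checkpoints: list) -> list:
    """Return the names of everything that has not been ruled out yet."""
    return [c.name for c in checkpoints if not c.cleared]


if __name__ == "__main__":
    board = default_checkpoints()
    clear(board, "Twilio account access", "placeholder: no unexpected logins found")
    clear(board, "Sales CRM access", "placeholder: no unexpected sessions found")
    print(remaining_suspects(board))
```

Requiring an evidence string for every cleared item is the point: nothing comes off the board on a hunch, only on something you can show to someone else.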
Gather the correct people and hold a breakdown meeting. This part is so crucial for multiple reasons:
One - It is very important to maintain the trust of the clients you're building your product for. At the end of the day, the people you build for really just want to know that the people they entrust their business to are responsible enough to handle it. Maintaining an open communication is crucial to maintaining trust when things like this happen.
Two - It is very important that you gather everyone who needs to be there so that everyone who needs to gets the information at the same time. There is no reason you need to spend the rest of your day on the phone with everyone, explaining the same thing fifty times. Not only is that frustrating to you, but it is also frustrating to the person who thinks they should have known "first", but ends up being the seventh person to know.
Three - As long as you are honest about what happened, it allows everyone to be on the same page, with no questions left in people's heads. Being proactive about this meeting is how you accomplish this last piece. If you, and not your project manager, are the one asking for this meeting and putting it together, it builds a lot of trust in you as the engineer and shows that your main focus is getting the problem fixed, not covering your own back.
At the end of the day, it could have been a lot worse. We were able to prove that nobody gained unauthorized access to our Twilio account or our Sales CRM, and we have narrowed it down to a rogue actor committing fraud with one of our phone numbers. We are still waiting on confirmation and next steps from Twilio on that one, but that is our hypothesis right now.
All applications have been or will be hacked. If it hasn't happened to you yet, just wait. But it's not the end of the world. Stay calm, identify the problem, mitigate the threats, and learn from it. That's all we can do, and if we do that well, everything is gonna be alright.