DEV Community

Dawid Kędzierski


The quick story of a Crash-only System

Some time ago I was working on a project that checked all the boxes for being absolutely awful. One of the biggest obstacles was that every single user action emitted hundreds of events. Obviously, this caused many unpredictable issues. Issues which should have been handled properly. Or eliminated. Both options mean laborious effort, often costly and risky for the project. What if there was a third path?

In that project we had found that one functionality accounted for 20% of all bugs. Just one! Moreover, almost any attempt to use it ended in an issue, big or small. A lot was done to improve it. Many talented developers had to admit defeat. And, what’s even worse, every single attempt by every single developer left various artifacts behind in the source code, which added fuel to the fire by increasing its complexity.

Still, we couldn’t give up. It was producing too many issues.

The functionality was characterized, among other things, by refreshing the whole application, almost rendering it anew. Following that lead, we measured the average cycle time and added the initial bootstrapping time from the first application run. The total came to 500ms. That’s a lot, and I mean a lot a lot, so we started looking for possible improvements. Our goal was ambitious: to get below 30ms, the figure we took as the maximum permissible macro-task time in Chrome. We reached it a few weeks later.

Then it was time for phase two: each time an error occurred in the functionality, we wanted to turn the system off and on again, resetting the whole state of the application, errors included. We could do that because the application didn’t require persistent state. If it takes less than 30ms, a user won’t even notice, right? It quickly turned out that 30ms is long enough to notice a “blink”, so we used a trick known from OpenGL, among others, usually called double buffering: we loaded the next frame in the background of the old one and then switched them. In other words, the new instance of the application hid behind the old one until it was ready to be displayed. The whole operation took slightly more than 30ms.
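To make the swap more concrete, here is a minimal TypeScript sketch of the idea. It is not the project's actual code: `AppInstance`, `DoubleBufferedApp`, and their methods are hypothetical names, and mounting is assumed to be synchronous to keep the example short.

```typescript
// Hypothetical interface: the real application's API is not shown in this post.
interface AppInstance {
  mount(): void;   // build the instance while it is still hidden
  show(): void;    // make the instance visible
  destroy(): void; // tear the instance down together with its state
}

class DoubleBufferedApp {
  private front: AppInstance;
  private readonly createInstance: () => AppInstance;

  constructor(createInstance: () => AppInstance) {
    this.createInstance = createInstance;
    this.front = createInstance();
    this.front.mount();
    this.front.show();
  }

  // Restart without a visible "blink": prepare the replacement behind the
  // current instance and swap only once it is ready.
  restart(): void {
    const back = this.createInstance();
    back.mount();         // the old instance is still on screen here
    back.show();          // the new instance takes over
    this.front.destroy(); // the old instance and any broken state are discarded
    this.front = back;
  }
}
```

The point of the ordering is that the old instance is destroyed last, so there is never a moment with nothing on screen.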

A few weeks later it turned out that we had implemented a Crash-only System completely unintentionally. Quoting Wikipedia: “Crash-only software refers to computer programs that handle failures by simply restarting, without attempting any sophisticated recovery.” Isn’t it mind-blowing? Instead of spending hundreds of hours trying to develop proper error handling, never being sure that all the cases are covered, we can approach the problem from a different angle - assume that if there’s an error, the service will be able to reboot quickly enough not to interfere with the proper functioning of the system as a whole.
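In code, the crash-only idea boils down to something like the sketch below. All names (`AppState`, `freshState`, `CrashOnlyRunner`) are made up for illustration, and it relies on the same assumption as our project: the state does not need to persist, so throwing it away is safe.

```typescript
// Hypothetical state shape; a real application would hold far more.
type AppState = { counter: number };

function freshState(): AppState {
  return { counter: 0 };
}

class CrashOnlyRunner {
  private state: AppState = freshState();
  restarts = 0;

  // Run one action. On any error, attempt no recovery at all:
  // discard the possibly corrupted state and start from scratch.
  run(action: (state: AppState) => void): void {
    try {
      action(this.state);
    } catch {
      this.state = freshState();
      this.restarts += 1;
    }
  }

  // Return a copy so callers cannot mutate the internal state.
  snapshot(): AppState {
    return { ...this.state };
  }
}
```

Note that there is a single `catch` branch and it does not care what went wrong, which is exactly what removes the need to enumerate every failure case.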

But we went one step further! Based on statistical data, we quickly calculated the probability of a bug occurring within the functionality. The probability was so high that we decided to simplify the new approach even further - instead of maintaining the source code responsible for detecting the moment a bug occurred and restarting the system, we assumed that the problem would occur for sure, so we simply reboot the application immediately.
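A rough sketch of this simplification, again with invented names (`FeatureState`, `newFeatureState`, `AlwaysRestartingApp`) rather than our real code, could look like this. The difference from an error-triggered restart is that nothing inspects the outcome: the state is rebuilt every time the feature is used.

```typescript
// Hypothetical state shape used only for this illustration.
type FeatureState = { items: string[] };

function newFeatureState(): FeatureState {
  return { items: [] };
}

class AlwaysRestartingApp {
  private state: FeatureState = newFeatureState();
  restarts = 0;

  useFeature(feature: (state: FeatureState) => void): void {
    try {
      feature(this.state);
    } finally {
      // No detection logic: rebuild the state whether or not anything failed.
      // An error thrown by the feature still propagates to the caller.
      this.state = newFeatureState();
      this.restarts += 1;
    }
  }

  itemCount(): number {
    return this.state.items.length;
  }
}
```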

I know what you’re thinking. Everything here is one big lie, because in fact we didn’t fix anything; we only hid a problem we couldn’t solve, right? Having said that, the fact is that: (1) we significantly boosted application performance; (2) from a technical point of view the system now works properly, because it meets the users’ expectations; (3) over the course of a few weeks, we reduced the overall number of bugs by 20%. And the solution, although it might seem surprising at first, turned out to be trivial and well known to everyone - the only thing we needed was to turn it off and on again.
