Yup, a grandiose title, but something to think about and try for when it makes sense.
What I mean by "self-healing code" is code written so that, when a problem occurs, it automatically reacts in a way that keeps the current user unaware of the problem and prevents future users from triggering it.
A pretty common pattern that does something like this is the Circuit Breaker, although I suggest taking it further. A circuit breaker simply returns an error once it trips; self-healing code ideally does something more helpful for the user.
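For reference, a minimal circuit breaker sketch in plain Ruby. This is an illustration only; the threshold and cooldown values are arbitrary, and a production breaker would also want half-open probing and proper thread safety.

```ruby
# Minimal circuit breaker sketch (illustration only).
# After `threshold` consecutive failures the breaker "opens" and fails fast;
# once `cooldown` seconds pass, one call is let through to probe recovery.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 5, cooldown: 30)
    @threshold = threshold
    @cooldown  = cooldown
    @failures  = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit is open" if open?

    result = yield
    @failures = 0                 # a success resets the failure count
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end

  private

  def open?
    return false unless @opened_at
    return true if Time.now - @opened_at < @cooldown

    @opened_at = nil              # cooldown elapsed: allow a probe call
    @failures  = 0
    false
  end
end

# Usage (hypothetical request):
# breaker = CircuitBreaker.new(threshold: 3, cooldown: 60)
# breaker.call { some_flaky_api_request }
```

Notice that once the breaker opens, the caller gets nothing but an error. The rest of this post is about doing better than that.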
Let's say you are writing code to find an optimal route between two points. You have a trivial solution that you wrote yourself, but then you find a super-duper-always-perfect routing API online. Now suppose that API can be unstable and doesn't always return a result, especially under load.
A self-healing solution could be to fall back to your trivial solution when you detect a failure. In addition, your application could remember that the super-duper solution is having problems and maybe not send requests its way until a cooldown period has passed.
Your customers still get a route. Maybe not the best route, but some route is likely better than no route. Additionally, except for the first failed call, the remaining calls don't waste time going to a broken API, so the response to your customer is faster. Finally, the super-duper solution is given a break to recover. Eventually you start calling it again and all is good. This is basically the concept of Graceful Degradation. Graceful degradation fits into what I'm thinking, but what if there are scenarios where, after an error, you can still return the exact result to the user rather than a degraded one like the not-quite-best route above? That is the ultimate dream of self-healing code.
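Before moving on, a rough sketch of that fallback-plus-cooldown idea in plain Ruby. `fetch_optimal_route` and `trivial_route` are hypothetical stand-ins for the external API and the in-house fallback, and the five-minute cooldown is an arbitrary choice.

```ruby
# Graceful degradation sketch: prefer the external routing API, fall back to
# the trivial solution on failure, and remember the failure so we stop
# hammering the API until a cooldown has passed. Illustration only.
class RouteFinder
  COOLDOWN = 5 * 60 # seconds to leave the flaky API alone after a failure

  def initialize
    @api_broken_until = nil
  end

  def route(origin, destination)
    if api_available?
      begin
        return fetch_optimal_route(origin, destination) # the super-duper API (hypothetical)
      rescue StandardError
        @api_broken_until = Time.now + COOLDOWN         # remember that it is struggling
      end
    end

    trivial_route(origin, destination)                  # our own not-quite-best route (hypothetical)
  end

  private

  def api_available?
    @api_broken_until.nil? || Time.now >= @api_broken_until
  end
end
```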
Here is another example that I actually went through that doesn't fit the Circuit Breaker pattern and goes further than Graceful Degradation. This led to my thinking on self-healing code.
Since HTTP GET requests should only be doing reads from the database, we figured we could easily distribute traffic between our primary database and our read-replica database by automatically sending DB reads from GET requests to the read-replica.
The problem. We discovered that we were actually writing to the database in our GET requests. Not all of them, but enough to make it an issue. We decided to fix the GET requests to do the right thing so we could go forward with this plan.
The problem. There were enough GETs that wrote to the DB to make it too large of an effort to fix them all. The benefits of the project wouldn’t balance the costs.
The insight. We could keep a "skip list" of GET routes that do a write to the DB. Then, we could automatically send GET requests to the read-replica database unless they are in the skip list.
The problem. Again, we have many GETs that write to the database and no easy search patterns that would assure us that we could identify them all in our codebase.
The self-healing insight: We can default to sending all GET requests to the read-replica database. If a write happens within the processing of that request, it will error out since it can't write to the read-only replica database. Then, we can detect that error and re-run the full request against the primary database. The user will be oblivious to the problem except for a slightly longer response time. The self-healing part is that along with re-running the request, we record this route into the skip-list. Now at most one user (roughly, threading complexities aside) will see a delayed response. All other users will automatically just go to the primary database because the route is on the skip-list.
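Here is a Rack-middleware-style sketch of that idea. It's an illustration rather than our actual implementation: `with_replica` is a hypothetical helper standing in for whatever your framework uses to direct queries at the read replica (in Rails 6+ it could wrap `ActiveRecord::Base.connected_to(role: :reading)`), `ReadOnlyWriteError` stands in for whatever error your adapter raises on a write to a read-only replica, and the skip list lives in memory, whereas a shared store would let every app server learn from the first failure.

```ruby
require "set"

# Sketch of the self-healing skip-list middleware (illustration only).
# GET requests default to the read replica; a write raises, we record the
# route in the skip list, and we transparently re-run the request against
# the primary so the user still gets a normal response.
class ReplicaRouter
  class ReadOnlyWriteError < StandardError; end # placeholder for your adapter's error

  @skip_list = Set.new
  @mutex     = Mutex.new

  class << self
    attr_reader :skip_list, :mutex
  end

  def initialize(app)
    @app = app
  end

  def call(env)
    # Non-GETs go straight through on the default (primary) connection.
    return @app.call(env) unless env["REQUEST_METHOD"] == "GET"

    # Keyed by raw path for simplicity; keying by the resolved route pattern
    # would let one failure cover every URL hitting a parameterized route.
    route = "GET #{env['PATH_INFO']}"

    # Known-offending GETs also stay on the primary.
    return @app.call(env) if skip?(route)

    begin
      with_replica { @app.call(env) }   # optimistic default: the read replica
    rescue ReadOnlyWriteError
      remember(route)                   # self-healing: future requests skip the replica
      @app.call(env)                    # re-run this request against the primary
    end
  end

  private

  def skip?(route)
    self.class.mutex.synchronize { self.class.skip_list.include?(route) }
  end

  def remember(route)
    self.class.mutex.synchronize { self.class.skip_list.add(route) }
  end

  # Hypothetical connection switch; in Rails 6+ this could wrap
  # ActiveRecord::Base.connected_to(role: :reading) { yield }.
  def with_replica
    yield
  end
end
```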
The extra win. This becomes a comprehensive list of routes that need fixing. As we fix the routes, we can remove them from the skip-list.
This let us start seeing benefits from the work to move traffic to the read-replica database immediately. We can focus on fixing the most common requests, which give the biggest lift, and deprioritize requests so rare they are used maybe a handful of times a day. We'll fix them eventually, because writing to the database on a GET is just wrong, but we don't have to fix every single bad call before our database can breathe a sigh of relief.
In the end, this was a big win, and this way of thinking can likely be applied in many other places. The concept of letting the code both gracefully detect an error and find another way of solving the problem is huge. Coupling this with the code remembering the error so it doesn't keep trying takes it to the next level. This can be leveraged in all sorts of refactoring efforts, particularly complicated cross-cutting concerns. Keep this in your back pocket! Any time you can break a big-bang solution into small bites, it is almost always worth the effort to do so.
About Jobber
We're hiring for remote positions across Canada at all software engineering levels!
Our awesome Jobber technology teams span across Payments, Infrastructure, AI/ML, Business Workflows & Communications. We work on cutting edge & modern tech stacks using React, React Native, Ruby on Rails, & GraphQL.
If you want to be a part of a collaborative work culture, help small home service businesses scale and create a positive impact on our communities, then visit our careers site to learn more!
Top comments (8)
As good as this way of solving a problem sounds (and it sounds very good indeed), it's worth remembering that this kind of thing is only really an option at organisations with good operational discipline. Otherwise you'll just end up with a bunch of "temporary bridging hacks" that could in theory be used to fix the root cause but never are.
Very true. We certainly have room to grow, but I will say we do a pretty good job of prioritizing this sort of thing against features. Feature work is critical, but we all know that tech debt can grind a team to a halt if not managed.
Simply amazing. This is food for thought. Self-healing code is an underrated topic, I guess, and whenever people do talk about it, it gets complicated because they talk about the AI angle of it, but this is something new and interesting. Also, I was wondering whether what you said in this article is limited to certain very specific kinds of errors. For eg:
Agreed, generalizing this could be tough. I do like your idea of an abstract interface that emits an event. But yes, I believe the developer needs to have a solid understanding of the cause of the error and the implications of re-running (in this case) or, more generally, recovering. So this likely needs to be implemented on a case-by-case basis.
What is your opinion on caching? Wouldn't a cache help in this circumstance?
Caching is a good thing (except when it isn't :) ). We use caching heavily. Increasing cache size would have a positive effect (and be simpler), but completely splitting out all read traffic was expected to have a much bigger impact and also allow for better long term scaling.
Wow, that was mind-blowing. Thanks a lot, will put this article in my newsletter :)
An elegant solution. You have me looking at my GETs now.