Yup, a grandiose title, but something to think about and try for when it makes sense.
What I mean by "self-healing code" is code written so that, when a problem occurs, it automatically reacts in a way that keeps the current user unaware of the problem and prevents future users from triggering it.
A pretty common pattern that does something like this is the Circuit Breaker, although I suggest taking it further. A circuit breaker simply returns an error once it trips; self-healing code ideally does something more helpful for the user.
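For reference, a minimal circuit breaker sketch in plain Ruby. This is an illustration only; the threshold and cooldown values are arbitrary, and a production breaker would also want half-open probing and proper thread safety.

```ruby
# Minimal circuit breaker sketch (illustration only).
# After `threshold` consecutive failures the breaker "opens" and fails fast;
# once `cooldown` seconds pass, one call is let through to probe recovery.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 5, cooldown: 30)
    @threshold = threshold
    @cooldown  = cooldown
    @failures  = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit is open" if open?

    result = yield
    @failures = 0                 # a success resets the failure count
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end

  private

  def open?
    return false unless @opened_at
    return true if Time.now - @opened_at < @cooldown

    @opened_at = nil              # cooldown elapsed: allow a probe call
    @failures  = 0
    false
  end
end

# Usage (hypothetical request):
# breaker = CircuitBreaker.new(threshold: 3, cooldown: 60)
# breaker.call { some_flaky_api_request }
```

Notice that once the breaker opens, the caller gets nothing but an error. The rest of this post is about doing better than that.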
Let's say you are writing code to find an optimal route between two points. You have a trivial solution that you wrote yourself, but then you find a super-duper-always-perfect routing API online. Now suppose that API can be unstable and doesn't always return a result, especially under load.
A self-healing solution could be to fall back to your trivial solution when you detect a failure. In addition, your application could remember that the super-duper solution is having problems and maybe not send requests its way until a cooldown period has passed.
Your customers still get a route. Maybe not the best route, but some route is likely better than no route. Additionally, except for the first failed call, the remaining calls don't waste time going to a broken API, so the response to your customer is faster. Finally, the super-duper solution is given a break to recover. Eventually you start calling it again and all is good. This is basically the concept of Graceful Degradation. Graceful degradation fits into what I'm thinking, but what if there are scenarios where, after an error, you can still return the exact result to the user rather than a degraded one like the not-quite-best route above? That is the ultimate dream of self-healing code.
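Before moving on, a rough sketch of that fallback-plus-cooldown idea in plain Ruby. `fetch_optimal_route` and `trivial_route` are hypothetical stand-ins for the external API and the in-house fallback, and the five-minute cooldown is an arbitrary choice.

```ruby
# Graceful degradation sketch: prefer the external routing API, fall back to
# the trivial solution on failure, and remember the failure so we stop
# hammering the API until a cooldown has passed. Illustration only.
class RouteFinder
  COOLDOWN = 5 * 60 # seconds to leave the flaky API alone after a failure

  def initialize
    @api_broken_until = nil
  end

  def route(origin, destination)
    if api_available?
      begin
        return fetch_optimal_route(origin, destination) # the super-duper API (hypothetical)
      rescue StandardError
        @api_broken_until = Time.now + COOLDOWN         # remember that it is struggling
      end
    end

    trivial_route(origin, destination)                  # our own not-quite-best route (hypothetical)
  end

  private

  def api_available?
    @api_broken_until.nil? || Time.now >= @api_broken_until
  end
end
```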
Here is another example that I actually went through that doesn't fit the Circuit Breaker pattern and goes further than Graceful Degradation. This led to my thinking on self-healing code.
Since HTTP GET requests should only be doing reads from the database, we figured we could easily distribute traffic between our primary database and our read-replica database by automatically sending DB reads from GET requests to the read-replica.
The problem. We discovered that we were actually writing to the database in our GET requests. Not all of them, but enough to make it an issue. We decided to fix the GET requests to do the right thing so we could go forward with this plan.
The problem. There were enough GETs that wrote to the DB to make it too large of an effort to fix them all. The benefits of the project wouldn’t balance the costs.
The insight. We could keep a "skip list" of GET routes that do a write to the DB. Then, we could automatically send GET requests to the read-replica database unless they are in the skip list.
The problem. Again, we have many GETs that write to the database and no easy search patterns that would assure us that we could identify them all in our codebase.
The self-healing insight: We can default to sending all GET requests to the read-replica database. If a write happens within the processing of that request, it will error out since it can't write to the read-only replica database. Then, we can detect that error and re-run the full request against the primary database. The user will be oblivious to the problem except for a slightly longer response time. The self-healing part is that along with re-running the request, we record this route into the skip-list. Now at most one user (roughly, threading complexities aside) will see a delayed response. All other users will automatically just go to the primary database because the route is on the skip-list.
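Here is a Rack-middleware-style sketch of that idea. It's an illustration rather than our actual implementation: `with_replica` is a hypothetical helper standing in for whatever your framework uses to direct queries at the read replica (in Rails 6+ it could wrap `ActiveRecord::Base.connected_to(role: :reading)`), `ReadOnlyWriteError` stands in for whatever error your adapter raises on a write to a read-only replica, and the skip list lives in memory, whereas a shared store would let every app server learn from the first failure.

```ruby
require "set"

# Sketch of the self-healing skip-list middleware (illustration only).
# GET requests default to the read replica; a write raises, we record the
# route in the skip list, and we transparently re-run the request against
# the primary so the user still gets a normal response.
class ReplicaRouter
  class ReadOnlyWriteError < StandardError; end # placeholder for your adapter's error

  @skip_list = Set.new
  @mutex     = Mutex.new

  class << self
    attr_reader :skip_list, :mutex
  end

  def initialize(app)
    @app = app
  end

  def call(env)
    # Non-GETs go straight through on the default (primary) connection.
    return @app.call(env) unless env["REQUEST_METHOD"] == "GET"

    # Keyed by raw path for simplicity; keying by the resolved route pattern
    # would let one failure cover every URL hitting a parameterized route.
    route = "GET #{env['PATH_INFO']}"

    # Known-offending GETs also stay on the primary.
    return @app.call(env) if skip?(route)

    begin
      with_replica { @app.call(env) }   # optimistic default: the read replica
    rescue ReadOnlyWriteError
      remember(route)                   # self-healing: future requests skip the replica
      @app.call(env)                    # re-run this request against the primary
    end
  end

  private

  def skip?(route)
    self.class.mutex.synchronize { self.class.skip_list.include?(route) }
  end

  def remember(route)
    self.class.mutex.synchronize { self.class.skip_list.add(route) }
  end

  # Hypothetical connection switch; in Rails 6+ this could wrap
  # ActiveRecord::Base.connected_to(role: :reading) { yield }.
  def with_replica
    yield
  end
end
```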
The extra win. This becomes a comprehensive list of routes that need fixing. As we fix the routes, we can remove them from the skip-list.
This let us start seeing benefits from the work to move traffic to the read-replica database immediately. We can focus on fixing the most common requests, which give the biggest lift, and deprioritize requests so rare they are used maybe a handful of times a day. We'll fix them eventually, because writing to the database on a GET is just wrong, but we don't have to fix every single bad call before our database can breathe a sigh of relief.
In the end, this was a big win, and this way of thinking can likely be applied in many other places. The concept of letting the code both gracefully detect an error and find another way of solving the problem is huge. Coupling this with the code remembering the error so it doesn't keep trying takes it to the next level. This can be leveraged in all sorts of refactoring efforts, particularly complicated cross-cutting concerns. Keep this in your back pocket! Any time you can break a big-bang solution into small bites, it is almost always worth the effort to do so.
About Jobber
We're hiring for remote positions across Canada at all software engineering levels!
Our awesome Jobber technology teams span across Payments, Infrastructure, AI/ML, Business Workflows & Communications. We work on cutting edge & modern tech stacks using React, React Native, Ruby on Rails, & GraphQL.
If you want to be a part of a collaborative work culture, help small home service businesses scale and create a positive impact on our communities, then visit our careers site to learn more!
Top comments (8)
As good as this way of solving a problem sounds (and it sounds very good indeed), it's worth remembering that this kind of thing is only really an option at organisations with good operational discipline. Otherwise you'll just end up with a bunch of "temporary bridging hacks" that could in theory be used to fix the root cause but never are.
Very true. We certainly have room to grow, but I will say we do a pretty good job of prioritizing this sort of thing against features. Feature work is critical, but we all know that tech debt can grind a team to a halt if not managed.
Simply amazing. This is food for thought. Self-healing code is an underrated topic, I guess, and whenever people do talk about it, it gets complicated because they talk about the AI angle of it, but this is something new and interesting. Also, I was wondering whether what you said in this article is limited to certain very specific kinds of errors. For eg:
Agreed, generalizing this could be tough. I do like your idea of an abstract interface that emits an event. But yes, I believe the developer needs to have a solid understanding of the cause of the error and the implications of re-running (in this case) or, more generally, recovering. So this likely needs to be implemented on a case-by-case basis.
What is your opinion on caching? Wouldn't a cache help in this circumstance?
Caching is a good thing (except when it isn't :) ). We use caching heavily. Increasing cache size would have a positive effect (and be simpler), but completely splitting out all read traffic was expected to have a much bigger impact and also allow for better long term scaling.
Wow, that was mind-blowing. Thanks a lot, will put this article in my newsletter :)
An elegant solution. You have me looking at my GETs now.