LitmusChaos

How the Resilience Score Algorithm works in Litmus!

Sayan Mondal
A coffee lover ☕ and explorer 🌏. In my free time I like to write Code and help the community out.

What really is Resilience? For practitioners, psychologists, and others, Resilience is the process of adapting well in the face of adversity, trauma, tragedy, threats, or significant sources of stress. For an SRE or a Chaos Engineer, however, Resilience can be defined as the ability of a system to fail gracefully in the face of disruptive events and eventually recover from them.

Litmus

Litmus is a Cross-Cloud Chaos Orchestration framework for practising chaos engineering in cloud-native environments. Litmus provides a chaos operator, a large set of chaos experiments on its hub, detailed documentation, and a friendly community.

Find the video format of this blog here:

In this blog, we'll take a deep dive into how the Resilience Score is calculated for your Workflows in Litmus and also understand the concept of Weights.


Weights

Weights in Litmus

You might have seen the term weightage pop up when you construct your Workflows in Litmus. Typically attached to Resilience, these weights play an important role in determining the appropriate Resilience Score for your use case.

Giving a weightage to your experiment is a way of signifying the importance (priority) of that experiment in your workflow. The higher the weight, the more important the experiment.

For instance, consider an example where you have two Chaos Experiments, Pod Delete and Pod Network Loss, in your workflow. Imagine a use case where you cannot tolerate a network loss, but the eviction of a pod does not bother you much because you are confident that the pod will respawn. You still want to test for pod eviction as a part of your Chaos Test Suite.

In such a scenario, the Pod Delete Chaos Experiment doesn't hold much importance for you, but Pod Network Loss does.

The weight priority is generally divided into three sections:

  • 0-3: Low Priority
  • 4-6: Medium Priority
  • 7-10: High Priority

Therefore, in this scenario, you would assign Pod Delete a Low Priority weightage, whereas Pod Network Loss would fall in the High Priority category.
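As a rough illustration of the bands listed above, here is a minimal Python sketch that maps a weight to its priority label. The function name `weight_priority` is hypothetical and not part of Litmus, and the weights 2 and 10 are just sample values for the two experiments in our scenario.

```python
def weight_priority(weight: int) -> str:
    """Map an experiment weight to the priority band from the list above.

    Hypothetical helper for illustration only; not a Litmus API.
    """
    if not 0 <= weight <= 10:
        raise ValueError("weight must be between 0 and 10")
    if weight <= 3:
        return "Low Priority"
    if weight <= 6:
        return "Medium Priority"
    return "High Priority"


# Sample weights for the scenario: Pod Delete matters little, Pod Network Loss a lot.
print("Pod Delete:", weight_priority(2))
print("Pod Network Loss:", weight_priority(10))
```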

Now we know why we set the weights the way we do. As we progress further in the blog, we'll also see how they actually come into play in determining the Resilience Score.



Resilience Score

A Resilience Score is a measure of how resilient your workflow is, considering all the chaos experiments and their individual results. This calculation takes into account the individual experiment weights (from a range of 1-10), which are relative to each other.

Once a weight has been assigned to an experiment, we look at the Probe Success Percentage for that experiment (post chaos). The total resilience result for the experiment is the product of the weight given and the probe success percentage returned after the Chaos Run.

Total Resilience for one single experiment = (Weight Given to that experiment * Probe Success Percentage)

If an experiment doesn't have a probe in it, the probe success percentage returned is either 0 or 100, based on the experiment verdict: 100 if the experiment passed, 0 otherwise.
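The per-experiment rule, including the no-probe fallback, can be sketched in Python as follows. This is only an illustration of the formula described here, not Litmus's actual implementation; the function and parameter names are assumptions made for the example.

```python
from typing import Optional


def experiment_resilience(
    weight: int,
    passed: bool,
    probe_success_pct: Optional[float] = None,
) -> float:
    """Total resilience for one experiment = weight * probe success percentage.

    Hypothetical helper. If the experiment has no probe
    (probe_success_pct is None), fall back to the verdict:
    100 when the experiment passed, 0 otherwise.
    """
    if probe_success_pct is None:
        probe_success_pct = 100.0 if passed else 0.0
    return weight * probe_success_pct


# An experiment with no probe that passed, with a weight of 10.
print(experiment_resilience(10, passed=True))   # 1000.0
# An experiment with no probe that failed, with a weight of 2.
print(experiment_resilience(2, passed=False))   # 0.0
```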

The Final Resilience Score is calculated by dividing the total test result (the sum of the total resilience of every experiment) by the sum of the weights of all the experiments in a single workflow.

Let's take our scenario above again. Suppose we have given a weightage of 2 to Pod Delete and 10 to Pod Network Loss; this is what the Resilience Calculation would look like.

[Figure: Resilience Calculation, considering a Probe Success Percentage of 100]

Here is why weights play such an important role in your use-case-specific Resilience Calculation. In the scenario mentioned above, even if Pod Delete failed for whatever reason, your Resilience Score would drop only modestly, from 100% to 83.33%.

Resilience Score = Total Test Result / Weight Sum
                 = ((2 * 0) + (10 * 100)) / 12
                 = 1000 / 12
                 = 83.33%
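The calculation above can be reproduced with a few lines of Python. This is a minimal sketch of the formula as described in this blog, not the code Litmus runs; the `resilience_score` name, the list-of-tuples input, and the choice to return 0 when every weight is 0 are all assumptions for illustration.

```python
from typing import List, Tuple


def resilience_score(results: List[Tuple[int, float]]) -> float:
    """Resilience Score = Total Test Result / Weight Sum.

    `results` holds one (weight, probe_success_pct) pair per experiment.
    Hypothetical helper, not a Litmus API.
    """
    weight_sum = sum(weight for weight, _ in results)
    if weight_sum == 0:
        # Assumption: with no weight assigned anywhere, report 0 rather than divide by zero.
        return 0.0
    total_test_result = sum(weight * pct for weight, pct in results)
    return total_test_result / weight_sum


# Pod Delete (weight 2) failed, Pod Network Loss (weight 10) passed.
score = resilience_score([(2, 0), (10, 100)])
print(f"{score:.2f}%")  # 83.33%
```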

However, if both experiments had the same weight, let's say 10, your Resilience Score would drop straight to 50%.
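To see how the weight of the failed experiment drives the drop, here is a small Python sketch that keeps Pod Network Loss at weight 10 (passed) and varies the weight of a failed Pod Delete. The helper name `score_with_failed_pod_delete` and the intermediate weight of 5 are assumptions added for illustration; the formula is the one described in this blog.

```python
def score_with_failed_pod_delete(pod_delete_weight: int, network_loss_weight: int = 10) -> float:
    """Resilience Score when Pod Delete fails (0%) and Pod Network Loss passes (100%).

    Hypothetical helper for illustration; not a Litmus API.
    """
    total_test_result = pod_delete_weight * 0 + network_loss_weight * 100
    weight_sum = pod_delete_weight + network_loss_weight
    return total_test_result / weight_sum


for weight in (2, 5, 10):
    print(f"Pod Delete weight {weight}: {score_with_failed_pod_delete(weight):.2f}%")
# Pod Delete weight 2: 83.33%
# Pod Delete weight 5: 66.67%
# Pod Delete weight 10: 50.00%
```

The heavier the failed experiment, the harder the score falls, which is exactly why a low-priority experiment should carry a low weight.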

With that, I hope you are now an expert on weights and will use them wisely. With great power comes great responsibility.


Conclusion

That's all, folks 👨‍🏫. Thank you for reading till the end. I hope you had a productive time learning about Litmus and can now construct your workflows with more confidence.


Contribute to LitmusChaos and share your feedback on GitHub. If you like LitmusChaos, become one of the many stargazers here.

Join the LitmusChaos slack community following these simple steps!

Step 1: Join the Kubernetes slack using the following link: https://slack.k8s.io/
Step 2: Join the #litmus channel on the Kubernetes slack or use this link after joining the Kubernetes slack: https://slack.litmuschaos.io/

Looking forward to having you in our community and learning together!
