Hey everyone, in this blog, I am going to talk about banking-related failures and how we can overcome them using LitmusChaos.
Have you ever come across a banking failure?
Today, mobile banking apps allow you to deposit checks from the comfort of anywhere. You can also check your balance, transfer funds, and set up a notification to alert you if you overdraw your account, all without visiting a branch. It is a real time-saver!
But at the same time, if the bank’s server is down or the connection is slow, you would not be able to access your accounts, and it would be difficult to know if your transaction went through.
Bank Of Anthos and its relation with LitmusChaos
- Bank-Of-Anthos-Resiliency Workflow
- Getting Started with Litmus
- Injecting Chaos and observing results.
- Conclusion, References, and Links.
Bank of Anthos is a sample HTTP-based web app that simulates a bank's payment processing network, allowing users to create artificial bank accounts and complete transactions.
Google uses this application to demonstrate how developers can modernize enterprise applications using GCP products.
The architecture of Bank Of Anthos
In LitmusChaos, we have been developing and testing our experiments against the Bank Of Anthos application. Chaos experiments contain the actual chaos details.
We have two scenarios: weak and resilient. We run our experiment against each of them and observe how the cause (the injected chaos) affects the application.
This workflow allows the execution of the same chaos experiment against two versions of the bank-of-anthos deployment: the weak ones are expected to fail while the resilient ones succeed, essentially highlighting the need for deployment best practices.
Litmus is an open-source Chaos Engineering platform that enables teams to identify weaknesses & potential outages in infrastructures by inducing chaos tests in a controlled way.
Developers & SREs can execute Chaos Engineering with Litmus as it is easy to use, based on modern chaos engineering practices & community collaboration. Litmus is 100% open source & CNCF-hosted.
Litmus takes a cloud-native approach to create, manage and monitor chaos.
Chaos is orchestrated using the following Kubernetes Custom Resource Definitions (CRDs):
- ChaosEngine: A resource to link a Kubernetes application or Kubernetes node to a ChaosExperiment. ChaosEngine is watched by the Chaos-Operator that invokes Chaos-Experiments.
- ChaosExperiment: A resource to group the configuration parameters of a chaos experiment. ChaosExperiment CRs are created by the operator when experiments are invoked by ChaosEngine.
- ChaosResult: A resource to hold the results of a chaos experiment. The Chaos-exporter reads the results and exports the metrics into a configured Prometheus server.
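To make the role of ChaosResult a bit more concrete, here is a minimal Python sketch that pulls the verdict out of a ChaosResult exported as JSON (for example with `kubectl get chaosresult <name> -n <namespace> -o json`). The field path `status.experimentStatus` with `phase` and `verdict` reflects the ChaosResult schema as I understand it for Litmus 2.0, so please verify it against your installed version. The helper `summarize_chaos_result` and the sample object are hypothetical and exist only for illustration.

```python
import json


def summarize_chaos_result(raw_json: str) -> dict:
    """Extract the name, phase and verdict from a ChaosResult JSON document."""
    result = json.loads(raw_json)
    # Assumed layout: status.experimentStatus.{phase, verdict}
    status = result.get("status", {}).get("experimentStatus", {})
    return {
        "name": result.get("metadata", {}).get("name", "<unknown>"),
        "phase": status.get("phase", "<unknown>"),
        "verdict": status.get("verdict", "<unknown>"),
    }


# Hypothetical ChaosResult, trimmed to the fields used above.
SAMPLE = json.dumps({
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosResult",
    "metadata": {"name": "example-engine-pod-network-loss", "namespace": "bank"},
    "status": {"experimentStatus": {"phase": "Completed", "verdict": "Pass"}},
})

if __name__ == "__main__":
    print(summarize_chaos_result(SAMPLE))
```

A missing field is reported as `<unknown>` rather than guessed, so a half-finished experiment does not get mistaken for a pass.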
Chaos experiments are hosted on hub.litmuschaos.io. It is the central hub where application developers and vendors share their chaos experiments so that their users can use them to increase the resilience of applications in production.
Pod-network-loss injects chaos to disrupt network connectivity to the transactionhistory Kubernetes pods.
The application pod should be healthy once the workflow run is complete, and service requests should be served despite the chaos. This experiment causes loss of access to an application replica by injecting packet loss using Litmus.
Network loss is injected on a transactionhistory pod that reads from the ledger DB. In the weak scenario, it runs as a single-replica deployment rather than a multi-replica one. Due to the loss, transactionhistory becomes inaccessible (or read-only), resulting in failed requests.
In a single-replica deployment, the user is not able to see the details, add details, or get information, whereas with two replicas it is still possible to add details. The transactionhistory details can be added and retrieved via the provided APIs.
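As a rough illustration of how the network-loss setup described above could be expressed, the sketch below builds a ChaosEngine manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts just like YAML. The engine name, the `app=transactionhistory` label, the `litmus-admin` service account, and the duration and loss values are all assumptions made for this example; they are not taken from the predefined workflow, so check them against your own cluster.

```python
import json


def build_chaos_engine(app_namespace: str, app_label: str,
                       loss_percent: int, duration_s: int) -> dict:
    """Build a ChaosEngine that runs pod-network-loss against one deployment."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        # Hypothetical engine name; the namespace placement is also an assumption.
        "metadata": {"name": "bank-network-loss", "namespace": app_namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": app_namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": "litmus-admin",  # assumed service account
            "experiments": [{
                "name": "pod-network-loss",
                "spec": {"components": {"env": [
                    # Kubernetes env values must be strings.
                    {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                    {"name": "NETWORK_PACKET_LOSS_PERCENTAGE",
                     "value": str(loss_percent)},
                ]}},
            }],
        },
    }


if __name__ == "__main__":
    engine = build_chaos_engine("bank", "app=transactionhistory", 100, 90)
    print(json.dumps(engine, indent=2))
```

The point is only to show how a ChaosEngine ties the target application (`appinfo`) to an experiment (`experiments[].name`) and its tunables.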
The HTTP probe allows developers to specify a URL that the experiment uses to gauge health/service availability (or other custom conditions) as part of the entry/exit criteria. The received status code is then mapped against an expected status. It can be defined at the .spec.experiments[].spec.probe path inside the ChaosEngine.
We use an HTTP probe against the frontend URL http://frontend.bank.svc.cluster.local:80 with a probe polling interval of 1 second.
- name: "check-frontend-access-url"
  type: "httpProbe"
  httpProbe/inputs:
    url: "http://frontend.bank.svc.cluster.local:80"
    insecureSkipVerify: false
    method:
      get:
        criteria: "=="
        responseCode: "200"
  mode: "Continuous"
  runProperties:
    probeTimeout: 2
    interval: 1
    retry: 2
    probePollingInterval: 1
In the weak scenario, only one replica of the pod is present. After chaos injection it goes down, the frontend can no longer serve the request, and the probe eventually fails.
In the resilient scenario, two replicas of the pod are present. After chaos injection, one goes down, but the other is still up, so the frontend stays accessible and the probe passes.
The probe keeps retrying within its timeout, waiting for the frontend to become accessible again. If the timeout elapses without a successful response, the probe fails.
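The weak and resilient outcomes described above can be mimicked with a small Python sketch. It is a simplified model of a single probe evaluation, not the way Litmus implements probes: `run_probe` retries a status check up to `retry` times with `interval` seconds between attempts, and `make_service` is a toy stand-in that answers 200 as long as at least one replica survives the chaos. Both function names and the replica model are assumptions made for this example.

```python
import time
from typing import Callable


def run_probe(fetch_status: Callable[[], int], expected: int = 200,
              retry: int = 2, interval: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if any of `retry` attempts yields the expected status code."""
    for attempt in range(retry):
        try:
            if fetch_status() == expected:
                return True
        except OSError:
            # Connection refused, timeouts, etc. count as a failed attempt.
            pass
        if attempt < retry - 1:
            sleep(interval)  # wait before the next attempt
    return False


def make_service(replicas: int, replicas_hit: int) -> Callable[[], int]:
    """Toy service: reachable while at least one replica is unaffected."""
    healthy = replicas - replicas_hit

    def fetch_status() -> int:
        if healthy <= 0:
            raise ConnectionError("no healthy replica behind the service")
        return 200

    return fetch_status


if __name__ == "__main__":
    no_sleep = lambda _seconds: None  # skip real waiting in the demo
    print("weak:", run_probe(make_service(1, 1), sleep=no_sleep))
    print("resilient:", run_probe(make_service(2, 1), sleep=no_sleep))
```

With one replica and one replica hit, the probe reports `False`; with two replicas and one hit, it reports `True`, which mirrors the two scenarios.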
Litmus-Portal provides a console and UI experience for managing and monitoring chaos workflows and the events around them. Chaos workflows consist of a sequence of experiments run together to achieve the objective of introducing a fault into an application or the Kubernetes platform.
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/2.0.0-Beta9/docs/2.0.0-Beta/litmus-2.0.0-Beta.yaml
After installation is complete, you can explore every part of the Chaos control plane and predefined workflows.
You can edit YAML and see the app deployer:
- name: install-application
  container:
    image: litmuschaos/litmus-app-deployer:latest
    args:
      - -namespace=bank
      - -typeName=resilient
      - -operation=apply
      - -timeout=400
      - -app=bank-of-anthos
      - -scope=cluster
Here, in the application installation step, you can change the argument to -typeName=weak for the weak scenario, which results in a single replica.
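If you generate the workflow programmatically, the switch between the two scenarios boils down to one argument. The Python sketch below rebuilds the app deployer argument list shown earlier for either scenario; the helper name `app_deployer_args` is hypothetical, and the check that only `weak` and `resilient` are accepted is my own guard based on the two scenarios this post describes.

```python
from typing import List

# The two scenarios discussed in this post.
SCENARIOS = ("weak", "resilient")


def app_deployer_args(type_name: str) -> List[str]:
    """Return the litmus-app-deployer args for the chosen scenario."""
    if type_name not in SCENARIOS:
        raise ValueError(f"typeName must be one of {SCENARIOS}, got {type_name!r}")
    return [
        "-namespace=bank",
        f"-typeName={type_name}",
        "-operation=apply",
        "-timeout=400",
        "-app=bank-of-anthos",
        "-scope=cluster",
    ]


if __name__ == "__main__":
    for scenario in SCENARIOS:
        print(scenario, app_deployer_args(scenario))
```

Rejecting unknown values early avoids silently deploying a scenario you did not intend to test.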
Note: Make sure you have passed the correct docker path in the ChaosEngine.
It can be cross-checked by using the UI.
We also see this in real-world scenarios, where we are unable to check our balance.
We can see that the application has passed even after network loss. Due to the high availability of replicas, the probe has passed.
Yes, the balance is there, and you can see the logs and Chaos Results below it.
We have briefly discussed the Bank Of Anthos resilience predefined workflow, which shows the resiliency of the Bank Of Anthos application and its relationship with LitmusChaos.
The workflow was covered in the Bank-Of-Anthos-Resiliency Workflow section. In LitmusChaos, we are continuously developing and testing our experiments against applications.
Among the different predefined workflows and Kubernetes workflows implemented, our third predefined workflow is Bank Of Anthos-Resiliency-Check, which we have walked through along with each scenario's causes and effects.
Thank you for reading it till the end. I hope you had a productive time learning about LitmusChaos. Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?
Join the LitmusChaos community in the #litmus channel on the Kubernetes Slack (https://slack.k8s.io/)!
Submit a pull request if you identify any necessary changes.
Do not forget to share these resources with someone you think might need help to learn/explore more about Chaos Engineering.