DEV Community

Ivan Porta

Posted on • Originally published at gtrekter.Medium

Mastering Chaos Engineering with Azure Chaos Studio

The rise of microservices architecture has sparked significant discussions within the technology industry. Despite the challenges posed by global events like the Russia-Ukraine war and the COVID-19 pandemic, experts project a remarkable growth in the market size of microservices architecture in the coming years. According to the IMARC Group, the market is expected to reach $7.8 billion by 2028. The Business Research Company paints an even brighter picture, predicting a market value of $10.86 billion by 2027. Market Research Future (MRFR) is even more optimistic, forecasting a growth to $21.61 billion by 2030. These projections confirm the widespread adoption of microservices architecture, making it crucial for developers and engineers to enhance their skills in technologies such as Docker, Kubernetes, and REST APIs.

The Complexity Challenge

However, as the adoption of microservices architecture continues to grow, so does the complexity of these systems. Testing these intricate systems, made up of hundreds of nodes and thousands of microservices, is challenging, and predicting failures has become increasingly difficult. Such failures can result in costly outages for companies. According to an International Data Corporation (IDC) report, infrastructure failures can cost large businesses around $100,000 per hour, while critical application failures can cost from $500,000 to $1 million per hour. Furthermore, a survey conducted by the Uptime Institute found that nearly one-third of all data centers experienced an outage in 2020.

To proactively address the challenges posed by the complexity of microservices architecture, an increasing number of companies are turning to Chaos Engineering.

Introducing Chaos Engineering

Chaos Engineering is a proactive testing practice designed by Netflix to test its system stability after it was migrated to Amazon Web Services. Its original purpose was to assess how its system responded when critical components of its infrastructure were taken down. By intentionally inducing failures and closely monitoring the system’s responses, engineers were able to identify weaknesses that may remain hidden under normal operating conditions. Gaining real-time insights into how a system responded under pressure prepared teams for actual failures and helped identify latent bugs.

By purposefully “breaking things,” businesses can improve their ability to find and resolve issues before they lead to costly outages, ensuring the resilience and reliability of their microservices architectures.

Netflix took it a step further and developed an entire suite of automated stress tests of their infrastructure called The Simian Army. However, this is a topic for a different time.

Phases of chaos engineering

As mentioned earlier, Chaos Engineering involves running thoughtful, planned experiments that teach us how our systems behave in the face of failure based on the following phases:

  1. Pick a hypothesis: Before running an experiment, you should have an idea of the possible outcome.
  2. Scope: Choose the blast radius of the experiment. Low-risk experiments involve a few users by injecting failures into a subset or small group of devices. Riskier and more accurate experiments are large-scale without custom routing and have the potential to impact users not in the experiment group through circuit breakers and shared resource constraints.
  3. Run the experiment: Execute the chaos experiment and collect metrics.
  4. Analyze the results: Use the metrics you’ve collected to validate or invalidate the initial hypothesis.
  5. Increase the scope: After gaining confidence from running smaller-scale experiments, you can gradually increase the blast radius and gain additional insights.
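
The analysis and scope-increase phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hypothesis` class, the error-rate threshold, and the doubling rule for the blast radius are hypothetical choices, not part of Chaos Studio or any chaos engineering tool.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Steady-state expectation, e.g. 'the error rate stays below 1%'."""
    description: str
    max_error_rate: float  # hypothetical threshold chosen for illustration


def analyze(hypothesis: Hypothesis, total_requests: int, failed_requests: int) -> bool:
    """Return True if the metrics collected during the run support the hypothesis."""
    if total_requests <= 0:
        raise ValueError("no requests were observed during the experiment")
    error_rate = failed_requests / total_requests
    return error_rate <= hypothesis.max_error_rate


def next_blast_radius(current_percent: int, hypothesis_held: bool) -> int:
    """Double the share of targeted instances after a successful run, capped at 100%."""
    if not hypothesis_held:
        # Investigate and fix the weakness before widening the scope.
        return current_percent
    return min(current_percent * 2, 100)


hypothesis = Hypothesis("error rate stays below 1% while a pod is down", 0.01)
held = analyze(hypothesis, total_requests=1000, failed_requests=5)
print(held, next_blast_radius(10, held))
```

Keeping the hypothesis as explicit data makes it harder to reinterpret the result after the fact: the threshold is fixed before the experiment runs.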

Chaos Experiments and Microsoft Azure

While Microsoft Azure has long supported several third-party chaos engineering tools and services such as Gremlin and Chaos Mesh, it wasn’t until March 13, 2023, that Microsoft made its own chaos engineering service publicly available with the release of Azure Chaos Studio in East Asia and other regions.

What is Azure Chaos Studio?

Azure Chaos Studio is a service provided by Microsoft that enables you to orchestrate controlled fault injections into your Azure resources, including but not limited to Azure Cosmos DB, AKS, Azure VM, and many others. Depending on the target, it supports different kinds of faults:

  • Service-direct: These faults are directly applied to an Azure resource, without necessitating any installation or instrumentation.
  • Agent-based: These faults are executed within VMs or VMSS to induce in-guest failures.

In the context of AKS, Azure Chaos Studio leverages Chaos Mesh, an open-source chaos engineering platform that lets users inject failures into an AKS cluster. As of this writing, there are several limitations to keep in mind when considering Azure Chaos Studio. For instance, it supports only Linux nodes and requires local cluster accounts to be enabled.

Integrating Azure Chaos Studio with AKS

In this section, I will guide you on how to integrate Azure Chaos Studio into your AKS cluster and execute an experiment.

Install Chaos Mesh in your AKS cluster

The first thing you will need to do is install Chaos Mesh on your AKS cluster. To do so:

  • Get the access credentials for your AKS cluster and merge them into your local kubeconfig file so that you can interact directly with your Kubernetes cluster.
$ az aks get-credentials -g rg-training-aks-uks-01 -n aks-training-uks-01
Merged "aks-training-uks-01" as current context in /home/gtrekter/.kube/config
  • Install Helm on your local machine. It is a Kubernetes package manager that simplifies application deployment and management. It uses packages called “charts,” which are pre-configured Kubernetes resources, to install complex applications into a Kubernetes cluster.
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
sudo apt-get install apt-transport-https --yes
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update
sudo apt-get install helm
  • Add the Chaos Mesh chart repository to Helm, update your local Helm chart repository cache, create a new namespace in your Kubernetes cluster called chaos-testing, and install a new release called chaos-mesh using the chaos-mesh chart.
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
kubectl create ns chaos-testing
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock

Once the installation completes, you’ll see several new pods, each dedicated to a Chaos Mesh task.

$ kubectl get pod -A
NAMESPACE       NAME                                                              READY   STATUS    RESTARTS   AGE
chaos-testing   chaos-controller-manager-674f467db4-dznmm                         1/1     Running   0          2m49s
chaos-testing   chaos-controller-manager-674f467db4-f2wq4                         1/1     Running   0          2m49s
chaos-testing   chaos-controller-manager-674f467db4-vpjth                         1/1     Running   0          2m49s
chaos-testing   chaos-daemon-qnlbc                                                1/1     Running   0          2m49s
chaos-testing   chaos-dashboard-d47f8c5cd-x55qs                                   1/1     Running   0          2m49s
chaos-testing   chaos-dns-server-84d96c6dbc-v74b2                                 1/1     Running   0          2m49s
default         azure-vote-back-78df98c548-5jxpm                                  1/1     Running   0          14h
...

And their respective services.

$ kubectl get service -A
NAMESPACE       NAME                                           TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                                 AGE
chaos-testing   chaos-daemon                                   ClusterIP      None           <none>          31767/TCP,31766/TCP                     3m44s
chaos-testing   chaos-dashboard                                NodePort       10.0.252.240   <none>          2333:31753/TCP,2334:30871/TCP           3m44s
chaos-testing   chaos-mesh-controller-manager                  ClusterIP      10.0.250.234   <none>          443/TCP,10081/TCP,10082/TCP,10080/TCP   3m44s
chaos-testing   chaos-mesh-dns-server                          ClusterIP      10.0.16.244    <none>          53/UDP,53/TCP,9153/TCP,9288/TCP         3m44s
...

These services run within the chaos-testing namespace. The chaos-controller-manager pods manage the chaos experiments, whereas the chaos-daemon pod coordinates and executes them.
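
If you want to check the installation from a script rather than by eye, a minimal sketch could parse the plain-text output of kubectl get pod -A. The function name `pods_not_ready` is hypothetical, and the parsing assumes the default column layout shown above (NAMESPACE, NAME, READY, STATUS first); it is not a substitute for kubectl's own wait or JSON output options.

```python
def pods_not_ready(kubectl_output: str, namespace: str = "chaos-testing") -> list[str]:
    """Return the names of pods in `namespace` that are not fully ready and Running."""
    problems = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        columns = line.split()
        if len(columns) < 4 or columns[0] != namespace:
            continue  # other namespaces, or truncated lines such as "..."
        name, ready, status = columns[1], columns[2], columns[3]
        current, _, desired = ready.partition("/")
        if status != "Running" or current != desired:
            problems.append(name)
    return problems


# Two rows copied from the listing above, used here as sample input.
sample_output = """\
NAMESPACE       NAME                          READY   STATUS    RESTARTS   AGE
chaos-testing   chaos-daemon-qnlbc            1/1     Running   0          2m49s
chaos-testing   chaos-dns-server-84d96c6dbc-v74b2   1/1     Running   0          2m49s
...
"""

print(pods_not_ready(sample_output))  # prints []
```

An empty list means every Chaos Mesh pod in the sample reports Running with all of its containers ready.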

Include your AKS cluster among the resources targetable by Chaos Studio experiments

Even if Chaos Mesh is installed in your cluster, before Chaos Studio can start injecting faults, it’s necessary to include your AKS cluster in the list of resources managed by Chaos Studio. Here’s how you do that:

  • Browse to the Azure portal and log in.
  • Search for Chaos Studio in the main search bar, and select Targets.
  • Check your AKS cluster, then click Enable targets and select Enable service-direct targets.

  • Click Review + Enable, and then Enable to confirm.

Create a chaos experiment

With your AKS cluster now enabled, it’s time to start defining our experiments.

  • Navigate to Chaos Studio.
  • Click on Experiments, then Create, and finally select New Experiment.

  • In the experiment creation form, assign a name to your experiment and choose the region in which the experiment will be stored.

  • Next, select the permissions that will be used to execute the experiment. They can come from either a system-assigned or a user-assigned managed identity.

Next comes the Experiment Designer section. This is where we’ll define the actions that will be performed against the targeted resources (in this case, the AKS instance). Depending on their configuration, these actions will be executed either sequentially or in parallel.
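
To make the sequential and parallel behavior concrete, here is a small Python model of an experiment. It assumes the usual Chaos Studio semantics (steps run one after another, branches within a step run in parallel, and actions within a branch run in order) and ignores any start-up overhead. The `Action`, `Branch`, and `Step` classes are hypothetical helpers for illustration, not Chaos Studio types.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    name: str
    minutes: int


@dataclass
class Branch:
    name: str
    actions: list[Action] = field(default_factory=list)

    def minutes(self) -> int:
        # Actions inside one branch run one after another.
        return sum(action.minutes for action in self.actions)


@dataclass
class Step:
    name: str
    branches: list[Branch] = field(default_factory=list)

    def minutes(self) -> int:
        # Branches inside one step run in parallel, so the longest one decides.
        return max((branch.minutes() for branch in self.branches), default=0)


def experiment_minutes(steps: list[Step]) -> int:
    # Steps run sequentially, so their durations add up.
    return sum(step.minutes() for step in steps)


pod_failure = Action("pod failure", minutes=10)
experiment = [Step("Step 1", [Branch("Branch 1", [pod_failure])])]
print(experiment_minutes(experiment))  # prints 10
```

With a single step, branch, and 10-minute fault, as in this walkthrough, the estimate is simply 10 minutes; the model becomes useful once you add parallel branches or follow-up steps.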

  • Assign a name to your Step and Branch, and click on Add fault.
  • In the side panel, you’ll have the option to choose from a wide array of pre-configured faults. For the purposes of this example, we’ll opt to cause a pod to fail for a duration of 10 minutes.

While the parameters of the faults depend on the type you’ve chosen, it’s worth noting that AKS faults, being based on Chaos Mesh, share two common parameters: Duration and jsonSpec.

In Chaos Mesh, you usually run chaos experiments by deploying YAML manifests to your Kubernetes cluster, as shown below:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: all
  duration: '600s'
  selector:
    namespaces:
      - default

To use a Chaos Mesh manifest in Azure Chaos Studio, convert its spec: block to JSON and provide it as the jsonSpec parameter. For example:

{"action":"pod-failure","mode":"all","duration":"600s","selector":{"namespaces":["default"]}}
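
If you prefer not to write the JSON by hand, the conversion can be scripted. The sketch below uses only Python's standard json module and assumes you have already transcribed the spec: block into a Python dict (reading the YAML file directly would need a third-party YAML parser, which is left out here). The helper name `to_json_spec` is hypothetical.

```python
import json


def to_json_spec(spec: dict) -> str:
    """Serialize a Chaos Mesh `spec` mapping to a compact, single-line JSON string."""
    return json.dumps(spec, separators=(",", ":"))


# The spec block of the PodChaos manifest above, written as a Python dict.
pod_failure_spec = {
    "action": "pod-failure",
    "mode": "all",
    "duration": "600s",
    "selector": {"namespaces": ["default"]},
}

print(to_json_spec(pod_failure_spec))
```

The printed string matches the JSON shown above and can be pasted into the jsonSpec field.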
  • Select Next: Target resources, and select the resources you wish to target with the experiment. Note that you’ll only see the resources that have been enabled in Chaos Studio.

  • Click Add to confirm.
  • Finally, click Review + create, and then Create. Upon completion, you’ll see the experiment in your resource group.

Grant Permissions to the Experiment Managed Identity

When a chaos experiment is created, Chaos Studio automatically generates a system-assigned managed identity that carries out faults on your target resources. However, we need to manually assign appropriate permissions to it before it can start interacting with the AKS cluster. Depending on the fault selected, the managed identity will require different permissions. You can view the complete list through this link.

Supported resource types and role assignments for Chaos Studio - Microsoft

To grant these permissions to the Experiment managed identity, follow these steps:

  • Browse to your AKS cluster page, select Access control (IAM), then click on Add followed by Add role assignment.

  • Choose the necessary role, then navigate to the Members tab, click on Select members and click the name of the experiment that you created earlier.

  • Click Select, then Review + Assign.
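
The same role assignment can also be made from the command line. The sketch below only builds and prints an az role assignment create command so you can review it before running it. Several values are assumptions: the subscription ID and principal ID are placeholders, and the role name is an example that you should check against the supported role assignments list linked above for the fault you selected.

```python
import shlex


def aks_scope(subscription_id: str, resource_group: str, cluster_name: str) -> str:
    """Build the Azure resource ID of an AKS cluster, used as the assignment scope."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ContainerService/managedClusters/{cluster_name}"
    )


def role_assignment_command(principal_id: str, role: str, scope: str) -> str:
    """Return a shell-quoted Azure CLI command; nothing is executed here."""
    args = [
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_id,
        "--assignee-principal-type", "ServicePrincipal",
        "--role", role,
        "--scope", scope,
    ]
    return shlex.join(args)


scope = aks_scope(
    "00000000-0000-0000-0000-000000000000",  # placeholder subscription ID
    "rg-training-aks-uks-01",
    "aks-training-uks-01",
)
command = role_assignment_command(
    "11111111-1111-1111-1111-111111111111",  # placeholder: the experiment identity's object ID
    "Azure Kubernetes Service Cluster Admin Role",  # assumed role; verify for your fault
    scope,
)
print(command)
```

Printing the command instead of running it keeps the sketch free of side effects and lets you confirm the scope and role first.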

Start the Experiment

Now it’s time to introduce some chaos and ‘break’ things in our resources! 😈 To do so, just browse to your experiment and click Start.

If you’re interested in a more detailed look at what’s happening within your experiment, you can access information about the currently executing step by clicking Details.
