The cloud provides the compute resources we need for machine learning jobs such as training, hyperparameter tuning, and inference, all of which require high CPU and memory utilisation. How do we ensure that the compute targets can withstand CPU pressure and safely execute machine learning jobs? This blog walks you through applying Chaos Engineering approaches in Machine Learning Operations (MLOps) to establish a well-architected Azure solution.
This blog covers two main sections:
- Attaching compute targets to Azure ML Workspace
- Create and Run Chaos experiment on Azure VM
If you are new to Chaos Engineering, do check out the basic concepts in the previous blog here
Attaching remote VM for Machine Learning Jobs
Create a Data Science Linux Virtual Machine
Go to Azure Portal -> Virtual Machine and create one.
Find the virtual machine listing by typing in "data science virtual machine" and selecting "Data Science Virtual Machine- Ubuntu 18.04". You can find the info here
Create Azure Machine Learning workspace
In the Azure Portal, search for the Machine Learning resource and create one.
Attach Remote VM as our compute instance
To attach our Data Science Linux Virtual Machine to the Machine Learning workspace as our compute instance, follow the steps below.
- Navigate to Azure Machine Learning Workspace
- Go to compute -> Attached Computes
- Click new and add Virtual Machine
- Enter the relevant details of our Virtual Machine and you will be able to see the compute instance listed in the portal.
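If you prefer to script the attachment instead of filling in the portal form, the main "relevant detail" is the VM's Azure Resource Manager (ARM) resource ID. The sketch below only builds that ID string. The function name and the placeholder subscription ID are ours, not part of any Azure tool. To our knowledge the Azure ML Python SDK v1 accepts such an ID through `RemoteCompute.attach_configuration(resource_id=...)`, but do check the SDK version you have installed.

```python
def vm_resource_id(subscription_id: str, resource_group: str, vm_name: str) -> str:
    """Build the ARM resource ID that identifies a virtual machine.

    All three arguments are placeholders you take from your own subscription.
    """
    for label, value in (
        ("subscription_id", subscription_id),
        ("resource_group", resource_group),
        ("vm_name", vm_name),
    ):
        # A slash inside a segment would silently produce a different resource path.
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty string without '/'")
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/virtualMachines/{vm_name}"
    )
```

For example, `vm_resource_id("<subscription-id>", "chaos", "<vm-name>")` gives the ID for a VM in the `chaos` resource group used in this blog.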
Create and Run Chaos experiment on Azure VM
We will cause a high CPU event on a Linux virtual machine Compute Instance using a chaos experiment and Azure Chaos Studio. Running this experiment can help you defend against an application becoming resource-starved.
Once the compute instance is up and running, SSH into our VM
Install stress-ng
For our Chaos Experiment we will use stress-ng, an open-source application that can cause various stress events on a virtual machine.
SSH into the VM you have created and install stress-ng by running the following command in your VM terminal.
sudo apt-get update && sudo apt-get -y install unzip && sudo apt-get -y install stress-ng
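Before handing the VM over to Chaos Studio, it is worth checking that the install actually worked. The sketch below checks that `stress-ng` is on the PATH and builds a command for a manual dry run. The helper names are ours. `--cpu 0` (one worker per online CPU), `--cpu-load` and `--timeout` are standard stress-ng options, but the exact arguments Chaos Studio passes to stress-ng are not covered in this blog, so treat the command as a rough manual approximation of the fault.

```python
import shutil
from typing import List


def stress_ng_installed() -> bool:
    """Return True if a stress-ng executable is found on the PATH."""
    return shutil.which("stress-ng") is not None


def cpu_stress_command(load_percent: int, seconds: int) -> List[str]:
    """Build a stress-ng command that loads every CPU to roughly load_percent."""
    if not 0 < load_percent <= 100:
        raise ValueError("load_percent must be between 1 and 100")
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return [
        "stress-ng",
        "--cpu", "0",                      # 0 means one worker per online CPU
        "--cpu-load", str(load_percent),   # target load per worker, in percent
        "--timeout", f"{seconds}s",        # stop automatically after this long
    ]
```

You could pass the list returned by `cpu_stress_command(95, 60)` to `subprocess.run` on the VM for a one-minute trial before starting the real experiment.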
Create Managed Identity
Create a Managed Identity resource and navigate to Access control (IAM).
Add a role assignment of Contributor to our VM.
Set Target in Chaos Studio
Go to Chaos Studio and select Targets. Select our VM.
Create Chaos Experiment
Click Chaos Experiment and create one.
Under Experiment Designer, we are going to design two different faults:
Branch 1_Step1: CPU Pressure
Set the pressure to 95% for 10 mins
Branch 1_Step2: Physical Memory Pressure
Set the pressure to 95% for 10 mins
Review and create the experiment
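Behind the designer, an experiment is stored as JSON. The sketch below shows roughly what the two faults above could look like as the `steps` part of that document. Several things here are our assumptions: that the two faults run as two sequential steps with one branch each, the selector ID `Selector1`, and the field names and fault URNs, which follow the Chaos Studio fault library as we understand it. Compare against the JSON view of your own experiment before relying on it.

```python
import json
from typing import Dict, List

# Fault identifiers as we understand them from the Chaos Studio fault library (assumption).
CPU_FAULT = "urn:csci:microsoft:agent:cpuPressure/1.0"
MEMORY_FAULT = "urn:csci:microsoft:agent:physicalMemoryPressure/1.0"


def pressure_step(step_name: str, fault_urn: str, level: int = 95,
                  minutes: int = 10, selector_id: str = "Selector1") -> Dict:
    """Describe one step holding a single branch with one continuous pressure fault."""
    return {
        "name": step_name,
        "branches": [{
            "name": "Branch 1",
            "actions": [{
                "type": "continuous",
                "name": fault_urn,
                "selectorId": selector_id,      # hypothetical selector pointing at our VM
                "duration": f"PT{minutes}M",    # ISO 8601 duration, e.g. PT10M
                "parameters": [{"key": "pressureLevel", "value": str(level)}],
            }],
        }],
    }


def build_experiment_steps() -> List[Dict]:
    """Step 1 applies CPU pressure, Step 2 applies physical memory pressure."""
    return [
        pressure_step("Step 1", CPU_FAULT),
        pressure_step("Step 2", MEMORY_FAULT),
    ]
```

Calling `json.dumps(build_experiment_steps(), indent=2)` prints the structure so you can compare it side by side with what the portal generated.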
Give permissions to Chaos Experiment
Navigate to our VM and give Contributor access to Chaos Experiment.
Start the Chaos Experiment
Navigate back to Chaos Experiment and start the experiment. You can click on the details of the run to see the injected faults.
Monitor the Experiment
Once the experiment has started, you can see its current state.
CPU Pressure
SSH into our VM and use the command top to view the CPU utilisation of the VM. As per our fault injection, stress-ng exerts 95% CPU pressure on the VM.
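If you want a number you can log rather than watching top, the same information is available in `/proc/stat` on Linux. The sketch below is our own helper, not part of Chaos Studio. It compares two samples of the aggregate `cpu` line and reports the share of time the CPUs were busy in between. It counts idle and iowait as idle time, which is a common convention but still a choice.

```python
from typing import List


def _cpu_fields(stat_line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into its counters."""
    parts = stat_line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    # Order: user nice system idle iowait irq softirq steal
    fields = [int(x) for x in parts[1:9]]
    if len(fields) < 5:
        raise ValueError("cpu line has too few fields")
    return fields


def cpu_busy_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent busy between two samples."""
    b, a = _cpu_fields(before), _cpu_fields(after)
    total = sum(a) - sum(b)
    idle = (a[3] + a[4]) - (b[3] + b[4])   # idle + iowait
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return 100.0 * (total - idle) / total


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()
```

On the VM, call `read_cpu_line()` twice a few seconds apart and pass both lines to `cpu_busy_percent`. While the CPU pressure fault is running, we would expect the result to sit close to the 95% we configured.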
Physical Memory Pressure
Using the same command, you can see that the free memory is less than 5%, as per the fault injected into the VM.
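The memory check can be scripted in the same way from `/proc/meminfo`. The sketch below is our own helper. It reports `MemAvailable` as a percentage of `MemTotal`. Note that this is an assumption about which figure to watch: the "free" value shown by top and the kernel's `MemAvailable` estimate are not the same number, so expect small differences.

```python
def available_memory_percent(meminfo_text: str) -> float:
    """Return MemAvailable as a percentage of MemTotal from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        tokens = rest.split()
        if sep and tokens:
            values[key.strip()] = int(tokens[0])   # values are reported in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    if total <= 0:
        raise ValueError("MemTotal must be positive")
    return 100.0 * available / total
```

On the VM you can feed it the file contents with `available_memory_percent(open("/proc/meminfo").read())`. During the memory pressure fault, a value below 5 would match what we saw in top.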
Once the experiment is over, the VM returns to its normal state.
Experiment Results
Observation:
We noticed that both faults were successfully injected by our Chaos Experiment.
Inference:
We inferred that the VM can handle high CPU and memory pressure, which makes it suitable for machine learning jobs such as model training and hyperparameter tuning.
Mitigation:
You can set up alerts, load balancing, and a backup VM as mitigation plans.
Conclusion
We have successfully performed load testing on our Data Science Virtual Machine to confirm that the VM stays resilient under high CPU and memory utilisation.
With Chaos Engineering, you can ensure that your MLOps setup is stable, robust and resilient to faults and failures.
Delete the resource group chaos once you are done to prevent additional charges.
If you liked this article, show some love by dropping a heart and sharing it across your social handles.
Let's add more chaos to our Machine Learning in upcoming blogs, stay tuned!