DEV Community

NaveenKumar Namachivayam ⚡ for AWS Community Builders

Posted on • Originally published at qainsights.com

Chaos Engineering with LitmusChaos on AWS EKS using IRSA

I have kicked off a brand-new series on my YouTube channel called the Learn Chaos Engineering series. The first few episodes focus on LitmusChaos, which I have been working with for the past several weeks; the more I work with it, the more questions arise about its mechanics. In episode 2, we ran an EC2 instance termination experiment using AWS Secrets, which is one way to run chaos on AWS. But exposing secrets and access keys is not a recommended practice. Enter IRSA: IAM Roles for Service Accounts. Instead of creating a video, let me write it up as a blog post; I hope it will help people who are getting started. Before writing this post, I practiced the steps several times. Let us proceed.

https://youtube.com/playlist?list=PLJ9A48W0kpRKyBBmwOz6oSn4s3A90HHCj

Prerequisites

There are several moving components when it comes to AWS EKS. A few are optional, but they are great add-ons if you hit a roadblock. For example, I am using Octant to visualize the Kubernetes ecosystem; you can use Lens or the Kubernetes Dashboard if you are more familiar with them.

Below are the things you need before you proceed:

  1. eksctl to spin up the AWS EKS cluster
  2. LitmusChaos
  3. Access to create AWS IAM roles and policies
  4. Helm (optional)
  5. Octant (optional)
  6. An EC2 instance to run the experiments on

What is LitmusChaos?

Litmus is an open source Chaos Engineering platform that enables teams to identify weaknesses & potential outages in infrastructures by inducing chaos tests in a controlled way.

Recently, it was acquired by Harness. It is a CNCF incubating project.

eksctl

eksctl is a great CLI utility for various AWS EKS tasks such as cluster management, security hardening, GitOps, and more. It is written in Go and uses CloudFormation under the hood.

Using eksctl, we can create a cluster with a single command, but I do not recommend that: by default it creates two m5.large worker nodes in us-west-2, which is more than we need for learning purposes.

Instead, use the --dry-run flag to write the generated config to a file, then modify parameters such as the region, instance type, and availability zones.

Install eksctl from this documentation.

Enter the below command, which will create an EKS YAML manifest.

eksctl create cluster --name=chaos-playground --region=us-east-1 --dry-run > chaos-playground.yaml

The name of our cluster is chaos-playground, and it will be created in us-east-1. Change the region based on your location so that you use the region nearest to you.

Open the YAML manifest in your favorite editor, e.g. vim chaos-playground.yaml

Change the instance type to t3.medium, save the file, and run the below command to start creating the cluster.
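If you prefer to script the edit instead of opening an editor, a one-line sed substitution works too. This is a sketch that assumes the generated manifest contains the default `instanceType: m5.large`; check your file first (GNU sed shown; on macOS use `sed -i ''`):

```shell
# Swap the default node instance type in the generated manifest, if present
if [ -f chaos-playground.yaml ]; then
  sed -i 's/instanceType: m5.large/instanceType: t3.medium/' chaos-playground.yaml
fi
```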

eksctl create cluster -f chaos-playground.yaml

Go grab a coffee, tea, or your favorite beverage and come back after 15-25 minutes.

Once the EKS cluster is up and running, you will see the below message.

EKS cluster "chaos-playground" in "us-east-1" region is ready

Updating kube config

You might be running various clusters locally or in the cloud, so let us point the kube config's current context at our new cluster using the below command.

aws eks update-kubeconfig --name chaos-playground --region us-east-1

The above command changes the context to chaos-playground. To validate the current context, enter the below command.

k config current-context

I have an alias k for kubectl. You may need to enter kubectl in place of k.

Install kubectl from here.

LitmusChaos

Let us begin by installing LitmusChaos in the litmus namespace. You can install LitmusChaos from here. By default, the services litmusportal-frontend-service and litmusportal-server-service are exposed as NodePort; we will expose them as ClusterIP instead. I have already changed the type and kept the manifest in my GitHub repo. Let us apply it to our cluster using the below command.

k apply -f https://raw.githubusercontent.com/QAInsights/Learn-Chaos-Engineering-Series/main/LitmusChaos-AWS-EKS/litmuschaos-2.9.0.yaml

The above command will create various Kubernetes objects as shown below.

Installing LitmusChaos

By default, all the Litmus Custom Resource Definitions (CRDs) will be installed in the litmus namespace.

Let us verify the pods which are running in the litmus namespace.

k get po -n litmus


The next step is to expose the service litmusportal-frontend-service to get the load balanced URL.

Enter the below command to patch the service.

k patch svc litmusportal-frontend-service -p '{"spec": {"type": "LoadBalancer"}}' -n litmus

To get the load balanced URL, enter the below command.

k get -n litmus svc litmusportal-frontend-service -o wide

You will get the URL and port. e.g. you can access the LitmusChaos UI using port 9091 and URL http://a336e7fc0d2d64029abdf53d95aa1cca-97284119.us-east-1.elb.amazonaws.com:9091
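If you want just the hostname without the wide table, kubectl's jsonpath output can extract it directly once the load balancer has been assigned. A small sketch (the sample hostname below is the one from this demo; on a live cluster, set HOST from the commented kubectl command instead):

```shell
# On a live cluster:
# HOST=$(kubectl get svc litmusportal-frontend-service -n litmus \
#   -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
HOST="a336e7fc0d2d64029abdf53d95aa1cca-97284119.us-east-1.elb.amazonaws.com"  # sample value
echo "http://${HOST}:9091"
```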

Let us wait a few moments for the URL to become reachable. Meanwhile, let us install Octant from here.

Run the Octant program and open 127.0.0.1:7777, the Octant homepage, in your browser.

The first step is to configure the kube config. Go to Preferences > General and note down the kube config path. Make sure our AWS EKS kube config is present in the Octant kube config. If it is not present, Octant will not display the chaos-playground objects.

Here are the pods running in the litmus namespace on the chaos-playground cluster.

Litmus Pods

Now it is time to launch the LitmusChaos UI.

The default credential is admin/litmus

Upon first login, LitmusChaos will prompt you to change the password. You can skip this or change the credentials.

After successful login, click ChaosAgents to validate the Self-Agent status.

Wait till the Self-Agent is in Active status as shown below.

ChaosAgent

Congratulations! LitmusChaos is up and running now. Let us run an experiment. Before we begin running the experiments, we need to form a hypothesis.

Given: a single EC2 instance running with no auto-scaling group.

Hypothesis: failure of the EC2 instance will disrupt the customer experience.

Spin up an EC2 Instance

If you are new to AWS, please follow this tutorial to spin up an EC2 instance in your region.

For this demo, I have spun up a t2.micro EC2 instance running an nginx server.

EC2 Instance up and running

EC2 Terminate By ID Experiment

Let us go back to LitmusChaos and create a new experiment which will terminate the above EC2 instance using its ID.

Click Litmus Workflows > Schedule a workflow to start creating a workflow.

Select the Self-Agent, then click Next.

Click Create a new workflow using the experiments from ChaosHubs radio button and select Litmus ChaosHub, then click Next.

Click Next again, leaving the default values for the workflow name.

Click Add a new experiment and search for ec2.

Select kube-aws/ec2-terminate-by-id and then click Done.

Select experiment

Click the pen icon to edit the experiment.

Edit YAML

Click Next three times, leaving the default options.

Under the Tune Experiment section, enter the EC2 Instance ID.

Click Show more environment variables button to enter the region, then click Finish.

Tune Experiment

Click Next a few times and then Finish to start the experiment. This experiment will fail after a while.

Failed Experiment

Why did it fail?

The LitmusChaos pods do not have any permission to terminate the EC2 instance, even though everything runs in the same AWS account. There are two ways to authenticate LitmusChaos workloads to AWS resources: AWS Secrets or IAM.

Mounting AWS secrets is not a recommended practice for running the experiment, as it exposes the secrets in a YAML file.

Enter IRSA: IAM Roles for Service Accounts.

IAM is a web service that helps you securely control access to AWS resources.

Service Account in this context is meant for Kubernetes. A Kubernetes service account provides an identity for processes that run in a pod.

The following steps are involved in creating an IAM role for a service account:

  1. Create OIDC provider
  2. Create IAM Role and Policy
  3. Associate IAM Role

Create OIDC provider

The OIDC concept is beyond the scope of this article. At a high level, OpenID Connect lets AWS trust a supported identity provider: the pod presents a JWT issued by the cluster's OIDC provider, which AWS STS exchanges for temporary credentials that can authenticate to AWS services such as EC2.

Enter the below command to validate the OIDC provider for the cluster.

aws eks describe-cluster --name chaos-playground --region us-east-1 --query "cluster.identity.oidc.issuer" --output text

This will print output similar to the following.

https://oidc.eks.us-east-1.amazonaws.com/id/B718311B05C5C27CCF96C406CEXXXXXX

If there is no output, enter the below command to create one.

eksctl utils associate-iam-oidc-provider --cluster chaos-playground --region us-east-1 --approve
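The trailing ID of the issuer URL is what the trust policy later needs. A shell parameter expansion pulls it out of the describe-cluster output; the sketch below hardcodes the sample issuer from above, while on a real cluster you would set ISSUER from the commented aws command:

```shell
# On a real cluster:
# ISSUER=$(aws eks describe-cluster --name chaos-playground --region us-east-1 \
#   --query "cluster.identity.oidc.issuer" --output text)
ISSUER="https://oidc.eks.us-east-1.amazonaws.com/id/B718311B05C5C27CCF96C406CEXXXXXX"
# Strip everything up to and including the last '/' to get the bare OIDC ID
OIDC_ID=${ISSUER##*/}
echo "$OIDC_ID"   # B718311B05C5C27CCF96C406CEXXXXXX
```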

Create IAM Policy and Role

Copy the below policy and save it as chaos-playground-policy.json. Replace <your-account-id> with your account ID.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "ec2:*",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "sts:*",
            "Resource": "arn:aws:iam::<your-account-id>:role/*"
        }
    ]
}
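Before creating the policy, it is worth confirming that the edited JSON still parses; a stray comma after hand-editing is a common mistake. A quick check using Python's stdlib json.tool (assuming the file was saved as chaos-playground-policy.json, the name used by the create-policy command below):

```shell
# Prints "policy JSON OK" if the file parses; json.tool reports the error otherwise
if [ -f chaos-playground-policy.json ]; then
  python3 -m json.tool chaos-playground-policy.json > /dev/null && echo "policy JSON OK"
fi
```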

To create a policy, enter the below command.

aws iam create-policy --policy-name ChaosPlaygroundPolicy --policy-document file://chaos-playground-policy.json

Each entity in AWS has a unique identifier called an ARN. To retrieve the ARN of the above policy, enter the below command.

aws iam list-policies --query 'Policies[?PolicyName==`ChaosPlaygroundPolicy`].Arn' --output text

Copy the output; we will need it in a subsequent step.

The next step is to create a Trust Policy for our IAM role.

Save the below trust policy as chaos-playground-trust.json, making sure you replace the account ID and OIDC values appropriately.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<your-account-id>:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/B718311B05C5C27CCF96C406XXXXXXX"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {                   
                    "oidc.eks.us-east-1.amazonaws.com/id/B718311B05C5C27CCF96C406XXXXXXX:aud": "sts.amazonaws.com"
                }
            }
        }
    ]
}
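If editing the placeholders by hand feels error-prone, the same trust policy can be generated from shell variables with a heredoc. This is a sketch: ACCOUNT_ID and OIDC_ID below are placeholders you must replace with your own values.

```shell
ACCOUNT_ID=111122223333                      # placeholder: your AWS account ID
OIDC_ID=B718311B05C5C27CCF96C406XXXXXXX      # placeholder: your cluster's OIDC ID
# Generate the trust policy file with the variables substituted in
cat > chaos-playground-trust.json <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/${OIDC_ID}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-east-1.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com"
                }
            }
        }
    ]
}
EOF
```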

To create the role with the trust policy, enter the below command.

aws iam create-role --role-name Chaos-Playground-Role --assume-role-policy-document file://chaos-playground-trust.json

Once the role is created with the trust policy, the next step is to attach the policy we created earlier, i.e. ChaosPlaygroundPolicy.

To attach it to the IAM role, enter the below command.

aws iam attach-role-policy --policy-arn arn:aws:iam::<your-account-id>:policy/ChaosPlaygroundPolicy --role-name Chaos-Playground-Role

Associate IAM Role

To associate the IAM Role to the Kubernetes Service Account, the first step is to create a service account using eksctl command.

Enter the below command to create an IAM service account.

eksctl create iamserviceaccount --cluster=chaos-playground --namespace=litmus --name=ec2-terminate-sa-litmus --attach-policy-arn="arn:aws:iam::<your-account-id>:policy/ChaosPlaygroundPolicy" --override-existing-serviceaccounts --approve --region us-east-1

To validate the service accounts in the litmus namespace, enter the below command.

k get sa -n litmus


After creating the service account, the next step is to annotate a service account with the IAM role ARN. The below command annotates the litmus-admin service account, which will suffice for our experiment.

k annotate serviceaccount -n litmus litmus-admin eks.amazonaws.com/role-arn=arn:aws:iam::202835464218:role/Chaos-Playground-Role --overwrite

To validate the annotations, enter the below command.

k describe sa -n litmus litmus-admin

LitmusChaos Experiments

Let us run the experiment again. But before triggering it, we need to remove the mounted secrets from our experiment; since we are using IRSA, we do not need them.

Head to LitmusChaos > Litmus Workflows > Schedules.

Click the vertical three dots and then click Download Manifest.

Download Manifest

Open the manifest in your favorite editor, remove the lines that reference the secrets, and save it.

Create a new experiment by uploading the YAML as shown below. Then, start the execution.

Rerun the experiment

If all is well, our experiment will pass this time.

While the experiment is running, you can check the logs from Octant.

Octant Logs

Or check the status in EC2 Instances dashboard.

EC2 Stopped

Once the experiment is completed, LitmusChaos will revert the state of the EC2 instance.

Reverted Status

Here is the LitmusChaos workflow graph view.

Experiment Pass

Thanks for staying with me :)

Important Notes

  • The policy we created is open to all EC2 and STS resources and actions; you must fine-tune it for better security. Since this is a demo, I kept it permissive.
  • The annotation may apply to multiple service accounts; I have yet to validate which service account actually needs to be annotated.
  • Once the experiment is done, make sure you terminate the cluster and other resources.

Conclusion

IRSA is a beautiful implementation of the zero-trust principle. LitmusChaos is architected to meet the security needs of enterprises in the AWS and GCP ecosystems. We have just scratched the surface of LitmusChaos; eventually, I will cover the other experiments on my channel.
