How to deploy and manage a RabbitMQ cluster on Amazon EKS using Terraform and Helm

#rabbitmq #kubernetes #helm #terraform

In this blog post, we will deploy and manage RabbitMQ on Kubernetes with the combination of Terraform and Helm. As businesses increasingly embrace microservices architectures, efficient message queuing becomes paramount, and RabbitMQ is a robust solution. We'll walk through setting up RabbitMQ on a Kubernetes cluster step by step, using the infrastructure-as-code capabilities of Terraform and the Kubernetes package manager Helm.

When should you consider deploying RabbitMQ on Kubernetes?

There are a few different ways you can run RabbitMQ:

You can deploy RabbitMQ on a single cloud VM (e.g., on an EC2 instance)

This option can work in some cases, but it has the obvious disadvantage that you have to patch, secure, and monitor RabbitMQ yourself.
Also, a single-node RabbitMQ may not be the best fit, since it will not be highly available, fault-tolerant, or scalable.
Furthermore, you will probably need several RabbitMQ instances for different services. Let's say you need ten independent RabbitMQ instances for ten independent microservices. What if that number changes rapidly? In these circumstances, it becomes increasingly difficult for your DevOps team to operate the instances.

Use a cloud provider-managed RabbitMQ, e.g., Amazon MQ.

While Amazon MQ does most of the heavy lifting for you out of the box, you might want to avoid it for reasons such as:

cost
vendor lock-in

Deploy it on Kubernetes!

This is exactly what we are going to do today, and we will see how effectively this solution deals with most of the problems discussed in the other two options.

Challenges with deploying RabbitMQ on a Kubernetes cluster:

How should we persist RabbitMQ state?

We will use Kubernetes StatefulSets and PersistentVolumes to solve this problem. When a StatefulSet creates a pod, it automatically creates a PVC (PersistentVolumeClaim) for that pod based on the template defined in the StatefulSet's volumeClaimTemplates field. The PVC ensures that the pod has access to persistent storage, and the associated PV is either dynamically provisioned or statically bound, depending on the storage class and other specifications. The StatefulSet gives each pod a unique identifier in the form of an ordinal index, and this identifier is used to create a unique PVC for each pod. Furthermore, StatefulSets provide stable network identities for pods, which is essential for stateful applications that rely on predictable network addresses or hostnames. The stable network identity allows services, applications, and other components to consistently locate and communicate with stateful pods, even after they are rescheduled or re-created.
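
To make the naming concrete, here is a small shell sketch of how pod and PVC names are derived for a StatefulSet: Kubernetes names each pod `<statefulset>-<ordinal>` and each claim `<volumeClaimTemplate>-<pod>`. The function name `list_pvc_names`, the StatefulSet name, and the claim template name `data` are assumptions made for illustration; check the manifests your chart renders for the real values.

```bash
# Print the pod -> PVC mapping a StatefulSet would produce.
# Usage: list_pvc_names <statefulset-name> <claim-template-name> <replicas>
list_pvc_names() {
  sts_name=$1
  template_name=$2
  replicas=$3
  ordinal=0
  while [ "$ordinal" -lt "$replicas" ]; do
    pod="${sts_name}-${ordinal}"
    # PVC name = <volumeClaimTemplate name>-<pod name>
    printf '%s -> %s-%s\n' "$pod" "$template_name" "$pod"
    ordinal=$((ordinal + 1))
  done
}

# Hypothetical names: a 3-replica StatefulSet with a claim template called "data".
list_pvc_names my-service-a-rabbit-rabbitmq data 3
```

Because the claim name depends only on the template name and the pod's ordinal, a re-created pod with the same ordinal reattaches to the same claim and finds its previous data.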

How should the different nodes in a RabbitMQ cluster communicate with each other inside Kubernetes?

The StatefulSet described above already solves half of the problem; the other half is solved by a headless Service. A headless Service is a Service without a cluster IP, and it is used to create DNS records for the pods it selects. Each selected pod gets its own DNS record based on the Service name. RabbitMQ pods can then reach each other directly through these DNS names to register and de-register nodes in the cluster.

Here are the log lines emitted while this happens:

```text
2024-01-16 13:11:18.176142+00:00 [warning] <0.731.0> Peer discovery: node rabbit@dev-my-service-a-rabbitmq-1.dev-my-service-a-rabbitmq-headless.rabbitmq.svc.cluster.local is unreachable
2024-01-16 13:11:28.194766+00:00 [info] <0.544.0> node 'rabbit@dev-my-service-a-rabbitmq-1.dev-my-service-a-rabbitmq-headless.rabbitmq.svc.cluster.local' up
2024-01-16 13:11:33.074253+00:00 [info] <0.544.0> rabbit on node 'rabbit@dev-my-service-a-rabbitmq-1.dev-my-service-a-rabbitmq-headless.rabbitmq.svc.cluster.local' up
```
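
The hostnames in these log lines follow the pattern `<pod>.<headless-service>.<namespace>.svc.<cluster-domain>`. As a rough sketch, the peer names for a whole cluster can be listed like this; the function name `peer_dns_names` and the replica count are assumptions, and `cluster.local` is only the default cluster domain.

```bash
# Print the stable DNS name of every pod behind a headless Service.
# Usage: peer_dns_names <statefulset-name> <headless-service> <namespace> <replicas>
peer_dns_names() {
  sts_name=$1
  headless_svc=$2
  namespace=$3
  replicas=$4
  ordinal=0
  while [ "$ordinal" -lt "$replicas" ]; do
    # <pod>.<headless service>.<namespace>.svc.<cluster domain>
    printf '%s-%s.%s.%s.svc.cluster.local\n' \
      "$sts_name" "$ordinal" "$headless_svc" "$namespace"
    ordinal=$((ordinal + 1))
  done
}

# Names taken from the log lines above; the replica count of 3 is assumed.
peer_dns_names dev-my-service-a-rabbitmq dev-my-service-a-rabbitmq-headless rabbitmq 3
```

These names stay the same across pod restarts, which is what lets a restarted node find and rejoin its peers.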

How should we automate the provisioning of as many RabbitMQ instances as our applications need?

We will write a Terraform module that takes a list of configurations, one for each required RabbitMQ instance. Luckily for us, we don't have to write the Kubernetes YAML manifests ourselves, since the Helm chart by Bitnami does a great job of handling everything we discussed above. All we need to do is use the Terraform Helm provider to deploy the chart with the values required for our use case.

Terraform module to provision a list of RabbitMQ instances

First, we define the list of RabbitMQ instances our applications need in a YAML file. Here's what it might look like:

instances.yaml

```yaml
nlb_subnets: "subnet-0123456789a\\, subnet-0123456789b\\, subnet-0123456789c"

instances:
  my-service-a-rabbit:
    username: service-a-admin
    password: aws-kms-encrypted-super-secret-password-a
    image_registry: docker.io
    image_repository: bitnami/rabbitmq
    image_version: 3.12.12
    replica_count: 3

  my-service-b-rabbit:
    username: service-b-admin
    password: aws-kms-encrypted-super-secret-password-b
    replica_count: 2

  my-service-c-rabbit:
    username: service-a-admin
    password: aws-kms-encrypted-super-secret-password-c
    replica_count: 2

  my-service-d-rabbit:
    username: service-b-admin
    password: aws-kms-encrypted-super-secret-password-d
    replica_count: 4
```

Here, nlb_subnets is a comma-separated list of the subnet IDs in which the Network Load Balancer will be created. These can be public or private subnets, depending on your load balancer's scheme. We use a Network Load Balancer as the access point for each RabbitMQ instance on the list. You might ask why we need a Network Load Balancer rather than an Application Load Balancer. RabbitMQ clients talk to RabbitMQ servers over AMQP (amqp:// URIs), which runs directly on TCP, while Application Load Balancers only balance HTTP and HTTPS traffic. NLBs balance traffic at the TCP layer and do not depend on any layer 7 protocol. Please note that we are using the AWS Load Balancer Controller to provision the AWS load balancers.
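
One detail in instances.yaml is worth spelling out: the commas in nlb_subnets are escaped with backslashes because Helm's `--set` syntax, which the Terraform Helm provider's `set` blocks follow, treats an unescaped comma as a separator between values. Here is a minimal shell sketch that builds such a string from a list of subnet IDs. The function name `join_subnets_for_helm` is made up for this example, and the subnet IDs are the placeholders from the file above.

```bash
# Join subnet IDs with "\," so Helm reads them as ONE value instead of several.
# Usage: join_subnets_for_helm <subnet-id>...
join_subnets_for_helm() {
  joined=""
  for subnet_id in "$@"; do
    if [ -z "$joined" ]; then
      joined="$subnet_id"
    else
      # "\\," inside double quotes yields a literal backslash followed by a comma
      joined="$joined\\,$subnet_id"
    fi
  done
  printf '%s\n' "$joined"
}

join_subnets_for_helm subnet-0123456789a subnet-0123456789b subnet-0123456789c
```

When the result is pasted into a double-quoted YAML string, each backslash has to be written as `\\`, which is why the file above contains `\\,`.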

Now, we need to make sure our Terraform module can take the YAML file name as a variable and parse the file to set the values Helm requires.

variables.tf

```hcl
variable "config_file" {
  description = "Name of the YAML file (inside the conf/ directory) containing the configuration for all RabbitMQ instances"
  type        = string
}
```

main.tf

```hcl
locals {
  # Parse the YAML file once, then split it into per-instance settings and the shared subnets.
  config    = yamldecode(file("${path.root}/conf/${var.config_file}"))
  instances = local.config["instances"]
  subnets   = local.config["nlb_subnets"]
}

# Decrypt each instance's password. The payload must be the base64-encoded
# ciphertext returned by a KMS encrypt operation.
data "aws_kms_secrets" "decrypt_password" {
  for_each = local.instances
  secret {
    name    = "master_password"
    payload = each.value.password
  }
}

# One Helm release of the Bitnami RabbitMQ chart per entry in "instances".
resource "helm_release" "rabbit" {
  for_each         = local.instances
  name             = each.key
  namespace        = lookup(each.value, "namespace", "rabbitmq")
  create_namespace = true
  timeout          = 600
  repository       = lookup(each.value, "chart_repository", "oci://registry-1.docker.io/bitnamicharts")
  chart            = "rabbitmq"
  version          = lookup(each.value, "chart_version", "12.6.2")

  set {
    name  = "auth.username"
    value = lookup(each.value, "username", "rabbit")
  }

  # Consider a set_sensitive block here so the password stays out of the plan output.
  set {
    name  = "auth.password"
    value = lookup(data.aws_kms_secrets.decrypt_password, each.key, null) != null ? data.aws_kms_secrets.decrypt_password[each.key].plaintext["master_password"] : null
  }

  # NOTE: the Erlang cookie is a secret shared by all nodes of a cluster.
  # The release name is predictable, so prefer a random secret in production.
  set {
    name  = "auth.erlangCookie"
    value = each.key
  }

  set {
    name  = "image.registry"
    value = lookup(each.value, "image_registry", "docker.io")
  }

  set {
    name  = "image.repository"
    value = lookup(each.value, "image_repository", "bitnami/rabbitmq")
  }

  set {
    name  = "image.tag"
    value = lookup(each.value, "image_version", "3.12.12")
  }

  set {
    name  = "replicaCount"
    value = lookup(each.value, "replica_count", 1)
  }

  # Expose each instance through an NLB managed by the AWS Load Balancer Controller.
  set {
    name  = "service.type"
    value = "LoadBalancer"
  }

  # Dots inside the annotation key are escaped so Helm does not treat them as nesting.
  set {
    name  = "service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-type"
    value = "external"
  }

  set {
    name  = "service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-nlb-target-type"
    value = "ip"
  }

  set {
    name  = "service.annotations.service\\.beta\\.kubernetes\\.io/aws-load-balancer-subnets"
    value = local.subnets
  }
}
```

Similarly, you can override any other values you need from the chart's values.yaml: https://github.com/bitnami/charts/blob/main/bitnami/rabbitmq/values.yaml
Consider reading about the Erlang cookie, which is very important for clustered RabbitMQ environments: it is the shared secret that nodes in a cluster use to authenticate to each other.
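
Because every node of a cluster must present the same cookie, a random value per instance is harder to guess than the release name used in the module above. Here is a minimal sketch for generating one in the shell, assuming a Unix-like system with /dev/urandom. The function name `generate_erlang_cookie` is hypothetical, and wiring the value into the module (for example through an extra, KMS-encrypted field per instance) is not shown.

```bash
# Generate a random 64-character hexadecimal string to use as an Erlang cookie.
generate_erlang_cookie() {
  # 32 random bytes -> hex dump -> strip the whitespace that od adds
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \t\n'
  printf '\n'
}

cookie=$(generate_erlang_cookie)
printf 'generated a cookie of %s characters\n' "${#cookie}"
```

Treat the generated value like any other credential: store it encrypted and keep it out of version control in plain text.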

Now, we just need to invoke our module like so:

```hcl
module "rabbit" {
  source      = "../modules/rabbitmq-k8s" # adjust to wherever you keep the module files
  config_file = "instances.yaml"          # read from the conf/ directory of the root module
}
```

Now, we just need to run Terraform plan and apply!

```bash
terraform plan -out=plan.out
terraform apply plan.out
```

Once the Terraform apply is complete, our application clients can start communicating with the RabbitMQ instances using one of the following approaches:
I) The in-cluster (ClusterIP) service DNS name (only works if your applications run inside the same cluster)
II) The NLB hostname
III) If you would rather not use the NLB hostnames, consider using ExternalDNS to attach a custom domain name
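
Whichever hostname you pick, clients usually connect with an AMQP URI. Here is a small shell sketch that assembles one, percent-encoding the credentials and the virtual host. The helper names `urlencode` and `amqp_uri`, the hostname, and the credentials are placeholders, and the sketch assumes the service exposes the standard AMQP port 5672.

```bash
# Percent-encode every byte except RFC 3986 unreserved characters.
urlencode() {
  printf '%s' "$1" | od -An -v -tx1 | tr -s ' \t\n' '\n\n\n' | while read -r hex; do
    case "$hex" in
      '') ;; # skip blank lines left over from od's formatting
      3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a]|2d|2e|5f|7e)
        # unreserved character (0-9, A-Z, a-z, "-", ".", "_", "~"): print it
        # unchanged by converting the hex byte to an octal escape
        printf "\\$(printf '%03o' "0x$hex")" ;;
      *)
        # anything else becomes %XX with uppercase hex digits
        printf '%%%s' "$(printf '%s' "$hex" | tr 'a-f' 'A-F')" ;;
    esac
  done
}

# Build amqp://user:password@host:port/vhost
# Usage: amqp_uri <user> <password> <host> [port] [vhost]
amqp_uri() {
  port=${4:-5672}
  vhost=${5:-/}
  printf 'amqp://%s:%s@%s:%s/%s\n' \
    "$(urlencode "$1")" "$(urlencode "$2")" "$3" "$port" "$(urlencode "$vhost")"
}

# Placeholder values; substitute your NLB hostname or in-cluster service name.
amqp_uri service-a-admin 'p@ss/word' my-nlb.example.com
```

The default virtual host `/` has to be written as `%2F` in the URI, which is why the vhost goes through the same encoder as the credentials.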

And that's all, folks! Feel free to comment if you have any questions.