Safer handling of resource multiplicity in Terraform

#terraform #iac #hcl #cloud

One of Terraform's best features at launch was its support for the count attribute on resources. It lets multiple instances of the same resource spring into existence just by setting a number:

resource "aws_security_group" "webserver_sg" {
  name        = "webserver_sg"
  description = "Applied to Webserver fleet"
  vpc_id      = aws_vpc.main.id
}

resource "aws_security_group_rule" "example" {
  count             = 3
  type              = "ingress"
  description       = "rule ${count.index + 1}"
  from_port         = "8${count.index}"
  to_port           = "8${count.index}"
  protocol          = "tcp"
  cidr_blocks       = ["10.0.0.0/8"]
  security_group_id = aws_security_group.webserver_sg.id
}

This yields three security group rule instances, like so:

 # aws_security_group_rule.example[0] will be created
  + resource "aws_security_group_rule" "example" {
      + cidr_blocks              = [
          + "10.0.0.0/8",
        ]
      + description              = "1"
      + from_port                = 80
      + id                       = (known after apply)
      + protocol                 = "tcp"
      + security_group_id        = (known after apply)
      + self                     = false
      + source_security_group_id = (known after apply)
      + to_port                  = 80
      + type                     = "ingress"
    }

 # aws_security_group_rule.example[1] will be created
  + resource "aws_security_group_rule" "example" {
      + cidr_blocks              = [
          + "10.0.0.0/8",
        ]
      + description              = "2"
      + from_port                = 81
      + id                       = (known after apply)
      + protocol                 = "tcp"
      + security_group_id        = (known after apply)
      + self                     = false
      + source_security_group_id = (known after apply)
      + to_port                  = 81
      + type                     = "ingress"
    }

 # aws_security_group_rule.example[2] will be created
  + resource "aws_security_group_rule" "example" {
      + cidr_blocks              = [
          + "10.0.0.0/8",
        ]
      + description              = "3"
      + from_port                = 82
      + id                       = (known after apply)
      + protocol                 = "tcp"
      + security_group_id        = (known after apply)
      + self                     = false
      + source_security_group_id = (known after apply)
      + to_port                  = 82
      + type                     = "ingress"
    }

And it was good.

Where the seams ripped open

This approach has an interesting side effect: the resource's internal representation is essentially an array of resources whose length equals the value of the count attribute. Terraform loops through the array, creates each resource and stores it in its state file, using the resource's position in the array as its address in the state, like so:

aws_security_group_rule.example[0]
aws_security_group_rule.example[1]
aws_security_group_rule.example[2]
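
The same index is how the rest of the configuration refers to an individual instance. A minimal sketch, assuming the count-based resource above; the output names are hypothetical and only there for illustration:

```hcl
# Hypothetical outputs, for illustration only.
output "first_rule_id" {
  # Index 0 is whichever instance count produced first.
  value = aws_security_group_rule.example[0].id
}

output "all_rule_ids" {
  # The splat expression collects the id of every instance, in index order.
  value = aws_security_group_rule.example[*].id
}
```

Anything that refers to an instance by position like this depends on the order of the instances staying the same.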

For the sake of the exercise, let's say our servers should only accept traffic on ports 80 and 82. How would we achieve this? Before we get to that, let's refactor the code a little:

# Here we are initialising a list of maps.
# Each map represents a single security group rule.
locals {
  ingress_rules = [
    {
      description = "test-one"
      from_port   = 80
      to_port     = 80
      protocol    = "tcp"
      # cidr_blocks must be a list of strings, not a single string
      cidr_blocks = ["10.0.0.0/8"]
    },
    {
      description = "test-two"
      from_port   = 81
      to_port     = 81
      protocol    = "tcp"
      cidr_blocks = ["10.0.0.0/8"]
    },
    {
      description = "test-three"
      from_port   = 82
      to_port     = 82
      protocol    = "tcp"
      cidr_blocks = ["10.0.0.0/8"]
    },
  ]
}

resource "aws_security_group" "webserver_sg" {
  name        = "webserver_sg"
  description = "Applied to Webserver fleet"
  vpc_id      = aws_vpc.main.id
}

resource "aws_security_group_rule" "example" {
  # One instance per entry in the list, addressed by its position
  count             = length(local.ingress_rules)
  type              = "ingress"
  description       = local.ingress_rules[count.index].description
  from_port         = local.ingress_rules[count.index].from_port
  to_port           = local.ingress_rules[count.index].to_port
  protocol          = local.ingress_rules[count.index].protocol
  cidr_blocks       = local.ingress_rules[count.index].cidr_blocks
  security_group_id = aws_security_group.webserver_sg.id
}

Removing the second entry in the ingress_rules list would result in the following terraform plan:

 # aws_security_group_rule.example[1] must be replaced
  -/+ resource "aws_security_group_rule" "example" {
        cidr_blocks              = [
          + "10.0.0.0/8",
        ]
      ~ description              = "test-two" -> "test-three" # forces replacement
      ~ from_port                = 81 -> 82 # forces replacement
        id                       = (known after apply)
        protocol                 = "tcp"
        security_group_id        = (known after apply)
        self                     = false
        source_security_group_id = (known after apply)
      ~ to_port                  = 81 -> 82 # forces replacement
        type                     = "ingress"
    }

 # aws_security_group_rule.example[2] will be destroyed
  - resource "aws_security_group_rule" "example" {
      - cidr_blocks              = [
          + "10.0.0.0/8",
        ]
      - description              = "test-three"
      - from_port                = 82
      - id                       = (known after apply)
      - protocol                 = "tcp"
      - security_group_id        = (known after apply)
      - self                     = false
      - source_security_group_id = (known after apply)
      - to_port                  = 82
      - type                     = "ingress"
    }

What just happened there? By removing the middle entry in ingress_rules, we shifted every later object in the list one position to the left. Terraform matches configuration to state by index, so it flagged aws_security_group_rule.example[1] for replacement and aws_security_group_rule.example[2] for deletion.

This is far from ideal. If this plan were applied, it would not only remove the rule we're trying to get rid of; the last rule in the list would also be deleted and recreated for no reason other than its shift in position, which could disrupt traffic to our fleet of webservers. Can we get around this side effect?

Yes we can!

Terraform 0.12 brought along HCL 2.0, which adds support for dynamic expressions and loops and has been a game changer in how HCL is written.

Today we will be looking at two features (for loops and for_each expressions) that we can apply to our code to side-step the issue in a clean and elegant way.

The first thing we will change is the count attribute on the aws_security_group_rule resource. We will replace it with a for_each expression, like so:

resource "aws_security_group_rule" "test_two_rules" {
  for_each          = { for rule in local.ingress_rules : "${rule.description}-${rule.protocol}" => rule }
  type              = "ingress"
  ...
}

At first this seems like a more complicated way to create multiple instances than the count attribute, but once you understand how it works, it starts to make a lot of sense. Let's take a look at a snippet from the Terraform documentation on for_each:

for_each: Multiple Resource Instances Defined By a Map, or Set of Strings


Note: A given resource block cannot use both count and for_each.

By default, a resource block configures one real infrastructure object. However, sometimes you want to manage several similar objects,
such as a fixed pool of compute instances.
Terraform has two ways to do this: count and for_each.

The for_each meta-argument accepts a map or a set of strings, and creates an instance for each item in that map or set.
Each instance has a distinct infrastructure object associated with it (as described above in Resource Behavior),
and each is separately created, updated, or destroyed when the configuration is applied.

This is the key point: for_each doesn't accept a count, but rather a map or a set of strings. In our case, we're using a list of maps (or, to be more precise, a list of objects), which is incompatible with for_each. How can we make it compatible? That's where the for loop comes in.

{
for rule in local.ingress_rules :
 "${rule.description}-${rule.protocol}" => rule
}

This for loop iterates through our list and turns each object into a key/value pair. Because the expression is wrapped in curly braces {}, the pairs are collected into a map rather than a list.
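
To make the result concrete, here is a minimal sketch that stores the same expression in a local value. The names sample_rules, rules_by_key and rule_keys, as well as the trimmed-down two-entry list, are assumptions made for illustration:

```hcl
locals {
  # Trimmed-down copy of the rule list, for illustration only.
  sample_rules = [
    { description = "test-one", protocol = "tcp", from_port = 80, to_port = 80 },
    { description = "test-three", protocol = "tcp", from_port = 82, to_port = 82 },
  ]

  # Curly braces around a for expression build a map,
  # keyed by the string that appears before "=>".
  rules_by_key = {
    for rule in local.sample_rules :
    "${rule.description}-${rule.protocol}" => rule
  }
}

output "rule_keys" {
  # keys() returns the map's keys in lexical order.
  value = keys(local.rules_by_key)
}
```

With this sample list, the keys would be "test-one-tcp" and "test-three-tcp", and each key maps to the full rule object.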

Each entry has a key composed of the description and protocol fields of our object. This is important, as each entry needs a unique key so that Terraform can tell the entries apart. After the =>, we return the entire rule object as the value, since we will need its attributes to construct our resource instances.

The refactored code looks like this:

resource "aws_security_group" "test_two" {
  name        = "test_two"
  description = "Allow TLS inbound traffic"
  vpc_id      = data.aws_vpc.selected.id
}

resource "aws_security_group_rule" "test_two_rules" {
  # Build a map keyed by "<description>-<protocol>" so each instance has a stable address
  for_each = {
    for rule in local.ingress_rules :
    "${rule.description}-${rule.protocol}" => rule
  }
  type              = "ingress"
  description       = each.value.description
  from_port         = each.value.from_port
  to_port           = each.value.to_port
  protocol          = each.value.protocol
  cidr_blocks       = each.value.cidr_blocks
  security_group_id = aws_security_group.test_two.id
}
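
One thing to keep in mind if the rules already exist under count: switching to for_each changes every instance address, so Terraform would plan to destroy and recreate all of them once. A minimal sketch of one way to avoid that, assuming Terraform 1.1 or later (where moved blocks are available) and assuming the existing instances live at aws_security_group_rule.example[0] through [2]; adjust the addresses to whatever your own state contains:

```hcl
# Assumption: the old count-based instances are still in state under "example".
# Each moved block tells Terraform that an existing instance now lives
# at a new address, so it is re-addressed instead of being recreated.
moved {
  from = aws_security_group_rule.example[0]
  to   = aws_security_group_rule.test_two_rules["test-one-tcp"]
}

moved {
  from = aws_security_group_rule.example[1]
  to   = aws_security_group_rule.test_two_rules["test-two-tcp"]
}

moved {
  from = aws_security_group_rule.example[2]
  to   = aws_security_group_rule.test_two_rules["test-three-tcp"]
}
```

Running terraform plan afterwards should report the moves, which is an easy way to check the mapping before applying anything.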

Let's restore the middle entry in the list, re-create everything, then remove the entry again and see how Terraform deals with it:

 # aws_security_group_rule.example["test-two-tcp"] will be destroyed
  - resource "aws_security_group_rule" "example" {
      - cidr_blocks       = [
          - "172.31.0.0/16",
        ] -> null
      - description       = "test-two" -> null
      - from_port         = 444 -> null
      - id                = "sgrule-2053168410" -> null
      - ipv6_cidr_blocks  = [] -> null
      - prefix_list_ids   = [] -> null
      - protocol          = "tcp" -> null
      - security_group_id = "sg-0218c194b68ccec75" -> null
      - self              = false -> null
      - to_port           = 444 -> null
      - type              = "ingress" -> null
    }

Plan: 0 to add, 0 to change, 1 to destroy.

Excellent! You're probably wondering how we got around the index shift problem. The answer is pretty simple, and it's visible in the resource path.

The array index is no more. The instances are now stored as a map, and this one's key is "test-two-tcp". That lets Terraform pluck the correct instance out of the state by its unique key, rather than shifting array indices.
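
The same keys are used when referring to instances from elsewhere in the configuration. A minimal sketch, assuming the for_each resource test_two_rules from the refactored code; the output names are hypothetical:

```hcl
# Hypothetical outputs, for illustration only.
output "test_one_rule_id" {
  # Instances are addressed by their map key instead of a numeric index.
  value = aws_security_group_rule.test_two_rules["test-one-tcp"].id
}

output "rule_ids_by_key" {
  # A for_each resource behaves like a map of instances,
  # so a for expression can walk over it key by key.
  value = {
    for key, rule in aws_security_group_rule.test_two_rules : key => rule.id
  }
}
```

Because the reference names the key, removing "test-two-tcp" from the list does not change what "test-one-tcp" points to.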

I hope you've found this example useful and that it helps you make changes more safely, without side effects on your operations!