Anshuman Abhishek

Terraform: Production-grade IaC coding

TL;DR: This article is not meant to cover the basics of IaC or Terraform, nor does it focus on the certification exam. It is for those who know Terraform but lack practical exposure, or for anyone who wants to improve their Terraform skills and implement it in a better way.




Let's Go!




Basics

Terraform loads variables in the following order, with later sources taking precedence over earlier ones:

  • Environment variables
  • The terraform.tfvars file, if present.
  • The terraform.tfvars.json file, if present.
  • Any *.auto.tfvars or *.auto.tfvars.json files, processed in lexical order of their filenames.
  • Any -var and -var-file options on the command line, in the order they are provided. (This includes variables set by a Terraform Cloud workspace.)
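For example, a value passed with -var on the command line overrides the same variable set in terraform.tfvars. A minimal sketch (the variable name is just illustrative):

# variables.tf
variable "instance_type" {
  type    = string
  default = "t2.micro"
}

# terraform.tfvars
instance_type = "t2.large"

# Running: terraform plan -var="instance_type=t3.small"
# The command-line value t3.small wins over terraform.tfvars and the default.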

Types of loops:
count - to loop over resources
for_each - to loop over resources and inline blocks within a resource
for - to loop over lists and maps

Suggestion: try to avoid the count parameter and use for_each when you are looping over resources. But why?

For instance: you have created 3 IAM users using count, and Terraform stored them in the state file as a list:
aws_iam_user.test[0]: testuser1
aws_iam_user.test[1]: testuser2
aws_iam_user.test[2]: testuser3

Now, if you want to remove testuser2 from the list, the plan shows that Terraform wants to rename testuser2 to testuser3 and delete testuser3. Why does this happen? Because count tracks resources by their index in the list, so removing an item from the middle shifts every item after it.
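For reference, here is a minimal sketch of the count-based version that produces the state above (assuming a user_names list variable):

variable "user_names" {
  type    = list(string)
  default = ["testuser1", "testuser2", "testuser3"]
}

resource "aws_iam_user" "test" {
  count = length(var.user_names)
  name  = var.user_names[count.index]   # tracked by index, hence the shifting problem
}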

In case of for_each:

resource "aws_iam_user" "example" {
for_each = toset(var.user_names)
name = each.value
}

*toset converts the var.user_names list into a set, because for_each on a resource only accepts a set or a map.
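With for_each, state entries are keyed by value rather than by position, so removing testuser2 leaves the other users untouched. The resulting state addresses look like this:

# aws_iam_user.example["testuser1"]: testuser1
# aws_iam_user.example["testuser2"]: testuser2
# aws_iam_user.example["testuser3"]: testuser3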

Other use cases include applying tags to a resource with a dynamic block; the content block below assumes an Auto Scaling Group, whose tag blocks take key, value, and propagate_at_launch:

dynamic "tag" {
  for_each = {
    for key, value in var.custom_tags :
    key => upper(value) if key != "Name"
  }

  content {
    key                 = tag.key
    value               = tag.value
    propagate_at_launch = true
  }
}

Another use case: what if you want to create autoscaling in prod but not in dev, using the same module? Terraform doesn't support if-statements, but you can accomplish the same thing by using the count parameter like this:

count = var.enable_autoscaling ? 1 : 0

*You may want to use count to conditionally create resources, but use for_each for all other types of loops and conditionals.
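A minimal sketch of how this looks on a full resource; the scheduled-action resource and the enable_autoscaling variable here are illustrative placeholders:

variable "enable_autoscaling" {
  description = "If true, create the autoscaling schedule (e.g., in prod only)"
  type        = bool
  default     = false
}

resource "aws_autoscaling_schedule" "scale_out_business_hours" {
  count = var.enable_autoscaling ? 1 : 0   # 1 resource in prod, 0 in dev

  scheduled_action_name  = "scale-out-during-business-hours"
  min_size               = 2
  max_size               = 10
  desired_capacity       = 10
  recurrence             = "0 9 * * *"
  autoscaling_group_name = aws_autoscaling_group.example.name   # placeholder ASG
}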

zipmap combines two lists into a map; for example, zipmap(["a", "b"], [1, 2]) returns { a = 1, b = 2 }.

Use case: when you need an output that contains a direct mapping of IAM user names to ARNs.
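A minimal sketch, assuming the for_each-based aws_iam_user.example resource from above:

output "user_arns" {
  description = "Map of IAM user names to ARNs"
  value = zipmap(
    values(aws_iam_user.example)[*].name,
    values(aws_iam_user.example)[*].arn,
  )
}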

Use create_before_destroy. For instance, if you don't want downtime when replacing an ASG, create the replacement first, and only delete the original afterwards:

lifecycle {
  create_before_destroy = true
}

From time to time, a locking error will appear when you use remote state. To unlock:
terraform force-unlock <LOCK_ID>

Provisioners can be used to model specific actions on the local machine or on a remote machine in order to prepare servers or other infrastructure objects for service. Avoid them as much as possible, because IaC is not built for this.

  • local-exec
  • remote-exec
  • file // to copy files to the remote machine
  • connection // a nested block that tells provisioners how to connect
  • provisioners without a resource (via null_resource)
  • chef // runs the Chef client

Use taint when something happens to a resource and you want to recreate it. This command manually marks a Terraform-managed resource as tainted, forcing it to be destroyed and recreated on the next apply:
terraform taint aws_vpc.myvpc

The terraform state push command is used to manually upload a local state file to a remote backend. The command also works with local state.

What do you do when you hit errors and need to resolve them? Debug by changing the log level:
export TF_LOG=TRACE

HashiCorp style conventions state that you should indent 2 spaces for each nesting level to improve the readability of Terraform configurations.

Terraform supports #, //, and /* … */ for comments in configuration files.

What should you do when setting up a repo for the first time? Put these in .gitignore:

  • .terraform*
  • terraform.tfstate
  • *.auto.tfvars and terraform.tfvars (they may contain sensitive data such as public IPs and domain names)

Generating graphs is always a good idea to get more insight into the infrastructure that is (or is going to be) set up using Terraform. To create a graph:
terraform graph > graph.dot
// human-readable, but can be converted to an image
sudo apt install graphviz
cat graph.dot | dot -Tsvg > graph.svg
Now open graph.svg in Google Chrome.

What if your infra is too big? Say you have hundreds or thousands of VMs running. When you run terraform plan, Terraform internally makes a lot of API requests to gather information, and you will notice sluggish behavior. To avoid this kind of situation, always divide your code into modules (an ec2 module, an rds module, etc.) and run terraform plan against specific targets:
terraform plan -target=module.mymodule.aws_instance.myinstance
terraform apply -target=module.mymodule.aws_instance.myinstance

If you still find it sluggish, you can skip the refresh during the plan with the -refresh=false flag, or plan only the ec2 module:
terraform plan -refresh=false
terraform plan -target=module.ec2

In plan output, ~ means a resource will be updated in place.

Modules



All Terraform code lives in a module! The top-level module is called the root module. A module is just regular Terraform code in a folder, and modules can be nested.

Types of modules:

  • Local module
    module "local-module" {
      source = "/path/to/module"
    }

  • Terraform Registry
    module "published-registry" {
      source = "anshuman/lambda-function-archive"
    }

  • SCM repo module
    module "scm-module" {
      source = "github.com/anshuman-project/terraform"
    }

* When you configure S3 and DynamoDB as the backend and then run terraform apply, you will no longer see state files locally.

  • In GCP and Azure, the buckets themselves can store the lock state, so they don't need a separate database.
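A minimal sketch of such a backend configuration; the bucket and table names are placeholders:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"          # placeholder S3 bucket for state
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-2"
    dynamodb_table = "terraform-locks"             # placeholder DynamoDB table used for locking
    encrypt        = true
  }
}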

Always use small modules and sub-modules. Large modules are slow, insecure, risky, difficult to understand, difficult to review, and difficult to test. As in programming, the first rule of functions is that they should be small; the second rule is that they should be smaller than that. A large module, like a large function, is a code smell.

Basic principles in your Terraform modules: pass everything in through input variables, return everything through output variables, and build more complicated modules by combining simpler modules.

Every Terraform module you have in the modules folder should have a corresponding example in the examples folder. And every example in the examples folder should have a corresponding test in the test folder.


A great practice to follow when developing a new module is to write the example code first, before you write even a line of module code.

Pin the exact Terraform version, because state files are not backward compatible: once a Terraform state file has been written with a newer version of Terraform, you can no longer use that state file with any older version of Terraform.

The AWS, GCP, and Azure providers update often and do a good job of maintaining backward compatibility, so you typically want to pin to a specific major version but allow new patch versions to be picked up automatically so that you get easy access to new features.
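A minimal sketch showing both kinds of pinning; the version numbers are only examples:

terraform {
  required_version = "= 1.5.7"   # exact Terraform version pin (example)

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"         # any 4.x release, but never 5.x (example)
    }
  }
}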

Provisioners can be defined only within a resource, but sometimes, you want to execute a provisioner without tying it to a specific resource. You can do this using provisioners with null_resource.

resource "null_resource" "example" {
provisioner "local-exec" {
command = "echo \"Hello, World from $(uname -smp)\""
}
}

Define Backend



Since 2019, Terraform has supported partial backend configuration. Because you cannot use variables in a backend configuration, it used to be tedious to repeat the full backend configuration in every module; now you only need to provide the key in each module:
terraform {
  backend "s3" {
    key = "example/terraform.tfstate"
  }
}

The rest you provide at init time from a shared file:

$ cat backend.hcl
bucket         = "terraform-up-and-running-state"
region         = "us-east-2"
dynamodb_table = "terraform-up-and-running-locks"
encrypt        = true

terraform init -backend-config=backend.hcl

*init downloads providers and modules and configures your backend, all in one handy command.

When we use remote state, Terraform keeps all the temporary information in memory, with nothing persistent on disk; this is another Terraform security feature.

When you use workspaces with an S3 backend, the state is automatically saved under env:/<workspace_name>/<key you provided>. In the default workspace, it is saved directly under the key you provided.

The state files for all of your workspaces are stored in the same backend (e.g., the same S3 bucket). That means you use the same authentication and access controls for all the workspaces, which is one major reason workspaces are an unsuitable mechanism for isolating environments

The use of separate folders makes it much clearer which environments you’re deploying to, and the use of separate state files, with separate authentication mechanisms, makes it significantly less likely that a screw-up in one environment can have any impact on another.

Change the key to the same folder path as the webserver Terraform code:
stage/services/webserver-cluster/terraform.tfstate

This gives you a 1:1 mapping between the layout of your Terraform code in version control and your Terraform state files in S3

In an internal module, you will notice that there is no provider block in this configuration. When Terraform processes a module block, it will inherit the provider from the enclosing configuration. Because of this, we recommend that you do not include provider blocks in modules.

File Layout



I recommend using separate Terraform folders (and therefore separate state files) for each environment (staging, production, etc.) and for each component (VPC, services, databases). A VPC rarely changes, so there is no reason to bundle it with components that change often; separate each component.

When creating a module, you should always try to use a separate resource instead of the inline-block. Otherwise, your module will be less flexible and configurable.

*Terragrunt can be used to apply multiple folder-based configurations with a single command.

But if the app code and database code live in different folders, as I've recommended, they can no longer reference each other's outputs directly. Fortunately, Terraform offers a solution: the terraform_remote_state data source. You can use this data source to fetch the Terraform state file stored by another set of Terraform configurations, in a completely read-only manner.

data "terraform_remote_state" "vpc" {
  backend = "remote"

  config = {
    organization = "hashicorp"
    workspaces = {
      name = "vpc-prod"
    }
  }
}

resource "aws_instance" "foo" {
  # ...
  subnet_id = data.terraform_remote_state.vpc.outputs.subnet_id
}
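If the other configuration stores its state in S3, as recommended in this article, the data source looks like this instead (bucket and key are placeholders):

data "terraform_remote_state" "db" {
  backend = "s3"

  config = {
    bucket = "(YOUR_BUCKET_NAME)"
    key    = "stage/data-stores/mysql/terraform.tfstate"
    region = "us-east-2"
  }
}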

What if both your staging and production environments point to the same module folder? As soon as you make a change in that folder, it will affect both environments on the very next deployment. Solve this by using versioned modules.

$ git tag -a "v0.0.1" -m "First release of webserver-cluster module"
$ git push --follow-tags

In GitHub, you can use the GitHub UI to create a release, which will create a tag under the hood.

In the Git repo github.com/foo/modules (note that the double-slash in the following Git URL is required):
source = "github.com/foo/modules//webserver-cluster?ref=v0.0.1"

I generally recommend using Git tags as version numbers for modules. Branch names are not stable, as you always get the latest commit on a branch, which may change every time you run the init command, and the sha1 hashes are not very human-friendly. Git tags are as stable as a commit (in fact, a tag is just a pointer to a commit), but they allow you to use a friendly, readable name.

A particularly useful naming scheme for tags is semantic versioning. This is a versioning scheme of the format MAJOR.MINOR.PATCH (e.g., 1.0.4) with specific rules on when you should increment each part of the version number.

In a private Git repository, to use that repo as a module source, you need to give Terraform a way to authenticate to that Git repository. I recommend using SSH auth so that you don’t need to hardcode the credentials for your repo in the code itself.

git@github.com:acme/modules.git//example?ref=v0.1.2

This is how you can call modules in main.tf:
module "webserver_cluster" {
source = "git@github.com:foo/modules.git//webserver-cluster?ref=v0.0.2"
cluster_name = "webservers-stage"
db_remote_state_bucket = "(YOUR_BUCKET_NAME)"
db_remote_state_key = "stage/data-stores/mysql/terraform.tfstate"
instance_type = "t2.micro"
min_size = 2
max_size = 2
}

Don't Store Secrets in Variables



  • 1st option: store secrets, such as database passwords, in AWS Secrets Manager and read them with a data source (shown here on a hypothetical aws_db_instance resource):

resource "aws_db_instance" "example" {
  # ...
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "mysql-master-password-stage"
}

*In GCP - Google Cloud KMS and the google_kms_secret data source
*In Azure - Key Vault and the azurerm_key_vault_secret data source

  • 2nd option: manage secrets completely outside of Terraform (e.g., in a password manager such as 1Password, LastPass, or the macOS Keychain) and pass them into Terraform via an environment variable:

$ export TF_VAR_db_password="(YOUR_DB_PASSWORD)"
*Note the intentional space before the export command; it prevents the secret from being stored on disk in your Bash history.

  • 3rd option: store the secret in a command-line-friendly secret store, such as pass:

$ export TF_VAR_db_password=$(pass database-password)
$ terraform apply

*Note that secrets are always stored in the Terraform state; this is a security weakness in itself.

Use Templates



How to use a template:
data "template_file" "user_data" {
  template = file("user-data.sh")

  vars = {
    server_port = var.server_port
    db_address  = data.terraform_remote_state.db.outputs.address
    db_port     = data.terraform_remote_state.db.outputs.port
  }
}

$ cat user-data.sh

#!/bin/bash

cat > index.html <<EOF
Hello, World
DB address: ${db_address}
DB port: ${db_port}
EOF
nohup busybox httpd -f -p ${server_port} &

Paths - Terraform supports the following types of path references:

  • path.module Returns the filesystem path of the module where the expression is defined.
  • path.root Returns the filesystem path of the root module.
  • path.cwd Returns the filesystem path of the current working directory. In normal use of Terraform, this is the same as path.root, but some advanced uses of Terraform run it from a directory other than the root module directory, causing these paths to be different.

template = file("${path.module}/user-data.sh")
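As a side note, on Terraform 0.12 and later the built-in templatefile function can replace the template_file data source entirely; a minimal sketch using the same variables:

user_data = templatefile("${path.module}/user-data.sh", {
  server_port = var.server_port
  db_address  = data.terraform_remote_state.db.outputs.address
  db_port     = data.terraform_remote_state.db.outputs.port
})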

Use Locals
Values in the security group, including the "all IPs" CIDR block 0.0.0.0/0, the "any port" value of 0, and the "any protocol" value of "-1", are copied and pasted in several places throughout the module. Having these magic values hardcoded in multiple places makes the code more difficult to read and maintain. You can define them as local values in a locals block:

locals {
  http_port    = 80
  any_port     = 0
  any_protocol = "-1"
  tcp_protocol = "tcp"
  all_ips      = ["0.0.0.0/0"]
}

And use them like this:
port = local.http_port

Locals make your code easier to read and maintain, so use them often
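For example, a single ingress rule written with these locals; the security group reference is just a placeholder:

resource "aws_security_group_rule" "allow_http_inbound" {
  type              = "ingress"
  security_group_id = aws_security_group.example.id   # placeholder security group
  from_port         = local.http_port
  to_port           = local.http_port
  protocol          = local.tcp_protocol
  cidr_blocks       = local.all_ips
}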

Test



There are three types of automated tests:
Unit tests - verify the functionality of a single, small unit of code. The definition of unit varies, but in a general-purpose programming language, it’s typically a single function or class. Usually, any external dependencies—for example, databases, web services, even the filesystem—are replaced with test doubles or mocks that allow you to finely control the behavior of those dependencies (e.g., by returning a hard-coded response from a database mock) to test that your code handles a variety of scenarios.

Integration tests - verify that multiple units work together correctly. In a general-purpose programming language, an integration test consists of code that validates that several functions or classes work together correctly. Integration tests typically use a mix of real dependencies and mocks: for example, if you're testing the part of your app that communicates with the database, you might want to test it with a real database but mock out other dependencies, such as the app's authentication system.

End-to-end tests - involve running your entire architecture—for example, your apps, your data stores, your load balancers—and validating that your system works as a whole. Usually, these tests are done from the end-user’s perspective, such as using Selenium to automate interacting with your product via a web browser. End-to-end tests typically use real systems everywhere, without any mocks, in an architecture that mirrors production (albeit with fewer/smaller servers to save money).

Each type of test serves a different purpose and can catch different types of bugs, so you’ll likely want to use a mix of all three types. The purpose of unit tests is to have tests that run quickly so that you can get fast feedback on your changes and validate a variety of different permutations to build up confidence that the basic building blocks of your code (the individual units) work as expected. But just because individual units work correctly in isolation doesn’t mean that they will work correctly when combined, so you need integration tests to ensure the basic building blocks fit together correctly. And just because different parts of your system work correctly doesn’t mean they will work correctly when deployed in the real world, so you need end-to-end tests to validate that your code behaves as expected in conditions similar to production.

*There is a Go library called Terratest, which supports testing a wide variety of infrastructure-as-code tools (e.g., Terraform, Packer, Docker, Helm) across a wide variety of environments (e.g., AWS, Google Cloud, Kubernetes). It is a bit like a Swiss Army knife, with hundreds of tools built in that make it significantly easier to test infrastructure code, including first-class support for the test strategy just described, where you terraform apply some code, validate that it works, and then run terraform destroy at the end to clean up.

Note: no form of testing can guarantee that your code is free of bugs, so it’s more of a game of probability.

For testing, you have to deploy the Terraform code, test it, and then destroy it. The test tool could be curl, a MySQL client, a VPN client, or SSH into the server.

I strongly recommend that every team sets up an isolated sandbox environment, in which developers can bring up and tear down any infrastructure they want without worrying about affecting others. In fact, to reduce the chances of conflicts between multiple developers (e.g., two developers trying to create a load balancer with the same name), the gold standard is that each developer gets their own completely isolated sandbox environment. For example, if you’re using Terraform with AWS, the gold standard is for each developer to have their own AWS account that they can use to test anything they want

To keep costs from spiraling out of control, a key testing takeaway is to regularly clean up your sandbox environments, for example with cloud-nuke, an open-source tool that can delete all the resources in your cloud environment:
cloud-nuke aws --older-than 48h

Here's a quick way to check the health of your Terraform code: go into your live repository, pick several folders at random, and run terraform plan in each one. If the output is always "no changes," that's great, because it means that your infrastructure code matches what's actually deployed.



Automate


The recommended way to run Terraform is through a CI/CD pipeline instead of running commands manually. Use any of your favorite pipeline tools, such as Jenkins or Tekton. The pipeline should expose all the basic Terraform commands, and it should be able to send emails to approvers when terraform apply runs.

terraform init -input=false to initialize the working directory.
terraform plan -out=tfplan -input=false to create a plan and save it to the local file tfplan.
terraform apply -input=false tfplan to apply the plan stored in the file tfplan.

It may be necessary to use the -var and -var-file options on terraform plan

Final thoughts



Terraform's approach means you don't necessarily need to pre-build a DR site: if a disaster happens, you can recreate the infrastructure on the fly from code.

The DevOps world is full of fear: fear of downtime; fear of data loss; fear of security breaches. Every time you go to make a change, you’re always wondering, what will this affect? Will it work the same way in every environment? Will this cause another outage? And if there is an outage, how late into the night will you need to stay up to fix it this time? As companies grow, there is more and more at stake, which makes the deployment process even scarier, and even more error-prone. Many companies try to mitigate this risk by doing deployments less frequently, but the result is that each deployment is larger, and actually more prone to breakage.

Instead of telling your boss that Terraform is declarative, talk about how your team will be able to get projects done faster. Instead of talking about the fact that Terraform is multi cloud, talk about the peace of mind your boss can have to know that if you migrate clouds someday, you won’t need to change all of your tooling. And instead of explaining to your boss that Terraform is open source, help your boss see how much easier it will be to hire new developers for the team from a large, active open source community.

Adopting IaC has a relatively high up-front cost, although it will pay off in the long term for many scenarios.

Be careful even when renaming things: changing a resource or variable name can force resources to be destroyed and recreated, which causes downtime.

The Golden Rule of Terraform:
The main branch of the live repository should be a 1:1 representation of what’s actually deployed in production.

After you begin using Terraform, do not make changes via a web UI, or manual API calls, or any other mechanism.

Once you start using Terraform, you should only use Terraform. If you have existing infrastructure, use the import command.

Here’s what the process looks like for promoting, for instance, v0.0.6 of a Terraform module across the Dev, Stage, and Prod environments:

  1. Update the Dev environment to v0.0.6 and run terraform plan.
  2. Prompt someone to review and approve the plan; for example, send an automated message via Slack.
  3. If the plan is approved, deploy v0.0.6 to Dev by running terraform apply.
  4. Run your manual and automated tests in Dev.
  5. If v0.0.6 works well in Dev, repeat steps 1–4 to promote v0.0.6 to Staging.
  6. If v0.0.6 works well in Staging, repeat steps 1–4 again to promote v0.0.6 to Production.

Recommendation



If you want to enhance your skills in Terraform and IaC, here are my recommendations:

✅ Blog: https://spacelift.io/blog/what-are-terraform-modules-and-how-do-they-work
✅ Website: https://www.terraform-best-practices.com/
✅ Follow this guy: https://www.youtube.com/c/AntonBabenkoLive
✅ Read this book: https://www.oreilly.com/library/view/terraform-up/9781492046899/
