Terraform is a popular, open-source infrastructure as code (IaC) software tool.
This article aims to present a few tips on how to use it, based on hands-on experience. Readers are assumed to have at least some level of Terraform working knowledge.
Let's say we created a bunch of Terraform scripts. Most probably we keep them in the repository of our choice. By doing so, we can easily share them between team members.
The question arises: what about the State?
By default, it's stored in a file in the current working directory where Terraform was run. Should it be pushed to the repository together with the Terraform scripts?
Actually, it's not the best idea. The state file is machine-generated, and there is a significant probability of frequent merge conflicts between different revisions. Those conflicts would have to be resolved by hand, and that won't be easy.
There are two options to handle this:
- Local state - state kept as a file in a shared location. Sharing can be achieved with network-attached storage, or there can be one dedicated "builder" machine reused by the whole team.
- Remote state - state kept on remote storage. This is a feature of Backends, and there are several of them to choose from. It's worth checking whether a given Backend supports a locking mechanism (for example, Oracle Object Storage currently doesn't). Locking prevents two or more users from accidentally running Terraform at the same time, ensuring that each Terraform run begins with the most recently updated State.
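As an illustration, here is a minimal sketch of a remote state configuration using the S3 Backend, which supports locking via a DynamoDB table (the bucket and table names are hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-team-tf-state"          # hypothetical bucket name
    key            = "network/terraform.tfstate" # path to this project's state
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"           # enables state locking
    encrypt        = true
  }
}
```

After adding or changing a backend block, running `terraform init` offers to migrate the existing local state to the remote storage.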
Whatever Backend we use, and regardless of whether it supports the locking mechanism, if two users run the same set of Terraform scripts that are out of sync, we are in trouble.
Let's imagine a situation where two developers pull the same scripts from the repository. Developer A modifies the scripts by adding an additional Compute instance. She runs the scripts and the instance is provisioned. The shared state is updated.
A few minutes later, Developer B runs his version of the scripts (which he didn't modify). Terraform compares the content of the shared state with the content of the scripts and finds out that:
- The instance was provisioned on the infrastructure [information from the State]
- There is no such instance in the current scripts
Based on the above, Terraform comes to the conclusion that the Compute instance has to be decommissioned. Obviously, this is not what we expected.
To prevent such situations, one must make sure that Terraform is always run using up-to-date scripts. This can be done by defining a manual process or with a tool.
A CI/CD pipeline or job can be created for that purpose. Alternatively, a dedicated service can be used, like Resource Manager, which is part of the OCI offering.
Two things should be taken into consideration here: avoiding redundancy and planning for efficient use.
On the redundancy side, one should consider:
- Moving common elements to modules to promote reusability
- Using variables to parametrise the scripts
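For example, a couple of input variables might look like this (the names and the default shape are illustrative):

```hcl
variable "instance_shape" {
  description = "Compute shape used for worker instances"
  type        = string
  default     = "VM.Standard.E4.Flex" # adjust to your tenancy
}

variable "db_admin_password" {
  description = "Supplied externally, never committed to the repository"
  type        = string
  sensitive   = true # keeps the value out of plan/apply output
}
```

Values for the sensitive variables can then be passed in via a git-ignored `*.tfvars` file or environment variables such as `TF_VAR_db_admin_password`.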
All sensitive data should be removed from the scripts and loaded from external variables. In this post, I describe one possible approach to doing that in a safe way: Link
When it comes to efficiency, we should first reflect on how we are going to provision and decommission our infrastructure. Things that we want to provision/deprovision together should obviously go together in the scripts.
However, at the same time, we should keep in mind that Terraform usually follows an "everything or nothing" approach. In other words - either we provision everything or nothing. The same holds true for decommissioning. Of course, there are ways to narrow down the scope (the "-target" option can be used to focus Terraform's attention on only a subset of resources), but that should be treated as an exception rather than a rule.
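For completeness, this is what narrowing the scope looks like on the command line (the resource address is hypothetical):

```shell
# Plan and apply only the selected resource and its dependencies,
# leaving everything else in the configuration untouched.
terraform plan  -target=oci_core_instance.app
terraform apply -target=oci_core_instance.app
```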
So, it's better to have a few independent sets of scripts which we can run separately and orchestrate as needed, even if they are tightly coupled and pertain to the same piece of software and infrastructure.
For example, let's say we want to provision a Kubernetes cluster. Instead of putting everything into one big set of scripts, we can divide it into the following components:
- Identity provider
- Load balancer
- Image registry
- Control + data plane
- Extensions like storage, cert-manager, etc.
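With such a split, each component lives in its own root module with its own state, and one component can still consume another's outputs through the `terraform_remote_state` data source. A sketch, assuming the S3 Backend and illustrative names:

```hcl
# Inside the "control + data plane" component: read the outputs
# published by the independently managed load balancer component.
data "terraform_remote_state" "lb" {
  backend = "s3"
  config = {
    bucket = "my-team-tf-state"
    key    = "load-balancer/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Reference an output elsewhere in the component, e.g.:
#   data.terraform_remote_state.lb.outputs.backend_set_id
```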
Each component on the list above is a complex thing in itself. It's good to have the possibility to approach them separately or together, depending on the need.
A frequent question is: which tool should be used, Ansible or Terraform?
To answer it, let's first differentiate between the management of:
- Infrastructure - VMs, storage, networking, etc.
- Configuration - software installed on top of the infrastructure
We can definitely use either tool to cover both areas. That especially makes sense for simple use cases. For example, we can go with Terraform only and use cloud-init/provisioners for configuration management.
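As a sketch of the Terraform-only route, a cloud-init script can be attached to an OCI Compute instance through its metadata (the field values and file name are illustrative):

```hcl
resource "oci_core_instance" "web" {
  availability_domain = var.availability_domain
  compartment_id      = var.compartment_id
  shape               = "VM.Standard.E4.Flex"

  create_vnic_details {
    subnet_id = var.subnet_id
  }

  source_details {
    source_type = "image"
    source_id   = var.image_id
  }

  # cloud-init handles the configuration part; OCI expects it base64-encoded
  metadata = {
    user_data = base64encode(file("${path.module}/cloud-init.yaml"))
  }
}
```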
However, in more complex situations, in my opinion it's better to use each tool for what it was originally designed for, which means: Terraform for infrastructure and Ansible for configuration management. It just makes things easier and more natural.
And one last important piece of advice: regardless of the tool used, scripts should be idempotent. It increases the implementation effort a bit (especially in the case of Ansible) but pays off greatly later on.