DevOps with Ansible (4 Part Series)
This post was originally published at thbe.org.
In preparation for a meeting about optimizing server operations in a mid-size data center, I was looking for a lightweight approach to automate server installations. In the past, I primarily used Puppet for data center management. Although it is still a good choice for enterprise-level, full automation, it has some disadvantages. First of all, it is a client/server architecture that by nature requires more installation effort than agentless solutions. It also consumes more resources on the client, as the agent permanently tracks the client's state. Last but not least, it is a bit more complex when it comes to writing generically usable modules.
Long story short, I decided to give Puppet's number one competitor, Ansible, a try. Ansible has a similar, slightly smaller scope and is designed to be more lightweight than Puppet. You don't need to install agents on the clients, and the whole architecture looks a bit more straightforward.
As a proof of concept, I built a small automation procedure for my home office. As Ansible is owned by Red Hat, I used CentOS as the server operating system. CentOS can be installed automatically using kickstart and is therefore a good platform to test the general idea. As a first step, I created a generic kickstart file that lets me install CentOS 7 with a minimal package set (a kickstart file for CentOS 8 should look quite similar). This is my preferred approach, as I tend to leave the setup work to the automation tool. In this proof of concept, I use the root user for configuration management. In a real-world scenario, you should use a dedicated configuration management user in conjunction with sudo:
#version=DEVEL
#
# Kickstart installation file with minimal package set
#
# Author:   Thomas Bendler <firstname.lastname@example.org>
# Date:     Wed Jan 2 17:25:26 CET 2019
# Revision: 1.0
#
# Distribution: CentOS
# Version:      7
# Processor:    x86_64

# Kickstart settings
install
cdrom
text
reboot

# System language settings
lang en_US.UTF-8
keyboard --vckeymap=de-nodeadkeys --xlayouts='de (nodeadkeys)'
timezone Europe/Berlin --isUtc

# System password for root (create with e.g. pwkickstart)
rootpw --iscrypted $6$aaazzz

# System bootloader configuration
bootloader --append=" crashkernel=auto" --location=mbr --boot-drive=sda
ignoredisk --only-use=sda

# System partition information
clearpart --all --initlabel --drives=sda
autopart --type=lvm

# System network settings
network --bootproto=dhcp --device=link --onboot=on --ipv6=auto --hostname=node_1.local.domain

# System security settings
auth --enableshadow --passalgo=sha512
selinux --permissive
firewall --service=ssh

# System services
services --enabled="chronyd"

# System packages
%packages
@^minimal
@core
chrony
kexec-tools
%end

# System addons
%addon com_redhat_kdump --enable --reserve-mb='auto'
%end

# Post installation settings
%post --log=/root/kickstart-post.log
/usr/bin/logger "Starting anaconda postinstall"
/usr/bin/mkdir /root/.ssh
/usr/bin/chmod 700 /root/.ssh
/usr/bin/echo "ssh-rsa AAA[...]ZZZ admin1@management" > /root/.ssh/authorized_keys
/usr/bin/chmod 400 /root/.ssh/authorized_keys
sync
exit 0
%end
The last part of the kickstart file distributes the authorized_keys file to enable passwordless login to the root account. Again, in a real-world scenario, you should use this section to create a dedicated management user along with the required sudo configuration. Alternatively, you can use Ansible to create individual administrative user accounts on the target hosts.
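A `%post` section that creates such a dedicated management user could look like this (a minimal sketch; the user name `ansible` is an example of my own choosing, and the public key placeholder is the same elided value as above):

```
%post --log=/root/kickstart-post.log
# Create a dedicated configuration management user (name is an example)
/usr/sbin/useradd -m -c "Configuration Management" ansible
/usr/bin/mkdir -p /home/ansible/.ssh
/usr/bin/chmod 700 /home/ansible/.ssh
/usr/bin/echo "ssh-rsa AAA[...]ZZZ ansible@management" > /home/ansible/.ssh/authorized_keys
/usr/bin/chmod 400 /home/ansible/.ssh/authorized_keys
/usr/bin/chown -R ansible:ansible /home/ansible/.ssh
# Allow passwordless sudo for the management user
/usr/bin/echo "ansible ALL=(ALL) NOPASSWD: ALL" > /etc/sudoers.d/ansible
/usr/bin/chmod 440 /etc/sudoers.d/ansible
%end
```

With this in place, the inventory entries would use `ansible_user=ansible` together with privilege escalation instead of logging in as root.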
The remaining part is to tell the installer that it should use the kickstart file instead of asking the user how and what to install. This is done by simply adding inst.ks= to the kernel parameters. The installer supports different methods to fetch the file, which are described in the kickstart documentation. In this proof of concept, I simply use a web target:
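Booting the installer with a kickstart file served over HTTP could then look like this (the host name and path in the URL are examples):

```
vmlinuz initrd=initrd.img inst.ks=http://kickstart.local.domain/ks/centos7-minimal.cfg
```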
Once the kickstart-based installation is finished, the configuration management done with Ansible takes over. If you move the proof of concept to a real data center, you should consider extending the automated kickstart installation with tools like Cobbler. This enables fully fledged automation using PXE boot (network boot without a local operating system). Another option, depending on your security and update strategy, is to use templates to provision the operating system instead of doing a real installation. But from my point of view, this approach has too many disadvantages, especially on the operations side.
So far, I have shown how to automate the installation of the core operating system. When this process is done, we have a minimal operating system installation with a bit of basic configuration (mainly access-related). As already said, this is the point where Ansible performs the final configuration. I won't provide a manual on how to use Ansible in this post (you can find the complete documentation at https://docs.ansible.com/ansible/latest/index.html); instead, I will show some basic concepts of how the tool is used. To get an idea of how things are organized in Ansible, let's first take a look at the folder structure of the proof of concept:
├── group_vars
│   ├── all.yml
│   ├── datacenter1.yml
│   └── location1.yml
├── host_vars
│   └── server1.domain.local
│       └── network.yml
├── playbooks
│   ├── add_ssh_fingerprints.yml
│   └── reboot_hosts.yml
├── roles
│   ├── common
│   │   ├── tasks
│   │   │   ├── environment.yml
│   │   │   ├── localtime.yml
│   │   │   ├── main.yml
│   │   │   ├── motd.yml
│   │   │   ├── networking.yml
│   │   │   ├── repositories.yml
│   │   │   ├── tools.yml
│   │   │   ├── upgrade.yml
│   │   │   └── user.yml
│   │   └── templates
│   │       ├── custom_sh.j2
│   │       ├── ifcfg-interface.j2
│   │       ├── motd.j2
│   │       ├── network.j2
│   │       └── route-interface.j2
│   └── web
│       ├── handlers
│       │   └── main.yml
│       ├── tasks
│       │   └── main.yml
│       └── templates
│           ├── index_html.j2
│           └── nginx_conf.j2
├── common.yml
├── production
├── site.yml
├── staging
└── web.yml
OK, let's go through this tree step by step. The first two directories, group_vars and host_vars, contain the variables used in the configuration scripts. The idea is to define the scope of each variable: it can apply to all hosts, to a data center, to a region/location, or only to one specific host. Variables that apply to all managed hosts could be local administrative user IDs, for example. Variables that apply to a region/location could define which time server the hosts in that region should use. Data-center-related variables could cover routing, for example, and host-specific variables typically cover IPs, hostnames, and so on. This enables the administrator to set up global infrastructures with little effort.
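To make the scoping concrete, a location-wide variable file and a host-specific one could look like this (the file contents are illustrative; the variable names are my own and not prescribed by Ansible):

```yaml
# group_vars/location1.yml - applies to every host in location1
ntp_servers:
  - ntp1.local.domain
  - ntp2.local.domain

# host_vars/server1.domain.local/network.yml - applies to this host only
ipv4_address: 172.20.0.4
ipv4_gateway: 172.20.0.1
```

Ansible picks these files up automatically because their names match the group and host names in the inventory.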
Before I move on to the playbooks, I would like to highlight two files in the root directory: staging and production. Both files contain the servers and the groups (region, location, role) the servers belong to. The distinction between staging and production reflects the role of the respective environment (here you can see the staging file):
[hamburg-hosts]
node_1.local.domain ansible_ssh_host=172.20.0.4 ansible_user=root
node_2.local.domain ansible_ssh_host=172.20.0.5 ansible_user=root
node_3.local.domain ansible_ssh_host=172.20.0.6 ansible_user=root

[hamburg-web]
node_4.local.domain ansible_ssh_host=172.20.0.7 ansible_user=root
node_5.local.domain ansible_ssh_host=172.20.0.8 ansible_user=root
node_6.local.domain ansible_ssh_host=172.20.0.9 ansible_user=root
node_7.local.domain ansible_ssh_host=172.20.0.10 ansible_user=root

[berlin-web]
node_8.local.domain ansible_ssh_host=172.20.0.11 ansible_user=root
node_9.local.domain ansible_ssh_host=172.20.0.12 ansible_user=root
node_10.local.domain ansible_ssh_host=172.20.0.13 ansible_user=root

[web:children]
hamburg-web
berlin-web

[hamburg:children]
hamburg-hosts
hamburg-web

[berlin:children]
berlin-web
The hosts in the staging file can be addressed by the names in the brackets. The exception are the entries with a colon: those represent a collection of groups, for example all groups located in Hamburg, as shown with the entry [hamburg:children]. These hosts can be addressed with the keyword hamburg. The group names correspond to the variable files, which closes the circle. Now that we know how things are controlled, we can follow up with how things are done.
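To check which hosts a group name resolves to, you can let Ansible expand it against the inventory (assuming the staging file above):

```shell
ansible -i staging hamburg --list-hosts
```

This should list all seven nodes from hamburg-hosts and hamburg-web without connecting to any of them.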
Before we start with the playbooks that define how things are done, a few words on ad-hoc administration. Ansible can also be used for ad-hoc administration based on the inventory. This gives an administrator the ability to perform activities on a region, on a data center, or on all hosts at once. Let's assume we've updated an alias (CNAME) on our central DNS server. Before we use the updated alias everywhere in our landscape, we would like to know whether every host already resolves the updated alias. Instead of logging into each and every host and executing dig, we can do it in one go with Ansible:
ansible -i staging all -a "/bin/dig updated-cname.local.domain +short"
Another task would be to check whether all servers in the inventory are reachable or whether a specific service is running on them. This looks more or less like this:
ansible -i staging all -m ping
ansible -i staging all -m shell -a "/bin/ps aux | /bin/grep chronyd"
Playbooks themselves can be seen as a grouping of ad-hoc tasks. Instead of performing action after action manually, playbooks give you the ability to group and combine several activities and execute them step by step as a whole instruction set. It's a way of automating the things administrators previously did with checklists. A sample playbook that first installs glances and then reboots a host one minute after the playbook's execution looks like this:
---
- hosts: all
  become: yes
  become_user: root
  gather_facts: false

  tasks:
    # Install glances
    - name: Install glances
      yum: name=glances state=latest

    # Reboot the target host
    - name: Reboot the target host
      command: /sbin/shutdown --no-wall -r +1 "Reboot was triggered by Ansible"
The file is based on the YAML format and contains one or more activities applied to a scope. It can be executed like this:
ansible-playbook -i staging ./example_playbook.yml
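The same idea scales up through roles: the top-level site.yml from the folder structure above could simply pull in the per-role playbooks. This is a minimal sketch of how these files might be wired together, not necessarily how the original repository does it:

```yaml
# site.yml - configure the whole landscape in one run
- import_playbook: common.yml
- import_playbook: web.yml

# web.yml - apply the common and web roles to all web hosts
- hosts: web
  become: yes
  roles:
    - common
    - web
```

Running `ansible-playbook -i staging site.yml` would then configure every host, while `web.yml` alone covers just the web servers.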
For the time being, this is already sufficient to transition the widely used checklist approach to an automated one. But this is only the tip of the iceberg: automation, once it is in place, can do a lot more for you, which finally brings us back to the folder structure in the middle of this post. When I find some time, I will write another post and dive a little deeper into the automation possibilities, especially playbooks and roles.