Recently my department has been adopting Ansible into more of our workflows. I like it so much that I converted my dotfiles repo over to use Ansible. But for all our excitement at having so much more of our workflow automated, we recently learned that our visibility into what Ansible is actually doing to our infrastructure is less than perfect.
Rewind to Monday morning, when I started receiving email complaints that one of our batch jobs had failed to complete. After poring over a monolithic log file, I discovered that the user attempting to execute one of our scripts did not have directory-execute permissions on one of the parent directories our script resided in. It was a simple fix:
`chmod ...`. The job was released and ran to completion. Still, I wondered what had happened to change the directory permissions, especially since I knew that neither I nor anyone else had been making modifications to that particular system.
Sure enough, after reaching out to our sysadmins, it was confirmed that a playbook had been run with the aforementioned undesired outcome. No biggie: within 15 minutes the problematic playbook had been remedied.
Crisis averted. Conflict resolved. Still, I began to turn over in my mind how we might avoid this situation in the future, and it gave me an idea: we (conveniently) have our host inventory separated into groups by team. Wouldn't it be great if there were a plugin that could, when a host had been changed, look up variables on that host to see whether any "alerting configuration" had been set, and alert the appropriate stakeholders for that system?
Let's start with my desired solution and work backward. Say I have the following inventory:
```
[webteam]
webteam1.tld
webteam2.tld

[systeam]
systeam1.tld
systeam2.tld
```
I would like to manage who I want to alert, upon a system change, via a variable like this:

```yaml
notify_on_change:
  email:
    - firstname.lastname@example.org
    - email@example.com
  slack:
    ...slack config...
```
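Ansible's usual convention for group-scoped variables is a `group_vars/` directory next to the inventory or playbook; assuming that layout (the filename below is hypothetical), the config for the `webteam` group could live in a file like:

```yaml
# group_vars/webteam.yml (hypothetical path): these vars apply to every
# host in the [webteam] inventory group
notify_on_change:
  email:
    - firstname.lastname@example.org
    - email@example.com
  slack:
    ...slack config...
```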
Fantastic. But Ansible does not come with such functionality out of the box. But wait a minute! It turns out that developing a plugin is relatively painless, and that what I want to develop is called a "callback plugin". A callback plugin is a Python class whose methods get called during different phases of Ansible's lifecycle. The shortest callback plugin we could write would be the following:
```python
from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'custom'
```
This callback plugin will not do anything yet, but there are a few takeaways to note already. This callback plugin is advertising that:

- it wishes to consume the v2 API
- its type is `aggregate`. A callback plugin's type is important; more specifically, the type a callback plugin is *not* is even more important, as only one callback plugin of type `stdout` can be enabled at once.
- its name is `custom`. We will refer to the callback plugin later by this name. As far as I know, the filename (e.g., `whatever.py`) is not used to refer to the callback plugin.
As previously stated, multiple callback plugins can be active at once, but only one of type `stdout`. So how do we enable a callback plugin? Easy: jump into your friendly `ansible.cfg` and set the following config:
```ini
[defaults]
# ...
# comma-separate callback plugins to whitelist:
callback_whitelist = custom
```
This will instruct Ansible to enable the callback plugin that we have defined. But where do these callback plugins go? Once again, Ansible has an easy answer: in a `callback_plugins/` directory, either 1) within one of the roles you include, or 2) adjacent to your playbook(s). So far, a possible directory structure could look like the following:
```
my-playbook-folder/
    callback_plugins/
        my-callback-plugin.py
    ansible.cfg
    playbook.yml
```
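As an aside, if you would rather keep plugins elsewhere, I believe `ansible.cfg` also accepts a search-path override for callback plugins (the paths below are just examples):

```ini
[defaults]
# colon-separated list of directories to search for callback plugins
callback_plugins = ./callback_plugins:~/.ansible/plugins/callback
```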
Alright, so what exactly should our plugin do? To keep things simple, we will specify the behavior of our plugin to do the following:
- Use Slack as a notification service
- Do not send a notification unless a Task reports that it "changed" the target system
- If a change occurred, look up the stakeholders for a given system using that particular host's host/group vars.
This is where things start to get slightly hairy: in order to inspect the state of the Play(s) and Task(s), we will be digging into Ansible internals. Thankfully, there exists a great community to support you in this endeavor, but let us start by mentioning that all the methods your callback plugin can override can be found in the `CallbackBase` class. After a bit of digging around, and a little waiting for an answer to a StackOverflow question, I settled on the following base code:
```python
from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'is'

    def v2_playbook_on_play_start(self, play):
        self.vm = play.get_variable_manager()

    def v2_runner_on_ok(self, result):
        if self.vm.get_vars()['ansible_check_mode']:
            return
        host_vars = self.vm.get_vars()['hostvars'][result._host.name]
        if 'notify_on_change' in host_vars and result.is_changed():
            pass  # TODO: Post Slack message
```
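As for that TODO, my rough plan is to post to a Slack incoming webhook. The sketch below is only a working assumption, not tested plugin code: `SLACK_WEBHOOK_URL`, `build_change_message`, and `post_to_slack` are all names I made up, and the payload uses nothing but Slack's basic `text` field.

```python
import json
import urllib.request

# Hypothetical webhook URL -- a real one comes from Slack's
# "incoming webhooks" integration page.
SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/...'


def build_change_message(host, task_name):
    """Build a minimal Slack payload announcing that a task changed a host."""
    return {'text': 'Ansible changed {0} (task: {1})'.format(host, task_name)}


def post_to_slack(payload, url=SLACK_WEBHOOK_URL):
    """POST the payload as JSON to a Slack incoming-webhook URL."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )
    return urllib.request.urlopen(req)
```

Inside `v2_runner_on_ok`, the `pass` would then become something like `post_to_slack(build_change_message(result._host.name, result._task.get_name()))` -- again, unverified on my part.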
WARNING: I do NOT know Ansible internals and have found the above code to "work" after a lot of trial and error. My impression is that host and group vars are instantiated and merged on a per-play basis. That is why the variable manager is captured "on play start" and stored for later use. Later, when a task finishes, we use the variable manager to check 1) whether Ansible is running in `--check` mode, and 2) to retrieve the specific host's vars.
So this is the part of the article where I get irrationally angry that Python does not support type annotations. If it did, I might be able to rapidly discover exactly what I could pull out of the parameters being passed to my callback plugin's hooks. Who knows, I might even be able to use IDE hints like IntelliSense or Content Assist to quickly retrieve what I need.
If you could see me at this moment, my face red with rage, sweat dripping down my neck, you might notice that I am calming down and reminding myself of the merits of scripting languages. In fact, one of my favorite rapid-development tools is Node.js. At this point, I set my bias aside and trudge on. Time to pull out some debugging-fu.
After a little poking around, I found a one-line technique you can use to suspend any Python program and drop into an interactive debugger:
```python
import pdb; pdb.set_trace()
```
Fantastic! I can now drop this into my plugin at a point where I would like to inspect vars and iterate (very slowly) until I get some code that works.
```python
class CallbackModule(CallbackBase):
    # ...
    def v2_runner_on_ok(self, result):
        import pdb; pdb.set_trace()
        # ... same code as before
    # ...
```
Now, if I invoke Ansible like normal, it will run until it calls my
`v2_runner_on_ok` method, at which point it will immediately suspend, dropping me into a read-eval-print loop (REPL) where I can interactively step through my code and inspect the variables in my scope!
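For reference, a few standard `pdb` commands that earn their keep while poking around in these callback hooks (nothing Ansible-specific here):

```
(Pdb) pp result.__dict__     # pretty-print the task result's attributes
(Pdb) p result._host.name    # print the value of an expression
(Pdb) l                      # list source code around the current line
(Pdb) n                      # execute the next line
(Pdb) c                      # continue running until the next breakpoint
```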
I'm going to end this article here, as it is already turning out longer than I had planned. Even if it were not for that, I would still have to end it, as that is all I have implemented at this point. Hopefully I will finish my callback plugin soon (or one of you will beat me to it), and I will share my findings and code in a "Part II".