DEV Community

Thomas H Jones II
Thomas H Jones II

Posted on • Originally published at on

Working Around Errors Caused By Poorly-Built AMIs (Networking Edition)

Over the past several years, the team I work on created a set of provisioning-automation tools that we've used with/for a NUMBER of customers. The automation is pretty well designed to run "anywhere".

Cue current customer/project. They're an AWS-using customer. They maintain their own AMIs. Unfortunately, our automation would break during the hardening phase of the deployment automation. After a waste of more than a man-day, discovered the root cause of the problem: when they build their EL7 AMIs, they don't do an adequate cleanup job.

Discovered that there were spurious ifcfg-* files in the resultant EC2s' /etc/sysconfig/network-scripts directory. Customer's AMI-users had never really noticed this oversight. All they really knew was that "networking appears to work", so had never noticed that the network.service systemd unit was actually in a fault state. Whipped out journalctl to find that the systemd unit was attempting to online interfaces that didn't exist on their EC2s ...because, while there were ifcfg-* files present, corresponding interface-directories in /sys/class/net didn't actually exist.

Because our hardening tools, as part of ensuring that network-related hardenings all get applied, does (the equivalent of) systemctl restart network.service, the resultant fault recorded by systemd gets propagated out to the hardening-automation causing the automation to record a fault. Consequently, our tools were aborting on the apparent fault.

So, how to pre-clean the system so that the standard provisioning automation would work? Fortunately, AWS lets you inject boot-time logic via cloud-init scripts. I whipped up a quick script to eliminate the superfluous ifcfg-* files:

for IFFILE in $( echo /etc/sysconfig/network-scripts/ifcfg-* )
   [[-e /sys/class/net/${IFFILE//*ifcfg-/}]] || (
     printf "Device %s not found. Nuking... " "${IFFILE//*ifcfg-/}" && \
     rm "${IFFILE}" || ( echo FAILED ; exit 1 )
     echo "Success!"
Enter fullscreen mode Exit fullscreen mode

Launched a new EC2 with the userData addition. When the "real" provisioning automation ran, no more errors. Dandy.

Ugh... Hate having to kludge to work around error-conditions that simply should not occur.

Discussion (0)