This post was originally published at thbe.org.
One of the things I've done most over my last two decades in the IT business is writing shell scripts. Whether it was to combine and condense different data sources, to build custom reporting, or to encrypt and decrypt data, there was a wide range of use cases where a small script helped to achieve the outcome I was looking for. Often these scripts supported parts of global processes with a high financial or legal impact on the business. There is nothing wrong with that as long as everything works. Unfortunately, the only constant in the IT business is change, and scripts that have silently done their job in the background for years suddenly stop working. In a well-managed IT organization you know where those scripts are used, you know the purpose of each script, and you can easily adapt them to a new environment. At least, that's the theory. Let's have a look at a real-life example:
#!/bin/sh
set -x
TC8=$(ps -ef | grep tomcat)
if ! [ -n "$TC8" ] ; then
/home/portal/tc8/bin/catalina.sh start &
exit
fi
Time to talk about enterprise-grade shell scripts. I'm pretty sure the script did its job when it was deployed to the production machine, but does anyone know what its purpose is? With some experience in this area, you will recognize that the script does something with a Tomcat instance. Based on the path, I would assume it's a Java portal running on Tomcat under the user "portal". But why does it use the ampersand and the exit command, and why does it only work sometimes? As you have already noticed, this script is an example of how enterprise-grade scripts should not look.
Header
This brings us to the question: what should an enterprise-grade shell script look like? First and most obviously, some additional context would be helpful. What is the purpose of the script, who wrote it, and when? A typical header that I use in my scripts looks like this:
#
# Author: Thomas Bendler <code@thbe.org>
# Date: Fri Apr 17 22:48:34 CEST 2020
#
# Note: To debug the script change the shebang to: /usr/bin/env bash -vx
#
# Prerequisite: This release needs a shell that can handle functions.
# If the shell is not able to handle functions, remove
# the error section.
#
# Release: 1.0.0
#
# ChangeLog: v0.1.0 - Initial release
# v0.9.0 - Prepare go-live
# v1.0.0 - Production go-live
#
# Purpose: Watchdog for a Tomcat 8 instance
#
Without reading a single line of code, you know who wrote the script and when. You know which version is deployed, you know the version history, the purpose, and so on. Depending on the processes in the company, it can be handy to add further information like the URL of the Git repository, the deployment pipeline that was used, or approvals. Whatever is used, it should be used in every script, and the structure of the information should be the same in every script.
Shebang
A script can be written in different technologies. It could be a shell script, a Python script, a Ruby script, or something else. To tell the operating system which kind of script it is, the so-called "shebang" is used. The shebang is a hash sign plus an exclamation mark, followed by the interpreter that should execute the script. In real life, you'll find lots of shebangs like the following:
#!/usr/local/bin/ksh
Here we see that someone manually installed a Korn shell on that box (because it's in /usr/local/bin/) and that the script uses this shell. This works in principle, but it's not how an enterprise-grade shell script should look. Imagine that at some point we want to deploy the script on another box where the Korn shell was installed through the package manager. This usually means the path to the shell is /usr/bin/ksh. Before we can deploy the script, we have to change the shebang. The better approach is to use the env binary instead. You'll find env on every box under /usr/bin/env. env takes the interpreter as an argument and looks it up in the PATH. So, assuming I would like to use the Korn shell, the shebang would be:
#!/usr/bin/env ksh
env knows the path to ksh and will call the correct binary for the shebang. This also works for Perl, Python, and other script interpreters.
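To see the lookup in action, you can ask the shell where an interpreter lives and let env start it the same way; a small sketch (the printed path will differ per system):

```shell
#!/usr/bin/env bash
# command -v performs the same PATH lookup that env uses, so it shows
# which binary a shebang like "#!/usr/bin/env bash" would end up running:
command -v bash    # e.g. /usr/bin/bash on one box, /usr/local/bin/bash on another

# env itself resolves the interpreter via the PATH and starts it:
/usr/bin/env bash -c 'echo "started via env"'
```

This is exactly why the script no longer cares whether the interpreter came from the package manager or a manual installation.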
Script behavior
Regardless of the programming language, it's always beneficial to know how a script behaves: when it stops, when it aborts, what is allowed and what is not, and so on. This is controlled by the set command and should be set consistently in all scripts:
### General script behavior ###
set -euo pipefail
In the end, this is a combination of three options:
-e stops the script as soon as a command fails
-u stops the script as soon as an unset variable is referenced
-o pipefail makes a pipeline fail as soon as any command in it fails
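A minimal sketch of the three options in action; each failing case runs in a child shell so it doesn't take the current session down:

```shell
#!/usr/bin/env bash
# -e: the script stops as soon as a command fails
bash -c 'set -e; false; echo "never printed"' || echo "-e aborted the child shell"

# -u: referencing an unset variable is a fatal error
bash -c 'set -u; echo "${NOT_SET}"' 2>/dev/null || echo "-u aborted the child shell"

# -o pipefail: a failing command anywhere in a pipeline fails the whole pipeline
bash -c 'set -o pipefail; false | true' || echo "pipefail reported the failure"
```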
Error handling
A central element of an enterprise-grade shell script is proper error handling. Errors can happen, and unfortunately they happen more often than anybody wants, so it's key to deal with them in a predictable way. I use a function called error_handling() to achieve this:
### Error handling ###
error_handling() {
  if [ "${RETURN_CODE}" -eq 0 ]; then
    echo_verbose "${SCRIPT_NAME} successful!"
  else
    echo_error "${SCRIPT_NAME} aborted, reason: ${EXIT_REASON}"
    echo; script_usage
  fi
  exit "${RETURN_CODE}"
}
trap "error_handling" EXIT HUP INT QUIT TERM
RETURN_CODE=0
EXIT_REASON="Finished!"
The way this function works is quite simple: whenever one of the signals listed at the end of the trap line is raised, the function error_handling() is called. You also see an important rule for writing enterprise-grade scripts here, the use of meaningful names for functions, variables, constants, and other script components. The purpose of the variable ${SCRIPT_NAME} is much easier to understand than that of ${0}.
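A stripped-down sketch of the trap mechanism: the handler runs on every exit path, whether the script finishes normally or receives one of the listed signals:

```shell
#!/usr/bin/env bash
# Minimal handler stand-in; the real one evaluates ${RETURN_CODE}
error_handling() { echo "handler called"; }
trap "error_handling" EXIT HUP INT QUIT TERM

echo "doing some work"
# When the script reaches its end (or is interrupted),
# error_handling() runs automatically via the EXIT trap.
```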
Output
You might also have noticed that I use additional functions in the error handling function:
### Print out information if in verbose mode ###
echo_verbose() { if [ "${ARGUMENT_VERBOSE}" -eq 1 ]; then echo "${@}"; fi }
### Print out information on error channel ###
echo_error() { echo "${@}" >&2; }
Good enterprise-grade scripts run silently by default and only become verbose when called with the respective option. The second function ensures that output in case of an error is redirected to STDERR instead of STDOUT. This enables calling programs to separate the normal script output from the error messages.
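With stand-ins for both helpers, the stream separation is easy to demonstrate (the script and log file names in the comment are only examples):

```shell
#!/usr/bin/env bash
### Stand-ins for the two output helpers ###
ARGUMENT_VERBOSE=1
echo_verbose() { if [ "${ARGUMENT_VERBOSE}" -eq 1 ]; then echo "${@}"; fi }
echo_error() { echo "${@}" >&2; }

echo_verbose "regular progress message"   # goes to STDOUT
echo_error "something went wrong"         # goes to STDERR

# A caller can now separate the two streams, e.g.:
#   ./watchdog.sh > script.log 2> error.log
```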
Dry run
Another good practice is to offer the possibility of a dry run:
### Don't execute commands if in dry run mode ###
execute_command() {
  COMMAND_RETURN_CODE=0
  if [ "${ARGUMENT_DRYRUN}" -eq 1 ]; then
    echo "Command to execute: ${*}"
  else
    # Record the return code instead of returning it; a non-zero return
    # would trigger set -e before the caller can evaluate the result
    "${@}" || COMMAND_RETURN_CODE=${?}
  fi
}
COMMAND_RETURN_CODE=0
When the execute_command function is used and the script was called with the dry run flag, the command is only displayed, not executed. Unfortunately, this function isn't that straightforward and requires some more thought before using it widely, especially with pipes or redirects.
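The pitfall with pipes can be sketched as follows: the shell splits the pipeline before execute_command is even called, so only the left-hand command ends up inside the wrapper, while the right-hand side of the pipe executes for real even in dry run mode:

```shell
#!/usr/bin/env bash
ARGUMENT_DRYRUN=1
execute_command() {
  if [ "${ARGUMENT_DRYRUN}" -eq 1 ]; then
    echo "Command to execute: ${*}"
  else
    "${@}"
  fi
}

# Works as intended, nothing is executed:
execute_command rm -f /tmp/example.log

# Pitfall: only "cat /etc/passwd" is wrapped; "wc -l" runs for real
# and counts the single dry run message instead of the file content:
execute_command cat /etc/passwd | wc -l
```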
Usage
Now that we have covered the execution helpers, we need to provide a function for the script usage:
### Print out usage information ###
script_usage() { cat <<EOT
usage: ${SCRIPT_NAME} [-n] [-v] [-h]
example: ./${SCRIPT_NAME} -v
arguments (optional):
-n: Dry run
-v: Be verbose
-h: Print this help
EOT
}
Defaults
With all the functions in place, we can start with the script code itself. The first thing to do is initialize the variables, because otherwise the script could fail due to uninitialized variables (remember set -u):
### Default script variables ###
export LC_ALL=C
export LANG=C
ARGUMENT_DRYRUN=0; ARGUMENT_VERBOSE=0
SCRIPT_NAME=$(basename "${0}")
Setting the language variables to an explicit value is also common good practice in shell scripting. It makes the output of called programs much more predictable, which is beneficial if the output is used for further actions.
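The effect is easy to see with sort, whose ordering depends on the locale; with LC_ALL=C the result is plain byte order, no matter how the box is configured:

```shell
#!/usr/bin/env bash
# Pin the locale so called programs behave identically on every box:
export LC_ALL=C
export LANG=C

# In the C locale, sort orders by byte value, so all uppercase letters
# come before the lowercase ones: A, B, a, b
printf 'b\nA\na\nB\n' | sort
```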
Arguments
The next step takes care of the arguments that might have been passed to the script during execution:
### Get the arguments used at script execution ###
while [ ${#} -ne 0 ]; do
  case "${1}" in
    -n|--dry-run) ARGUMENT_VERBOSE=1; ARGUMENT_DRYRUN=1 ;;
    -v|--verbose) ARGUMENT_VERBOSE=1 ;;
    -h|--help)    script_usage; exit ;;
    *)            RETURN_CODE=1; EXIT_REASON="Unknown argument: ${1}"; exit ;;
  esac
  shift
done
The code snippet is pretty straightforward: it loops as long as arguments exist and processes each argument passed to the script.
Prerequisites
The last part before the actual script code is the check whether the prerequisites have been met:
### Check script prerequisite ###
TOMCAT_CATALINA_SCRIPT=/home/portal/tc8/bin/catalina.sh
if ! [ -x "${TOMCAT_CATALINA_SCRIPT}" ]; then
  RETURN_CODE=1
  EXIT_REASON="The Tomcat catalina script (${TOMCAT_CATALINA_SCRIPT}) is not executable, aborting!"
  exit
fi
Main script logic
Now that we have everything covered and in place, it's time to implement the script logic:
### Check if Tomcat is running and start Tomcat if stopped ###
TOMCAT_STATUS=$(ps -ef | grep -E "[t]omcat " || echo "no")
echo_verbose "Is Tomcat running: ${TOMCAT_STATUS}"
if [ "${TOMCAT_STATUS}" = "no" ]; then
  echo_verbose "Try to start Tomcat ..."
  execute_command "${TOMCAT_CATALINA_SCRIPT}" start
  if [ "${COMMAND_RETURN_CODE}" -ne 0 ]; then
    RETURN_CODE=${COMMAND_RETURN_CODE}
    EXIT_REASON="Could not start Tomcat via ${TOMCAT_CATALINA_SCRIPT}, aborting!"
    exit
  fi
fi
As I mentioned at the beginning of the post, one of the questions was why the script sometimes works and sometimes doesn't. The reason is the way it checks whether a Tomcat process exists. If you do a simple grep on "tomcat", grep will occasionally find its own process in the process list, because the grep command line itself contains the word "tomcat". If you instead use a bracket expression like [t]omcat, the pattern still matches "tomcat", but the grep command line now contains "[t]omcat", which the pattern does not match, so the grep command is excluded from the result.
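The effect can be reproduced without a Tomcat installation; here a short-lived sleep process stands in for the Tomcat instance:

```shell
#!/usr/bin/env bash
sleep 30 &   # stand-in for the long-running Tomcat process
SLEEP_PID=${!}

# A plain grep may match its own command line, because it also contains "sleep":
ps -ef | grep "sleep 30"

# The bracket expression [s]leep still matches "sleep", but the grep
# command line now shows "[s]leep", which the pattern does not match:
ps -ef | grep -E "[s]leep 30"

kill "${SLEEP_PID}"
```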
The enterprise-grade shell script
Now we can put everything together and deploy the enterprise-grade shell script in a production environment with high financial impact, without the fear of operating hidden time bombs that create severe risks in case of failure, especially when the developers have meanwhile left the company:
#!/usr/bin/env bash
#
# Author: Thomas Bendler <code@thbe.org>
# Date: Fri Apr 17 22:48:34 CEST 2020
#
# Note: To debug the script change the shebang to: /usr/bin/env bash -vx
#
# Prerequisite: This release needs a shell that can handle functions.
# If the shell is not able to handle functions, remove
# the error section.
#
# Release: 1.0.0
#
# ChangeLog: v0.1.0 - Initial release
# v0.9.0 - Prepare go-live
# v1.0.0 - Production go-live
#
# Purpose: Watchdog for a Tomcat 8 instance
#
### General script behavior ###
set -euo pipefail
### Error handling ###
error_handling() {
  if [ "${RETURN_CODE}" -eq 0 ]; then
    echo_verbose "${SCRIPT_NAME} successful!"
  else
    echo_error "${SCRIPT_NAME} aborted, reason: ${EXIT_REASON}"
    echo; script_usage
  fi
  exit "${RETURN_CODE}"
}
trap "error_handling" EXIT HUP INT QUIT TERM
RETURN_CODE=0
EXIT_REASON="Finished!"
### Print out information if in verbose mode ###
echo_verbose() { if [ "${ARGUMENT_VERBOSE}" -eq 1 ]; then echo "${@}"; fi }
### Print out information on error channel ###
echo_error() { echo "${@}" >&2; }
### Don't execute commands if in dry run mode ###
execute_command() {
  COMMAND_RETURN_CODE=0
  if [ "${ARGUMENT_DRYRUN}" -eq 1 ]; then
    echo "Command to execute: ${*}"
  else
    # Record the return code instead of returning it; a non-zero return
    # would trigger set -e before the caller can evaluate the result
    "${@}" || COMMAND_RETURN_CODE=${?}
  fi
}
COMMAND_RETURN_CODE=0
### Print out usage information ###
script_usage() { cat <<EOT
usage: ${SCRIPT_NAME} [-n] [-v] [-h]
example: ./${SCRIPT_NAME} -v
arguments (optional):
-n: Dry run
-v: Be verbose
-h: Print this help
EOT
}
### Default script variables ###
export LC_ALL=C
export LANG=C
ARGUMENT_DRYRUN=0; ARGUMENT_VERBOSE=0
SCRIPT_NAME=$(basename "${0}")
### Get the arguments used at script execution ###
while [ ${#} -ne 0 ]; do
  case "${1}" in
    -n|--dry-run) ARGUMENT_VERBOSE=1; ARGUMENT_DRYRUN=1 ;;
    -v|--verbose) ARGUMENT_VERBOSE=1 ;;
    -h|--help)    script_usage; exit ;;
    *)            RETURN_CODE=1; EXIT_REASON="Unknown argument: ${1}"; exit ;;
  esac
  shift
done
### Check script prerequisite ###
TOMCAT_CATALINA_SCRIPT=/home/portal/tc8/bin/catalina.sh
if ! [ -x "${TOMCAT_CATALINA_SCRIPT}" ]; then
  RETURN_CODE=1
  EXIT_REASON="The Tomcat catalina script (${TOMCAT_CATALINA_SCRIPT}) is not executable, aborting!"
  exit
fi
### Check if Tomcat is running and start Tomcat if stopped ###
TOMCAT_STATUS=$(ps -ef | grep -E "[t]omcat " || echo "no")
echo_verbose "Is Tomcat running: ${TOMCAT_STATUS}"
if [ "${TOMCAT_STATUS}" = "no" ]; then
  echo_verbose "Try to start Tomcat ..."
  execute_command "${TOMCAT_CATALINA_SCRIPT}" start
  if [ "${COMMAND_RETURN_CODE}" -ne 0 ]; then
    RETURN_CODE=${COMMAND_RETURN_CODE}
    EXIT_REASON="Could not start Tomcat via ${TOMCAT_CATALINA_SCRIPT}, aborting!"
    exit
  fi
fi
Final thoughts
Let's conclude this exercise of writing enterprise-grade shell scripts with some thoughts. The first question I usually get: is this over-engineered? It depends; in the end it's all about risk, financial impact, maintainability, and so on. Templates and practices like this are used in environments where the flawless operation of the scripts that support processes is key, where it's not acceptable to pause a business for a week because a script stopped working and no one is able to fix it. These are the typical scenarios where it is worth the effort to standardize shell scripts as shown and to require such structures before anything goes into production. In hobby environments it's not strictly required, but even there it becomes handy once you have to change a script you wrote years before. However you do it, enjoy coding and happy scripting!
Comments
A possible way to avoid pipe issues with dry run is described at unix.stackexchange.com/a/433806, where the commenter said:
"For piped commands, what I did is to define the $pipe_cmd as 'cat && echo ', otherwise simply repeating echo will swallow the preceding command: echo hello | echo dude will print just dude, while echo hello | cat && echo dude will print both strings."
For 'enterprise-grade' shell scripts, do you tend to validate them through something like BATS?
I recommend proper testing, but how it is achieved depends on your environment. Especially in large enterprises, when talking about development, it's usually about "real" developers, the ones building a Java app or a C++ app or something like that. This also means that the environment focuses on those developments. Having said this, I recommend hijacking these existing environments: if they use Git, use Git as well; if they use an IDE, use the IDE as well; if they use a CI/CD pipeline, guess what. So I recommend linters, templates, automated testing, VCS, DevOps, and much more, but not everything can be implemented if the environment is restricted. In the end, use what is possible, and if BATS fits into your environment, you can use BATS as well.