Bash++ Parsing

#bash #linux #devops

Introduction

This morning, while on a walk with my son, I asked him to come up with a Bash issue that needs improvement. He mentioned regex parsing, which I agree is a chore. Afterwards I sat down and wrote the 'regex_read' function, which supplies a convenient 'read'-like API for this purpose. I will introduce 'regex_read' further below, but first let's review using the builtin 'read' command for parsing.

Parsing with 'read'

A common way to parse input in Bash is to use the builtin 'read' command. This command reads one line (or record) at a time, and then splits out tokens into variables. Consider this typical example:

# -r keeps backslashes in the input literal
while read -r TOK1 TOK2 ALL_ELSE; do
   # Do something with $TOK1 and $TOK2
   : # placeholder; a loop body may not be empty
done </some/input/file

Here the file '/some/input/file' is opened and connected to the shell's stdin within the scope of this loop, thus providing input for 'read'. There are implied defaults in this example of which you need to be aware:

Each invocation of 'read' consumes bytes up to and including a delimiter character, which by default is newline. You can change this value with the '-d' switch.
'read' splits each record into tokens according to $IFS, the contents of the input field separator variable.
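
To see both defaults in action, here is a small sketch that feeds 'read' from here-strings. The sample text and variable names are made up for illustration:

```bash
#!/bin/bash
# Default behavior: the record ends at newline, tokens are split on $IFS.
# The last variable receives whatever is left of the record.
read -r first second rest <<< 'alpha beta gamma delta'
echo "first='$first' second='$second' rest='$rest'"

# With -d the record ends at ':' instead of newline.
read -r -d ':' record <<< 'one:two:three'
echo "record='$record'"
```

The first 'read' leaves 'gamma delta' in $rest, and the second stops consuming input at the first colon.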

Usually $IFS includes unprintable characters. If you'd like to know what those characters are, the 'printf' builtin command comes to the rescue:

# Issue command
printf '%q\n' "$IFS"
# Output:
$' \t\n'

Now we can see that $IFS contains a space, a tab, and a newline. If you need to parse a CSV file, you'll want something like:

IFS=$',\n'
read COL1 COL2 COL3 COL4 ALL_ELSE

This syntax is fairly concise and convenient. If it works for the problem at hand, there is nothing to fix. Sometimes being able to specify the delimiter character and the contents of $IFS isn't enough, though. If you are faced with this, then regex parsing is your best option.
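
One thing to watch: assigning IFS on a line by itself, as in the CSV example above, changes it for the rest of the script. A variation I find safer is to prefix the assignment to 'read' so it applies to that one command only. This is just a sketch with made-up sample data, and it assumes simple CSV with no quoted fields or embedded commas:

```bash
#!/bin/bash
# Hypothetical sample data; no quoted fields or embedded commas.
names=()
while IFS=, read -r name qty price; do
   names+=("$name")
   echo "name='$name' qty='$qty' price='$price'"
done <<'EOF'
apple,3,0.50
pear,10,0.25
EOF

# $IFS outside of the 'read' command is untouched
printf '%q\n' "$IFS"
```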

Parsing with Regexes

Parsing with regexes is a common approach for complex input data. In Bash it is a bit clunky, and it looks like this:

while IFS= read -r; do
   [[ $REPLY =~ systemd\[([^:]*)\]:\ (.*)$ ]] || continue
   full_match="${BASH_REMATCH[0]}"
   pid="${BASH_REMATCH[1]}" 
   msg="${BASH_REMATCH[2]}"
   echo "systemd message: pid= '$pid', msg= '$msg'"
done </var/log/syslog

As you can see, fishing the tokens out of the BASH_REMATCH array is a boilerplate chore. It is noteworthy that $IFS is set to nothing in the example, so that 'read' does no field splitting. The record is retrieved from $REPLY, which is the documented default variable when no variable names are supplied to 'read'.
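
If you want to experiment with the matching step by itself, the sketch below applies a similar regex to a single made-up log line (the line is an assumption, not taken from a real syslog). Keeping the pattern in a variable is a common habit because it avoids backslash-escaping spaces inside '[[ ]]':

```bash
#!/bin/bash
# Hypothetical syslog-style line for illustration only
line='Sep 14 10:29:20 myhost systemd[1]: Started Session 42 of user root.'

# Pattern held in a variable, so the space needs no backslash
re='systemd\[([^:]*)\]: (.*)$'

if [[ $line =~ $re ]]; then
   # Index 0 is the full match, 1..n are the parenthesized groups
   pid=${BASH_REMATCH[1]}
   msg=${BASH_REMATCH[2]}
   echo "pid='$pid' msg='$msg'"
fi
```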

Parsing with 'regex_read'

In a previous post I introduced the Bash++ return stack facility. Now I will use that facility to eliminate some drudgery:

#!/bin/bash
############################################################
# Example script to demonstrate bash++ regex_read function
#
# John Robertson <john@rrci.com>
# Initial release: Mon Sep 14 10:29:20 EDT 2020
#

# Halt on error, no globbing, no unbound variables
set -efu

# import oop facilities and other goodies
source ../bash++

###################################
### Execution starts here #########
###################################

# Open file to supply input on file descriptor $FD.
# Use bash 4.1+ syntax to assign the next unused file descriptor to variable FD.
# We do this so all standard streams remain available inside of loop for
# interactive commands.
exec {FD}</var/log/syslog

# Loop until no more data available from $FD
# Regex matches 'systemd' syslog entries, breaks out date stamp, pid, and message
while regex_read '^([^ ]+ [^ ]+ [^ ]+) .*systemd\[([^:]*)\]: (.*)' -u $FD; do

   # First fetch the number of matches from the return stack.
   RTN_pop n_matches

   # Not interested in less than perfect match
   (( n_matches == 4 )) || continue

   # Pop match results into variables
   RTN_pop full_match dateStamp pid msg
   # Clear the terminal
   clear
#  echo "Full match is: '$full_match'"
   echo "systemd message: pid= '$pid', dateStamp= '$dateStamp',  msg= '$msg'"

   # Use builtin bash menuing to branch on user's choice
   PS3='Action? '
   select action in 'ignore' 'review' 'quit'; do

      case $action in

         ignore) ;; # no worries

         review) read -p 'Chase up all relevant information, present to user. [Return to continue] ';;

         quit) exit 0;;

      esac

      # go get another line from syslog
      break
   done # End of 'select' menu loop

done

There are several points of interest in this example:

The input file is opened and assigned an unused file descriptor, which is placed in $FD. This leaves the shell's stdin available for user interaction within the 'regex_read' loop.
'regex_read' syntax is just like 'read', except that the regex itself must be the first argument; all arguments after the first are passed through to 'read'. Match results will all be placed on the return stack.
The builtin 'select' command is used within the loop to present a menu to the user and branch on the user's choice.
With the return stack it is possible to accommodate a variable number of complex return values from a function.
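
The file descriptor technique from the first point can be tried in isolation. In this sketch a scratch file from 'mktemp' stands in for /var/log/syslog, and the variable names are my own. It assumes bash 4.1 or later for the '{FD}' syntax:

```bash
#!/bin/bash
# Scratch file standing in for a real log (illustration only)
tmp=$(mktemp)
printf 'line one\nline two\n' > "$tmp"

# Let bash pick an unused descriptor and store its number in FD
exec {FD}<"$tmp"

lines=()
# 'read -u' pulls from $FD, so stdin stays free for prompts and menus
while IFS= read -r -u "$FD" line; do
   lines+=("$line")
   echo "got: $line"
done

# Close the descriptor and clean up
exec {FD}<&-
rm -f "$tmp"
```

Closing the descriptor when the loop is done is optional in a short script, but it is a good habit in longer ones.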

'regex_read' Implementation

The implementation of 'regex_read' is simple, as it is essentially derived from 'read' itself. Here it is:

function regex_read ()
############################################################
# Similar to bash 'read' builtin, but parses subsequent
# read buffer using the supplied regular expression.
# Arguments:
#   regex pattern
# Returns:
#   logical TRUE if read was successful
#   or logical FALSE on end-of-file condition.
# Return stack:
#   Full string match (if any)
#   token1_match (if any)
#   ...
#   Last value pushed is _always_ the number of matches found (0 if none). Pop it first.
#   
{
   # Stash the regular expression
   local ndx count regex="$1"

   # All other args are for 'read'
   shift

   # Call read with other supplied args, quoted so each arrives intact. Fails on EOF
   IFS= read "$@" || return 1

   # Apply regular expression parsing to read buffer
   if [[ $REPLY =~ $regex ]]; then
      # Place results on return stack
      count=${#BASH_REMATCH[@]}
      for (( ndx= 0; ndx < count; ++ndx )); do
         RTN_push "${BASH_REMATCH[$ndx]}"
      done
      # Last stack arg is number of match results
      RTN_push $count

   else
      # regex failed to match
      RTN_push 0
   fi
}
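
If you would like to play with the idea without sourcing bash++, the following self-contained sketch mimics the same flow. Be aware that 'demo_push', 'demo_pop', and 'regex_read_demo' are hypothetical stand-ins I made up for this illustration; they are not the bash++ implementation. In particular, this stand-in pops one value at a time in strict last-in-first-out order, which may differ from how 'RTN_pop' handles several names at once.

```bash
#!/bin/bash
# Hypothetical stand-in for a return stack; NOT the bash++ code.
declare -a DEMO_STACK=()

demo_push() {   # push every argument onto the stack
   DEMO_STACK+=("$@")
}

demo_pop() {    # pop the top value into the variable named by $1
   local _top=$(( ${#DEMO_STACK[@]} - 1 ))
   (( _top >= 0 )) || return 1
   printf -v "$1" '%s' "${DEMO_STACK[_top]}"
   unset "DEMO_STACK[$_top]"
}

regex_read_demo() {
   local regex=$1
   shift
   # Remaining args go to 'read'; fails on EOF
   IFS= read -r "$@" || return 1
   if [[ $REPLY =~ $regex ]]; then
      # Push all matches, then the match count on top
      demo_push "${BASH_REMATCH[@]}" "${#BASH_REMATCH[@]}"
   else
      demo_push 0
   fi
}

pairs=()
while regex_read_demo '^([a-z]+)=([0-9]+)$'; do
   demo_pop n
   (( n == 3 )) || continue
   # Values come back in reverse push order
   demo_pop value
   demo_pop key
   demo_pop full
   pairs+=("$key=$value")
   echo "key='$key' value='$value'"
done <<'EOF'
width=80
not a match
height=24
EOF
```

The count on top of the stack is what lets the caller decide how many values to pop, which is the same design choice 'regex_read' makes.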

Conclusion

Bash can parse data using regular expressions. This task is made much simpler by the 'regex_read' function, which is imported by sourcing bash++. All files in this post are available from the same GitHub repository.