teip: "Masking tape" for Shell is what we needed

Yasuhiro Yamada (greymd)・10 min read

Question

The following is part of the /var/log/secure file on my server.
The entire size of the original file is several hundred MiB.

$ cat test_secure
May 26 03:19:26 localhost sshd[17872]: Received disconnect from 192.0.2.152 port 29864:11:  [preauth]
May 26 03:19:26 localhost sshd[17872]: Disconnected from 192.0.2.78 port 29864 [preauth]
May 26 03:21:10 localhost sshd[17927]: Invalid user amavis1 from 192.0.2.148 port 53364
May 26 03:21:10 localhost sshd[17927]: input_userauth_request: invalid user amavis1 [preauth]
...

I had to convert the datetime at the beginning of each line to UNIX time (don't ask me why) while logged in over SSH.
This is the output I wanted.

1590459566 localhost sshd[17872]: Received disconnect from 192.0.2.152 port 29864:11:  [preauth]
1590459566 localhost sshd[17872]: Disconnected from 192.0.2.78 port 29864 [preauth]
1590459670 localhost sshd[17927]: Invalid user amavis1 from 192.0.2.148 port 53364
1590459670 localhost sshd[17927]: input_userauth_request: invalid user amavis1 [preauth]
...

I would like to finish this kind of work EASILY on the terminal.

So, how would you get this done?

I've prepared a sample file (100 MiB when extracted), so if you're interested, please give it a try.

Overview of this article

  • Typical UNIX commands and Shell do not make it easy to modify only part of an input or file.
  • If you try to accomplish this "partial modification", more often than not you'll end up with a complex "script" or a "one-liner" with terrible performance.
  • teip can be the "masking tape" for pipes.
  • With teip, such tasks become easy.
  • A one-liner with teip can run faster than other single-threaded commands.

Answers for the question

Back to the question. I'll show you the most common answers.

Answer 1: while statement

If you notice that the datetime can be parsed with the date command, you can use the while statement of Bash to read and format each line as shown below.

$ cat test_secure | while read -r m d t rest; do echo "$(date -d "$m $d $t" +%s) $rest" ; done
...
1590463166 localhost sshd[17872]: Disconnected from 192.0.2.78 port 29864 [preauth]
1590463270 localhost sshd[17927]: Invalid user amavis1 from 192.0.2.148 port 53364
...

However, this is a bad practice that beginners tend to fall into. This method is extremely slow: only a few tens of KiB per second in my environment (a t3.medium instance on AWS). Let's measure the speed with the pv command [1].

$ cat test_secure | while read -r m d t rest; do echo "$(date -d "$m $d $t" +%s) $rest" ; done | pv > /dev/null
 329KiB 0:00:04 [84.7KiB/s] [...]

Let's say the processing speed is 100 KiB/s: a 100 MiB file will take 17 minutes to complete. If the file is 1 GiB, it will take 175 minutes!
This is not good for the environment because the conversion consumes a lot of energy 😢🌎, and the SSH session may time out.

This is not just a problem with the while statement. Any approach that forks date for every line has essentially the same problem. Calling date from xargs, awk, or perl, for example, is similarly slow.
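
To make the difference concrete, here is a minimal sketch that runs both patterns on a small generated sample and confirms they produce the same output. The sample content, the appended year, and TZ=UTC are my assumptions, added only so the result is reproducible; GNU date is assumed.

```shell
# Hypothetical sample: 200 syslog-style timestamps with the year appended,
# so the result does not depend on the current year or time zone.
export TZ=UTC
sample=$(mktemp)
i=0
while [ "$i" -lt 200 ]; do
  echo "May 26 03:19:26 2020"
  i=$((i + 1))
done > "$sample"

# Slow pattern: one date process is forked for every line.
per_line=$(while read -r ts; do date -d "$ts" +%s; done < "$sample")

# Fast pattern: a single date process reads every line.
batch=$(date -f "$sample" +%s)

# Both produce the same 200 timestamps; only the process count differs.
if [ "$per_line" = "$batch" ]; then echo "identical output"; fi
rm -f "$sample"
```

Prefixing each of the two conversions with time on a larger sample should show the gap growing with the number of lines.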

Answer 2: Separate, Convert and Merge

OK, let's try another idea. Extract the datetimes first, convert them to UNIX time, and save the result to a file named seconds.

$ cat test_secure | cut -c 1-15 | date -f- +%s > seconds
$ less seconds
1590463166
1590463166
1590463270
...

After that, combine them with the rest of the file.

$ paste -d ' ' seconds <(cut -c 17- test_secure)
...
1590463166 localhost sshd[17872]: Disconnected from 192.0.2.78 port 29864 [preauth]
1590463270 localhost sshd[17927]: Invalid user amavis1 from 192.0.2.148 port 53364
...

It's a bit of a hassle, but if you have plenty of disk space, this is a good way to go. Furthermore, this method is relatively fast.

But in real-life situations, this method may take more effort than expected.
Huge log files are often partially corrupted, like this.

May 22 01:15:17 localhost sshd[27705]: ...
May 22 01:15:17 localhost sshd[27705]: ...
: Connection closed by 192.0.2.125 port 27258 [preauth] !!!!!!!!!!! Broken line
May 22 01:15:17 localhost sshd[27707]: ...

Bit flips and undefined behavior in software, the OS, memory, or the file system can easily produce such a file. For example, the HTTP logs provided by NASA are in a similar state.

Who can guarantee the correspondence between seconds and the original file [2]?

$ cat test_secure_broken | cut -c 1-15 | date -f- +%s > seconds

$ wc -l test_secure_broken
1078333 test_secure_broken

$ wc -l seconds
1078331 seconds    ## <==== Different number!

In many cases, this method risks wasting a lot of time and effort on file normalization. Besides, the idea is less generic because it cannot be applied to logs such as Apache's (like NASA's), where the datetime is NOT at the beginning of the line.
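
If you do go with this approach, a cheap safeguard is to compare line counts before merging. The following is only a sketch: safe_merge is a hypothetical helper name, the three-line sample is modelled on the broken excerpt above, and GNU date is assumed.

```shell
# Hypothetical helper: merge only when both inputs have the same line count.
safe_merge() {
  src=$1
  ts=$2
  if [ "$(wc -l < "$src")" -ne "$(wc -l < "$ts")" ]; then
    echo "line count mismatch: refusing to merge" >&2
    return 1
  fi
  cut -c 17- "$src" | paste -d ' ' "$ts" -
}

# Tiny sample with one broken line, modelled on the excerpt above.
log=$(mktemp)
seconds=$(mktemp)
cat > "$log" <<'EOF'
May 22 01:15:17 localhost sshd[27705]: ...
: Connection closed by 192.0.2.125 port 27258 [preauth]
May 22 01:15:17 localhost sshd[27707]: ...
EOF

# date reports the unparsable line on stderr and emits nothing for it.
cut -c 1-15 "$log" | date -f - +%s > "$seconds" 2>/dev/null || true

if safe_merge "$log" "$seconds"; then
  echo "merged"
else
  echo "input needs cleaning first"
fi
```

The check only tells you that something is wrong, not which line is broken, so the normalization work itself is still on you.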

Answer 3: sed, awk, perl, etc.

If you are familiar with the command line, you might think of converting it with the built-in features of an interpreted language such as awk, sed, perl, or ruby. It's probably the best answer so far, but it involves a lot of hard work.

First, let's use awk [3].

The mktime function (available in gawk) looks like the right tool. Given numbers in "YYYY MM DD HH MM SS" format, it returns the UNIX time.

$  awk 'BEGIN{s=mktime("2020 01 01 01 01 01");print s}'
1577840461

OK, it is simple enough so far. I searched on Google [4] and found that awk apparently cannot convert month names (Jan, Feb, ...) to numbers by itself. So I need a bit of code to convert them first.

I found a good sample on Stack Overflow.

echo "Feb" | awk '{printf "%02d\n",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$1)+2)/3}'
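
To see why the arithmetic works, here is the same trick spelled out step by step. month_number is a hypothetical wrapper name I made up for this illustration.

```shell
# Hypothetical helper wrapping the index() trick for illustration.
month_number() {
  echo "$1" | awk '{
    # index() is 1-based: Jan=1, Feb=4, Mar=7, ..., Dec=34.
    pos = index("JanFebMarAprMayJunJulAugSepOctNovDec", $1)
    # Adding 2 and dividing by 3 maps 1,4,7,...,34 to 1,2,3,...,12.
    printf "%02d\n", (pos + 2) / 3
  }'
}

month_number May   # prints 05
month_number Dec   # prints 12
month_number Foo   # prints 00, because index() returns 0 for no match
```

Note that an unknown month name silently becomes 00 rather than raising an error, so broken lines are not detected by this conversion.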

It looks like we can convert a month name to its number this way. This is the one-liner I finally came up with.

$ cat test_secure | awk -F'[ :]+' '{m=(index("JanFebMarAprMayJunJulAugSepOctNovDec",$1)+2)/3;s = mktime(2020" "m" "$2" "$3" "$4" "$5);$2="";$3="";$4="";$5="";$1=s;print }' | awk '{$1=$1;print}'
...
1590463166 localhost sshd[17872] Disconnected from 192.0.2.78 port 29864 [preauth]
1590463270 localhost sshd[17927] Invalid user amavis1 from 192.0.2.148 port 53364
...

Some characters are lost from the original file (all the colons are removed). But that is OK for me. The speed seems satisfactory: it took about 9 seconds to process a 100 MiB file. But it is a rather long "one-liner".

So, let's try it in perl. You may need an external module, but it's a little shorter.

$ cat test_secure | perl -MTime::Piece -anle 'my $t = Time::Piece->strptime("$F[0] $F[1] $F[2] 2020", "%b %d %H:%M:%S %Y");printf $t->epoch; print " @F[3..$#F]";'

These approaches have the same problem as Answer 2: they do not preserve the integrity of the file when a line is broken. You could use a regular expression to convert only the dates that have the proper format, but that means writing even longer code.

Anyway, can you do the above EASILY while logged in to the server over SSH? For example, would you write such code for every kind of log, not only /var/log/secure but also Apache's and others? This is no longer a one-liner but a script, and your SSH session will time out while you're staring at Stack Overflow.

Problems between Shell and UNIX commands

The question exposes the weakness of the Shell and UNIX commands.

Problem 1

Typical UNIX commands are connected using Shell's pipes.

  • The command takes the ENTIRE standard input and processes it.
  • The pipe passes the ENTIRE standard input to the command.

That's why Shell is sometimes called a "glue language". This idea is very powerful, but it comes at a cost: a command cannot "choose what to process". As a result, most commands cannot handle a file or input "partially".
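
A tiny illustration of the point, using tr as a stand-in for any command that cannot select its own input. The sample line and variable names are only for illustration.

```shell
line='May 26 03:19:26 localhost sshd[17872]: Disconnected'

# tr sees the whole line, so everything is upper-cased.
whole=$(printf '%s\n' "$line" | tr '[:lower:]' '[:upper:]')
echo "$whole"

# Restricting tr to the first 3 characters means splitting and re-joining by hand.
head3=$(printf '%s\n' "$line" | cut -c 1-3 | tr '[:lower:]' '[:upper:]')
rest=$(printf '%s\n' "$line" | cut -c 4-)
partial="$head3$rest"
echo "$partial"
```

The second half works for one line held in a variable, but it does not scale to a stream without temporary files or extra passes over the input.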

Problem 2

On the other hand, some commands can "choose what to process": sed, awk, perl, and so on. In other words, if the solution requires partial modification of the data, you have to use one of those interpreters.

Also, when using these commands, both "choosing what to process" and "processing the data" must be done inside the world that the command provides.

## Choose the lines between particular two lines (one includes AAA, another one includes BBB)
## ..and modify the chosen lines.
... | sed '/AAA/,/BBB/{ s/hoge/fuga/g }'

## Choose 1st, 3rd columns
## ..and modify them
... | awk '{gsub("A","B",$1);gsub("B","C",$3);print}'

## Choose the part matched with the regex
## ..and modify the part
... | perl -nle '/^(......)(...)/; $a=$1;$b=$2; $b =~ s/./@/g; print "$a$b"'

That forces you to write a "script" (in many cases it is no longer a "one-liner"), because in this world you have to follow the restrictions of the interpreter.

Problem 3

There is a way to keep the one-liner as simple as possible without those complications: call the command from within a while statement or the interpreter.

But this exposes another problem. If the data is large, it forks a huge number of processes, as in Answer 1, and performance can suffer badly, as mentioned above.

teip: Masking-tape for Shell

When I was thinking about the above issue, I thought...

"""
Isn't there a way to specify the area to be glued when gluing commands together?
Wouldn't the problems be simpler if we had something like "Masking tape" for example?
"""

As far as I could find, no command provided such a mechanism.
So I took a long break to learn Rust and created a command called teip (pronounced "téɪp", I assume).

Let me quickly show you the solution using the teip command.
The question can be solved as follows.

$ cat test_secure | teip -c 1-15 -- date -f- +%s
...
1590463166 localhost sshd[17872]: Disconnected from 192.0.2.78 port 29864 [preauth]
1590463270 localhost sshd[17927]: Invalid user amavis1 from 192.0.2.148 port 53364
...

It's quite simple, isn't it?

This command cuts out only the first 15 characters of each line and passes them to the date command. That is, the rest of the line is not visible to the date command (that part is covered by masking tape, so the glue has no effect there). Then, teip replaces those 15 characters with the output of the date command.
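
The effect on a single line can be imitated by hand with cut, purely as an illustration of the idea. This is not how teip is implemented, and the appended year and TZ=UTC are assumptions I added so the result is reproducible with GNU date.

```shell
export TZ=UTC
line='May 26 03:19:26 localhost sshd[17872]: Disconnected from 192.0.2.78 port 29864 [preauth]'

# The "unmasked" area: only this part is shown to date.
head15=$(printf '%s\n' "$line" | cut -c 1-15)
# The "masked" area: passed through untouched, including the colon.
rest=$(printf '%s\n' "$line" | cut -c 16-)

# The year is appended here (an assumption) so the result is reproducible.
converted=$(date -d "$head15 2020" +%s)
result="$converted$rest"
echo "$result"
```

Unlike the awk version in Answer 3, nothing outside the first 15 characters is rewritten, so the colons survive.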

The integrity of the data? Good question. When any inconsistency occurs, teip prints an error and terminates.
See the manual for details.

Besides, the example above is the simplest form. If you match the datetime with a regular expression like the one below, you won't get any data inconsistencies in this example.

$ cat /var/log/secure | teip -og '^[A-Z]\w\w +\d\d? \d\d:\d\d:\d\d' -- date -f- +%s
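
Before running a pattern like this on a large file, it can help to preview what it selects with grep. The sketch below uses my own POSIX ERE translation of the pattern (an assumption, since grep -E does not understand \d or \w) on a hypothetical three-line sample.

```shell
# ERE translation (an assumption) of the pattern given to teip -og above.
pattern='^[A-Z][[:alnum:]_]{2} +[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2}'

sample=$(mktemp)
cat > "$sample" <<'EOF'
May 22 01:15:17 localhost sshd[27705]: ...
: Connection closed by 192.0.2.125 port 27258 [preauth]
May  2 01:15:17 localhost sshd[27707]: ...
EOF

# Show exactly the text the pattern would select on each matching line.
selected=$(grep -oE "$pattern" "$sample")
echo "$selected"
# Count the lines that contain no match at all.
unmatched=$(grep -cvE "$pattern" "$sample")
echo "lines without a match: $unmatched"
```

As far as I understand it, text that the pattern does not select never reaches date, which is why the broken line causes no mismatch here.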

Are you worried about the speed? That is also a good question.
Let's measure it.

$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' ## Clear page cache
$ time cat test_secure | teip -og '^[A-Z]\w\w +\d\d? \d\d:\d\d:\d\d' -- date -f- +%s | pv >/dev/null
94.9MiB 0:00:08 [11.6MiB/s] [...]

real    0m8.185s
user    0m2.791s
sys     0m0.353s

$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time cat test_secure | awk -F'[ :]+' '{m=(index("JanFebMarAprMayJunJulAugSepOctNovDec",$1)+2)/3;s = mktime(2020" "m" "$2" "$3" "$4" "$5);$2="";$3="";$4="";$5="";$1=s;print }' | awk '{$1=$1;print}' | pv >/dev/null
93.3MiB 0:00:09 [9.44MiB/s] [...]

real    0m9.889s
user    0m9.614s
sys     0m2.452s

$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time cat test_secure | perl -MTime::Piece -anle 'my $t = Time::Piece->strptime("$F[0] $F[1] $F[2] 2020", "%b %d %H:%M:%S %Y");printf $t->epoch; print " @F[3..$#F]";' | pv > /dev/null
94.8MiB 0:00:13 [7.15MiB/s] [...]

real    0m13.290s
user    0m12.807s
sys     0m0.391s

  • teip + date : 8.185s
  • awk : 9.889s
  • perl : 13.290s

As you can see, teip is faster than the other examples in my environment. As you'll see later, teip performs very well.

What does teip technically do?

Please check README.md for more information.

How useful is it?

This command can be applied to any problem that requires "partial modification", such as modifying Apache logs, specific columns in a CSV or TSV file, or only the contents of <body> in an HTML file. Here are some examples introduced on GitHub.

  • Edit 4th and 6th columns in the CSV file
$ cat file.csv | teip -d, -f 4,6 -- sed 's/./@/g'
  • Percent-encode bare-minimum range of the file
$ cat file | teip -og '[^-a-zA-Z0-9@:%._\+~#=/]+' -- php -R 'echo urlencode($argn)."\n";'

Advantages of teip

teip allows another command to "choose what to process". This has two benefits.

First, any command such as grep, date, base64, iconv, fold, nkf, rev, tac, and so on can now be used for partial modification, without interpreters or while statements! This makes your daily work simpler.

The second is better performance. Let me show you the result of an interesting benchmark: a comparison of the time taken to replace approximately 761,000 IP addresses with a dummy string in a 100 MiB /var/log/secure file.

Benchmark result

Believe it or not, GNU sed/awk combined with teip perform better than the same commands run alone in a single thread: they are more than twice as fast. If you are interested, see Wiki > Benchmark.

The reason is that "choosing what to process" can be done in a separate thread, in parallel. As a result, "masking-taped" commands become faster because they only need to handle a limited amount of input. In this benchmark, sed became faster because its regular expression performed fewer backtracks.

With teip, you can easily get troublesome work done that you couldn't do before with traditional UNIX commands and the Shell environment 😄

Questions and pull requests are always welcome. Let's look after the environment together 🌎


  1. https://www.man7.org/linux/man-pages/man1/pv.1.html 

  2. The broken sample file is prepared here

  3. Pardon me, I am not familiar with it. 

  4. By the time I looked for this page, I had seen a number of samples in awk that used the date command. How do they do it when they're dealing with huge files? 

Discussion

A good tool adhering to the Unix philosophy.
Teip makes writing shell scripts much nicer and cleaner.
It is a tool that was missing in my life but I didn't know exactly what I was missing. It is always great when new powerful but simple tools are built that actually make peoples lives easier.

Thank you very much for the work and the article.