Igor Irianto

Introduction to Awk

Awk is a text-processing language for reading and manipulating data. If you need to quickly process text patterns inside a file, especially one organized in rows and columns, awk might be the tool for the job.

Let's see some examples.

This command kills the process listening on localhost:3000 (don't worry about understanding it yet; I will go over it later):

lsof -i:3000 | awk '/LISTEN/ {print $2}' | xargs kill -9

Let's do something simpler:

awk '{print}' server.rb

This displays the file content, similar to cat server.rb. Awk also makes it easy to add a filter. If you want to display only the lines that contain the word "run", you can do:

awk '/run/ {print}' server.rb

Very powerful. I am not even scratching the surface of the awk iceberg.

Basic Syntax

Awk's basic syntax is:

awk 'pattern {action}' file

One important action is print. Let's do some examples with print; I will go over pattern later.

For this, let's create a file called awk.ward (pun intended):

echo 'Awk. Or do not awk. There is no try' > awk.ward

To print the content of the file, we can do:

awk '{print $0}' awk.ward

Let's try another print variation. This time we will hard-code the output:

awk '{print "Hello awk!"}' awk.ward

This prints "Hello awk!" regardless of what the file content is.
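One detail worth seeing in action: the action runs once per input line, so a hard-coded print repeats for every line. A minimal sketch, using a throwaway three-line input generated with printf (an assumption for illustration, not one of the files in this post):

```shell
# Three input lines produce three identical output lines,
# because the action runs once per line of input.
out=$(printf 'one\ntwo\nthree\n' | awk '{print "Hello awk!"}')
echo "$out"
```

With the one-line awk.ward file, the same action prints the greeting only once.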

Fields

Earlier we saw:

awk '{print $0}' awk.ward

You may wonder what $0 is. In awk, $0 represents the whole record, which is usually the entire line. A bare print does the same thing ({print $0} is the same as {print}).

In addition, awk splits each line into "fields". By default, fields are delimited by spaces and tabs. Let's check out the fields:

awk '{print $0}' awk.ward
awk '{print $1}' awk.ward
awk '{print $2}' awk.ward
awk '{print $9}' awk.ward

My awk.ward file contains 1 line and 9 fields (each separated by a space). If you ask awk to print a field beyond the last one it captured (like field 10), it returns an empty string:

awk '{print $10}' awk.ward
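To check how many fields awk found rather than counting by hand, the built-in variable NF (number of fields) may help. NF is not covered elsewhere in this post, so treat this as a small aside; the sketch recreates awk.ward so it runs on its own:

```shell
# Recreate the sample file from earlier so this snippet is self-contained.
echo 'Awk. Or do not awk. There is no try' > awk.ward

# NF holds the number of fields in the current record.
count=$(awk '{print NF}' awk.ward)
echo "$count"

# $NF is therefore the last field, whatever its position.
last=$(awk '{print $NF}' awk.ward)
echo "$last"
```

Here NF should come out as 9, matching the count above, and $NF as the final word of the line.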

You can change the delimiter with -F. In this case, we want to capture each field separated by ., not space. To tell awk to separate it with ., we use -F.:

awk -F. '{print $1}' awk.ward
awk -F. '{print $2}' awk.ward
awk -F. '{print $3}' awk.ward

You can also print multiple fields at once:

awk -F. '{print $2, $3, $1}' awk.ward 
## Or do not awk  There is no try Awk
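The comma in print matters. As a quick sketch (again recreating awk.ward so it runs standalone, and using the default space delimiter): with a comma, awk joins the values with the output field separator, which is a single space by default; without a comma, it concatenates them directly.

```shell
# Recreate the sample file from earlier.
echo 'Awk. Or do not awk. There is no try' > awk.ward

# With a comma, print joins the values with a space.
spaced=$(awk '{print $1, $2}' awk.ward)

# Without a comma, the values are glued together.
joined=$(awk '{print $1 $2}' awk.ward)

echo "$spaced"
echo "$joined"
```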

Pattern matching

Recall our basic awk syntax:

awk 'pattern {action}' file

Let's talk about pattern now. A pattern is a regular expression between slashes, and awk uses extended regular expression (ERE) syntax. For example, to match lines containing one or more letters:

awk '/[A-Za-z]+/ {print "I have string"}' awk.ward

To match an integer (this won't display anything because there is no integer inside awk.ward):

awk '/[0-9]+/ {print "I have integer"}' awk.ward

Suppose we create a new file, testFile.txt, containing:

1. This is first line
2. This is second line
3. This is third line
This is not part of the list

If we run awk '/[0-9]+/ {print}' testFile.txt, we get:

1. This is first line
2. This is second line
3. This is third line

Our command works as expected. It omits "This is not part of the list" because that line does not contain any digits (/[0-9]+/).
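Note that /[0-9]+/ matches a digit anywhere in the line, not only at the start. A small sketch of the difference, assuming a made-up extra line ("Route 66 is not a list item") that is not in the original testFile.txt; anchoring the pattern with ^ restricts the match to numbered items:

```shell
# Sample input: the four lines from testFile.txt plus one hypothetical
# line that contains a number but is not a list item.
input='1. This is first line
2. This is second line
3. This is third line
This is not part of the list
Route 66 is not a list item'

# Unanchored: matches any line that contains a digit somewhere.
loose=$(printf '%s\n' "$input" | awk '/[0-9]+/ {print}')

# Anchored: matches only lines that start with digits followed by a dot.
strict=$(printf '%s\n' "$input" | awk '/^[0-9]+\./ {print}')

echo "$loose"
echo "---"
echo "$strict"
```

The unanchored pattern also picks up the "Route 66" line, while the anchored one keeps only the three numbered items.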

Executing awk script from file

When our script grows too big, we can move it into a script file and have awk run it from there.

Awk accepts -f to execute a script file. Let's create one and call it awk.script (you can name it anything):

## awk.script
/[0-9]+/ { print "I have integer" } 
/[A-Za-z]+/ { print "I have string" } 

Then run it against our awk.ward file: awk -f awk.script awk.ward

You'll see "I have string". That is expected, because awk.ward does not contain an integer.

What do you think will print if we run it against our testFile.txt?

awk -f awk.script testFile.txt
I have integer
I have string
I have integer
I have string
I have integer
I have string
I have string

It returns what we expect. The first 3 lines contain both an integer and a string, so awk prints two lines for each of them. The last line does not contain an integer, so awk only prints the string match output.
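If you want to reproduce this without creating the files by hand, one possible approach is to write them with here-documents. This is only a convenience sketch; the file names and contents are the ones used above:

```shell
# Write the two-rule script to a file.
cat > awk.script <<'EOF'
/[0-9]+/ { print "I have integer" }
/[A-Za-z]+/ { print "I have string" }
EOF

# Write the four-line sample input.
cat > testFile.txt <<'EOF'
1. This is first line
2. This is second line
3. This is third line
This is not part of the list
EOF

# Each input line is tested against every rule, top to bottom.
result=$(awk -f awk.script testFile.txt)
echo "$result"
```

Because the rules are checked in the order they appear in the script, the integer message always comes before the string message for a given line.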

Chaining awk

In real life, I don't really use awk by itself that often. More often, I combine it with other commands.

Let's take the command from earlier and break it down. Btw, if you are coding along, I have a server running on localhost:3000. Fire up a local server to see that awk actually kills it.

lsof -i:3000 | awk '/LISTEN/ {print $2}' | xargs kill -9

Let's walk through each step. lsof -i:3000 gives:

COMMAND   PID  USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
node    48523 iggy   27u  IPv4 0xe25443d27b90583f      0t0  TCP localhost:hbci (LISTEN)

lsof -i:3000 | awk '/LISTEN/ {print}' displays only the row with "LISTEN":

node    48523 iggy   27u  IPv4 0xe25443d27b90583f      0t0  TCP localhost:hbci (LISTEN)

Now we need to target the 2nd "field", because that's where our PID is. Modify the action to print only that field (lsof -i:3000 | awk '/LISTEN/ {print $2}'). This returns our PID:

48523

When we add xargs kill -9, xargs passes the PID to kill -9, which terminates that process. We need xargs here because kill expects the PID as a command-line argument rather than on standard input, and xargs turns piped input into arguments. For more explanation, this SO post explains it well.
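Because kill -9 is destructive, it can help to preview the command before running it. One way, sketched below, is to put echo in front of kill so xargs prints the command instead of executing it. The lsof output here is the sample from above pasted into a variable, not a live call, so no real process is touched:

```shell
# Sample lsof output copied from above (no real process is involved).
sample='COMMAND   PID  USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
node    48523 iggy   27u  IPv4 0xe25443d27b90583f      0t0  TCP localhost:hbci (LISTEN)'

# xargs appends the PID from stdin as an argument; echo shows the final command.
preview=$(printf '%s\n' "$sample" | awk '/LISTEN/ {print $2}' | xargs echo kill -9)
echo "$preview"
```

Once the previewed command looks right, dropping the echo runs it for real.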

Begin, middle, end

An awk script consists of 3 parts: beginning, middle, and end. The beginning is performed once before processing any input. The middle is our main loop; everything that we've done up to this point was done in the main loop. Most things in awk are done in this middle/main loop. The end is processed once, after the main loop has finished.

  • BEGIN { # beginning script }
  • { # main input loop script }
  • END { # end script }

Suppose we have a file hello.txt with content:

Hello1
Hello2
Hello3
Hello4
Hello5
Hello6
Hello7
Hello8
Hello9
Hello10

And we run this:

awk 'BEGIN {print "BEGIN"} {print} END {print "END"}' hello.txt

We should expect 12 lines: 1 from BEGIN, 10 from main loop, and 1 from END. Our actual stdout:

BEGIN
Hello1
Hello2
Hello3
Hello4
Hello5
Hello6
Hello7
Hello8
Hello9
Hello10
END

Exactly what is expected.
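A common use of END is to report something once all input has been read. As a sketch, the built-in variable NR (the number of records read so far; not covered elsewhere in this post) can be printed in END to count lines. The input is generated with printf to stand in for hello.txt:

```shell
# Generate ten lines, Hello1 through Hello10, in place of hello.txt.
lines=$(printf 'Hello%s\n' 1 2 3 4 5 6 7 8 9 10)

# BEGIN runs before any input is read, END runs after the last record.
# In END, NR holds the total number of records read.
summary=$(printf '%s\n' "$lines" | awk 'BEGIN {print "start"} END {print "total lines:", NR}')
echo "$summary"
```

There is no main-loop action here, so nothing is printed per line; awk still reads all the input before running END.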

Field Separator

Recall that we can redefine the delimiter/field separator with -F. We can also redefine it inside our script with the built-in variable FS (Field Separator). The convention is to define it inside BEGIN, right before the file is read and processed.

For example, greetings.txt contains this text:

Hello, how are you, sire?

When we inspect the fields, they are separated by spaces.

awk '{print $1}' greetings.txt # Hello,
awk '{print $2}' greetings.txt # how
awk '{print $3}' greetings.txt # are
## ... and so on

We want to separate them by comma. Here is how you can redefine the separator:

awk 'BEGIN {FS = "," } {print $2}' greetings.txt # how are you
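With FS set to a plain comma, the space after each comma stays attached to the following field. Since an FS longer than one character is treated as a regular expression, one possible fix is to include the optional spaces in the separator. A sketch, recreating greetings.txt so it runs standalone:

```shell
# Recreate the sample file from this section.
echo 'Hello, how are you, sire?' > greetings.txt

# Plain comma: the second field keeps its leading space.
plain=$(awk 'BEGIN {FS = ","} {print $2}' greetings.txt)

# Comma followed by any number of spaces: the separator swallows the space.
trimmed=$(awk 'BEGIN {FS = ", *"} {print $2}' greetings.txt)

# Brackets make the leading space visible.
printf '[%s]\n' "$plain" "$trimmed"
```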

Record Separator

By now, you can tell that awk performs operations line by line. In awk, each line is a record. Each record contains multiple "fields", separated by tabs/spaces (which we can change with -F or FS). What if we need to read chunks of multiple lines?

What if our data looks like users.txt below?

Iggy
Programmer
123-123-1234

Yoda
Jedi Master
111-222-3333

We need to make the lines from "Iggy" to "123-123-1234" one record, and the lines from "Yoda" to "111-222-3333" another record. How do we tell awk to chunk our data this way?

Luckily, awk has a "Record Separator" (RS) for this. As you can guess, the default record separator is a newline (\n). Let's change that:

awk 'BEGIN {FS="\n"; RS=""} {print "Name:", $1; print "Rank:", $2; print ""}' users.txt

This returns:

Name: Iggy
Rank: Programmer

Name: Yoda
Rank: Jedi Master

Which is exactly what we expected. Now every $1 contains a name, $2 a rank/title, and $3 a phone number.

How did it work?

  • We changed our Field Separator (FS) from the space/tab default to a newline (\n). Now a newline marks a new field instead of a new record.
  • We changed our record separator from the newline default to "".

You may ask, how does setting the record separator to "" make the chunking above work? That doesn't seem to make sense. Shouldn't we use RS = "\n\n+" for two or more newlines?

When awk sees RS set to the empty string (""), it treats the input as records separated by one or more blank lines. Apparently records separated by blank lines are common enough that awk has RS="" as a special case.

In other words, each record is now separated by a blank line. The next record starts after the blank line.

This is
a record

This is another
record
separated by blank line

This is yet another record

For more information about this weird behavior, check out this link.
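To see this blank-line mode at work on all three fields, here is a sketch that recreates users.txt and prints one summary line per record. NR (the current record number) is a built-in variable not covered above, and the output format is just an illustration:

```shell
# Recreate users.txt: two records separated by a blank line.
cat > users.txt <<'EOF'
Iggy
Programmer
123-123-1234

Yoda
Jedi Master
111-222-3333
EOF

# RS="" splits records on blank lines; FS="\n" makes each line a field.
report=$(awk 'BEGIN {FS = "\n"; RS = ""} {print NR ": " $1 " (" $2 ") " $3}' users.txt)
echo "$report"
```

Each multi-line chunk is collapsed into a single numbered line, which shows that awk saw two records with three fields each.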

Conclusion

I think this is a good place to end. There are still many more features I didn't get to cover here: variables, conditionals, functions, etc. I will leave those for you to explore.

Can you do what awk does with a scripting language like Python or Ruby?
Definitely. But if you need something on the fly, awk might be a better choice. Plus, it is included in most Unix-like operating systems, so you don't need to install anything.

Do you need to know awk to be a good developer?
Definitely not. I know many great developers who don't know awk. But knowing a little awk can be very helpful. Plus, it looks really cool.

Thanks for reading. Happy coding!

Resources

Top comments (1)

E.R. Nurwijayadi

Good article. Thank you for posting.

Awk can also be used to solve data structure challenges, such as flattening an array or getting its unique values.

To help more beginners, I have made a working example of awk with source code on GitHub.

🕷 epsi.bitbucket.io//lambda/2021/02/...

First, the data structure, in comma-separated text form.

Awk: Data Structure

Then the awk script to flatten the array.

Awk: Flatten

And finally get the unique array:

Awk: Unique

I hope this helps others who are looking for more example cases.

🙏🏽

Thank you for posting this general introduction.