This article was authored by Mehvish Poshni, a member of Educative's technical content team.
For many of us, our first exposure to programming is through a general-purpose language like C, Python, or Java. AWK, on the other hand, was designed with the targeted goal of processing text-based data without having to write many lines of code. That's not to say that AWK is limited to this function alone; far from it. However, the effectiveness of AWK is largely due to the control it offers when writing quick one-liner command-line programs, as well as short scripts that serve an immediate need. Imagine being able to manipulate system logs, configuration files, and spreadsheet data from the command line in just a few keystrokes. Another reason why it's worthwhile learning AWK is that it comes pre-installed as the utility awk on Unix-like operating systems, and its inclusion in the Unix ecosystem makes it very convenient to use.
The letters A, W, and K in AWK stand for the last names of the individuals (Alfred Aho, Peter Weinberger, and Brian Kernighan) who designed the programming language in the late 1970s.
Note: This blog assumes a passing familiarity with a command-line shell (cat, echo, pipes, and redirection) and some prior programming exposure (concepts like comparison and logical operators, expressions, conditionals, and loops).
Input structure for an AWK program
An AWK program takes input either in the form of one or more text files, or as the standard input stream coming from the shell environment in which awk executes. By default, each line in the input stream is considered one record, and each record is split into fields (pieces of text) separated by one or more whitespace characters (spaces or tabs). This default behavior can be overridden easily.
The examples in this blog make use of an input text file named inputfile, whose contents are shown below. Each line lists a first name, last name, age, joining date, score, team, and a Yes/No flag:
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes
Candace Walsh 25 Apr-14-2023 43 Sales No
Grady Clements 40 Feb-15-2023 36 Sales Yes
Macaulay Roy 33 Jul-11-2022 63 Engineering Yes
Abraham Strickland 31 Aug-25-2022 93 Marketing Yes
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes
Note: The input file may not necessarily have the same number of fields on each line.
An unusual workflow
The manner in which an AWK program runs is unusual: behind the scenes, the code is executed repeatedly, once for each record in the input.
Whenever a record is read, the special built-in variables $1, $2, $3 (and so on) can be used for accessing the values in the first, second, third (and so on) fields of that record. The entire record can also be retrieved all at once using the built-in variable $0.
An AWK program
A basic AWK program consists of one or more pattern-action pairs in the following general form.
pattern { action }
- The pattern is an expression that evaluates to a value that's regarded as true or false.
Note: AWK does not have a boolean data type, but 0 and the empty string "" are regarded as false, and all other values as true.
- The action consists of one or more statements. If there are multiple statements within an action, they may be separated by either a semicolon character (;) or a newline.
When running an AWK program, each pattern is tested against every record of the input stream, one by one.
Whenever the pattern evaluates to true, the corresponding action is executed.
The entire program is enclosed within single quotes, and can be run from the command line using the awk utility:
awk 'pattern { action }' inputfile
Here, inputfile is the input to the program. More than one file can also be passed as input.
awk 'pattern { action }' file1 file2
Since we are running the program using the awk utility in the shell, output can be stored in a file using the redirection operator >.
awk 'pattern { action }' inputfile > outputfile
In the same vein, the input can also be taken using the pipe operator.
cat inputfile | awk 'pattern { action }'
Examples
1- In the following one-liner, we print the records for which the age (in the third column) is less than 30.
awk '$3 < 30 { print $0 }' inputfile
Output:
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes
Candace Walsh 25 Apr-14-2023 43 Sales No
2- See how none of the records get printed when 0 or the empty string "" is used as a pattern.
cat inputfile | awk '0 { print $0 }'
awk '"" { print $0 }' inputfile
Output: (nothing is printed)
3- The pattern in the following snippet is a non-empty string (from the second column in inputfile), which is considered true. So the action is executed for all records.
Observe, also, how we can concatenate different strings by placing them side by side.
awk '$2 { print $2 ", " $1 }' inputfile > outputfile
cat outputfile # To display the contents of the outputfile
Output:
Dominguez, Colton
Porter, Megan
Walsh, Candace
Clements, Grady
Roy, Macaulay
Strickland, Abraham
Higgins, Joelle
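The examples above each use a single pattern-action pair. As a minimal sketch of a program with two pairs, the snippet below pipes two made-up records (not taken from inputfile) into awk; the shell variable out is only there to capture the result. Since every pattern is tested against every record, a record can trigger zero, one, or several actions.

```shell
# Two made-up records: a name followed by an age.
# Both patterns are tested against each record in turn.
out=$(printf 'Ann 25\nBob 40\n' | awk '
$2 < 30 { print $1 " is under 30" }
$2 > 20 { print $1 " is over 20" }')
echo "$out"
```

Here, the record for Ann satisfies both patterns and produces two lines, while the record for Bob satisfies only the second pattern.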
Patterns and actions are optional
It isn't necessary to specify both pattern and { action }. Just one of them suffices:
- When pattern is not specified, the action is performed for all the records.
awk '{ print $1 "." $2 "@educative.io" }' inputfile
Output:
Colton.Dominguez@educative.io
Megan.Porter@educative.io
Candace.Walsh@educative.io
Grady.Clements@educative.io
Macaulay.Roy@educative.io
Abraham.Strickland@educative.io
Joelle.Higgins@educative.io
- When { action } is not specified, the default action is to print all the matched records.
awk '$6 == "Marketing" || $7 == "No"' inputfile
Output:
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Candace Walsh 25 Apr-14-2023 43 Sales No
Abraham Strickland 31 Aug-25-2022 93 Marketing Yes
The BEGIN and END patterns
There are other ways to specify a pattern than creating expressions using numbers, strings, arithmetic or logical operators.
- The pattern BEGIN matches the point before any input is read. So, its associated action is executed once at the start, before the first record is read. It makes sense to use it for tasks like initializing variables.
- The pattern END matches the point after all the input has been read, and its associated action is executed once at the end.
Think about how to add the scores listed in the fifth column of the input file.
awk 'BEGIN { sum = 0 }
{ sum += $5 }
END { print "Sum of scores is " sum }' inputfile
Output:
Sum of scores is 438
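An uninitialized AWK variable starts out as 0 in a numeric context, so the BEGIN rule in the program above is optional. A small variation, sketched here with made-up scores rather than inputfile, computes an average in the END rule by dividing the sum by NR, the built-in count of records read. The shell variable avg is just a hypothetical name used to capture the result.

```shell
# Made-up scores, one per line (not the article's inputfile).
# sum starts at 0 implicitly; NR in the END rule holds the
# total number of records that were read.
avg=$(printf '10\n20\n60\n' | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }')
echo "$avg"
```

For these three values, the sum is 90 and the average prints as 30.0.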
Regular expressions as patterns
Regular expressions (regex) are a symbolic way to represent a pattern and specify what the matching text should look like.
The syntax of regular expressions used in AWK is known as Extended Regular Expressions (ERE).
This syntax is also used by many other languages and Unix utilities, so it's very useful to know.
Using a regex
In AWK, a regex used as a pattern is written between two forward slashes. The simplest form of a regex is a plain sequence of characters. For example, the pattern /Feb/ matches all records containing the text Feb.
awk '/Feb/' inputfile
Output:
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Grady Clements 40 Feb-15-2023 36 Sales Yes
Instead of searching the entire record for a match against a regex, we can use the operator ~ to check if a regular expression matches a smaller portion of text, such as a single field. Similarly, the operator !~ is useful for checking that there is no match.
Usage: The regular expression must appear on the right of ~ or !~, and the text being searched must go on the left.
awk '$6 ~ /Sa/' inputfile # Sa present in the 2nd last column
echo " "
awk '$(6+1) !~ /Y/' inputfile # Absence of Y in the last column
Output:
Candace Walsh 25 Apr-14-2023 43 Sales No
Grady Clements 40 Feb-15-2023 36 Sales Yes
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Candace Walsh 25 Apr-14-2023 43 Sales No
Regex metacharacters
A regular expression may include some special characters called metacharacters, so called because they are not matched against the text literally. Instead, they are interpreted as rules for matching text. Here are some examples:
- The metacharacters [ and ] match any one of the characters enclosed within the brackets. For example, [AbC] means a single character: either A, b, or C. Expressions like these are called character classes.
- A range of characters can also be represented as a character class. For example:
  - [0-9] means a single numeric character from 0 to 9.
  - [a-zA-Z] means a single alphabetical character in upper or lower case.
- The metacharacters [^ ] specify a single character other than the ones appearing after the symbol ^ inside the character class. For example, [^bcd] means any character other than b, c, or d.
# score column contains an 8 followed by a digit (a score in the 80s for this data)
awk '$5 ~ /8[0-9]/ { print $1 " scored in the 80s" }' inputfile
echo " "
# 4th column contains 202 followed by a character other than 2 or 3
awk '$4 ~ /202[^23]/ { print "Joining year of " $1 " is neither 2022 nor 2023" }' inputfile
Output:
Megan scored in the 80s
Joelle scored in the 80s
Joining year of Colton is neither 2022 nor 2023
Joining year of Megan is neither 2022 nor 2023
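A character class matches one character at a time, so it cannot express a multi-digit numeric range directly: a class such as [20-25] is read as the characters 2, 0 through 2, and 5, not as the numbers 20 to 25. The following is a minimal sketch, using made-up scores, of two ways to test such a range: a plain numeric comparison, and a regex anchored with ^ and $ (which tie the match to the beginning and end of the text). The variable names are hypothetical.

```shell
# Made-up scores, one per line.
scores='19
22
25
81'

# Numeric comparison: the most direct way to test a range.
by_number=$(printf '%s\n' "$scores" | awk '$1 >= 20 && $1 <= 25')

# Regex alternative: a 2 followed by one digit from 0 to 5,
# anchored so that the whole field must match.
by_regex=$(printf '%s\n' "$scores" | awk '$1 ~ /^2[0-5]$/')

echo "$by_number"
echo "$by_regex"
```

Both commands select 22 and 25. The numeric comparison is usually easier to read and to adjust.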
The metacharacters $ and ^ (outside a character class) act as anchors. Relative to some other character X, they take on meaning in the following way:
- ^X means lines that start with X.
- X$ means lines that end with X.
awk '/^J/' inputfile # Lines that start with J
echo " "
awk '/o$/' inputfile # Lines that end with o
Output:
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Candace Walsh 25 Apr-14-2023 43 Sales No
- The metacharacter . means any single character.
- The metacharacter | means the text matched by the regex on its left or the one on its right. For example, ab|[cd] matches either ab, c, or d.
- The metacharacters () are used for grouping. For example, ^(Me|Ma) matches lines beginning with Me or Ma, whereas ^Me|Ma matches lines that begin with Me or contain Ma anywhere.
Note: The GNU implementation of AWK, known as GAWK, also supports additional features, including the use of the metacharacters () for capturing portions of matched text for later use.
awk '/(M..a)/' inputfile # Matches substrings of Megan and Macaulay
echo " "
awk '/(D|P)o/' inputfile # Matches substrings of Dominguez and Porter
Output:
Megan Porter 29 Dec-03-2021 81 Engineering Yes
Macaulay Roy 33 Jul-11-2022 63 Engineering Yes
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Megan Porter 29 Dec-03-2021 81 Engineering Yes
The metacharacters *, +, ?, and {n,m} are called quantifiers. They also take on meaning relative to their preceding character (or group), say X:
- The expression X* means zero or more occurrences of X.
- The expression X+ means one or more occurrences of X.
- The expression X? means zero or one occurrence of X.
- The expression X{n,m} means at least n and at most m occurrences of X. (Some awk implementations do not support this form, so it is not used in the examples below.)
awk '/i[g]+/' inputfile # Matches igg in Higgins (Joelle)
echo " "
awk '/oe?l/' inputfile # Joelle and Colton
Output:
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes
Colton Dominguez 28 Feb-20-2021 33 Marketing No
Joelle Higgins 42 Sep-23-2022 89 Engineering Yes
Note: To match a metacharacter literally, we need to use the escape character \. For example, /\*/ matches the character *.
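To make the effect of escaping concrete, here is a small sketch with two made-up lines. In the first command the dot is escaped and only matches a literal dot; in the second it is a metacharacter and matches any single character. The variable names are hypothetical.

```shell
# Made-up lines: only the first one contains a literal dot.
# \. matches a literal dot, so only v1.5 is selected.
escaped=$(printf 'v1.5\nv105\n' | awk '/1\.5/')

# An unescaped . matches any character, so both lines are selected.
unescaped=$(printf 'v1.5\nv105\n' | awk '/1.5/')

echo "$escaped"
echo "$unescaped"
```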
Data structure: Associative array
An associative array is the only data structure supported by AWK. It essentially consists of index and value pairs, where the index can be used for retrieving the corresponding value.
An associative array is created simply through an assignment statement that maps a value to an index. The syntax looks like this:
arr["ind"] = "val"
We can also add more elements to an array using assignment statements like the one above.
awk '{ arr[$1] = $3 }
END{ for (i in arr )
{
print i " " arr[i]
}
}' inputfile
Output:
Grady 40
Macaulay 33
Megan 29
Colton 28
Joelle 42
Candace 25
Abraham 31
Notice how we loop over the array arr using a for (i in arr) style loop. In each round, i is set to the index of an element in arr (and not to the element itself). The order in which the indices are visited is not specified, so the lines of output may appear in a different order on another system.
AWK also supports a C-style for loop (see exact syntax below), but it isn't suitable for traversing an associative array because the keys of an associative array may not fall in the required range of numbers.
for(i = 1; i < 10; i++)
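A C-style loop is a natural fit when the indices really are consecutive integers, for example the field numbers 1 through NF of a record (NF is the built-in count of fields in the current record). The following is a small sketch that adds up the fields of one made-up line of numbers; the variable total is a hypothetical name.

```shell
# One made-up record with three numeric fields.
# The loop visits fields $1, $2, ..., $NF in order.
total=$(echo '4 8 15' | awk '{ for (i = 1; i <= NF; i++) sum += $i; print sum }')
echo "$total"
```

For this input the loop runs three times and prints 27.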
Here's another example where the number of individuals in each team is counted.
awk '!arr[$6] { arr[$6] = 0 }
{ arr[$6] += 1 }
END {
for (i in arr)
{
print i " : " arr[i]
}
}' inputfile
Output:
Marketing : 2
Sales : 2
Engineering : 3
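Because an array element that has not been assigned yet evaluates to 0 in a numeric context, the explicit initialization in the first rule above is optional. Here is a shorter sketch of the same counting idea, using made-up team names instead of inputfile. The output is piped through sort because the order of a for (i in arr) loop is unspecified.

```shell
# Made-up team names, one per line.
# count[$1]++ creates the element on first use, starting from 0.
counts=$(printf 'Sales\nMarketing\nSales\n' |
  awk '{ count[$1]++ } END { for (team in count) print team ": " count[team] }' |
  sort)
echo "$counts"
```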
Built-in variables and functions
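Other than $0, and $1, $2, $3, etc., there are other built-in variables that are easy to remember and easy to use. Some of these are:
- NR: the number of records read so far.
- NF: the number of fields in the current record.
- FS: the input field separator (whitespace by default).
- OFS: the output field separator (a single space by default).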
AWK supports many predefined mathematical functions (like log, sqrt, exp, sin) as well as functions for working with strings (such as substr, length, toupper).
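As a quick illustration of a few of these functions, the sketch below applies substr, toupper, and length to a made-up name, and calls sqrt from a BEGIN rule, which runs without reading any input. The variable names are hypothetical.

```shell
# substr($1, 1, 3) takes the first three characters of the first field,
# toupper converts them to upper case, and length counts characters.
out=$(echo 'Megan Porter' | awk '{ print toupper(substr($1, 1, 3)), length($2) }')

# A program with only a BEGIN rule needs no input file.
root=$(awk 'BEGIN { print sqrt(16) }')

echo "$out"
echo "$root"
```

The first command prints MEG 6 and the second prints 4.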
Let's see a few more examples before we call it a day.
Example 1: Overriding default values
We can use any string as the field separator in the output by changing the value of the variable OFS. Its default value (a single space) can be overridden as shown below.
Also note how, in the following example, we print the record number for each row using the variable NR (for number of records).
awk '{ print NR, $1, $5 }' OFS=, inputfile
Output:
1,Colton,33
2,Megan,81
3,Candace,43
4,Grady,36
5,Macaulay,63
6,Abraham,93
7,Joelle,89
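The input side can be overridden in the same spirit. The sketch below, which uses made-up comma-separated lines rather than inputfile, sets both FS (the input field separator) and OFS inside a BEGIN rule, so that they take effect before the first record is read. The variable out is a hypothetical name.

```shell
# Made-up comma-separated input.
# FS controls how each record is split into fields; OFS is placed
# between the comma-separated arguments of print.
out=$(printf 'Colton,Dominguez,28\nMegan,Porter,29\n' |
  awk 'BEGIN { FS = ","; OFS = " | " } { print $2, $3 }')
echo "$out"
```

The same input separator could also be given on the command line with the -F option, as in awk -F, '...'.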
Example 2: Accessing fields using rvalues
If a variable varname is assigned an integer k, then the syntax $varname can be used for accessing the field in the kth column.
For example, since NF stores the number of fields in the current record, we can access the last field in that row using the syntax $NF.
awk '{ print NR, $(NF-1), $NF }' inputfile
Output:
1 Marketing No
2 Engineering Yes
3 Sales No
4 Sales Yes
5 Engineering Yes
6 Marketing Yes
7 Engineering Yes
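The same idea works with a variable of our own. In the sketch below, the -v option assigns a value to the AWK variable col (a hypothetical name) before the program starts, and $col then selects that column. The input lines are made up.

```shell
# col is set from the command line; $col is the field in that column.
out=$(printf 'a b c\nd e f\n' | awk -v col=2 '{ print $col }')
echo "$out"
```

Changing col=2 to col=3 would select the third column without touching the program itself.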
Example 3: Formatting output
The C-style printf is used for showing the output formatted in a tabular form. The format specifier %-20s pads the string to a width of 20 characters and aligns it to the left.
awk 'BEGIN { printf "%-20s | %-5s\n", "Full Name", "Age" }
{ printf "%-20s | %-5d\n", $1 " " $2, $3 }' inputfile
Output:
Full Name            | Age
Colton Dominguez     | 28
Megan Porter         | 29
Candace Walsh        | 25
Grady Clements       | 40
Macaulay Roy         | 33
Abraham Strickland   | 31
Joelle Higgins       | 42
The next two examples use built-in functions.
Example 4: Splitting a string
The built-in function split(str, arr, ch) splits the string str around the character ch and stores the resulting substrings in the array arr. We use this function below to extract the month and year from each individual's joining date.
awk '{ split($4, arr, "-");
printf "%-10s | %-10s\n", $1 , arr[1] " " arr[3] }' inputfile
Output:
Colton     | Feb 2021
Megan      | Dec 2021
Candace    | Apr 2023
Grady      | Feb 2023
Macaulay   | Jul 2022
Abraham    | Aug 2022
Joelle     | Sep 2022
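The split function also returns the number of pieces it produced, which is handy when the number of parts is not known in advance. A minimal sketch with one made-up date follows; the names n, parts, and out are hypothetical.

```shell
# n receives the number of pieces; parts[1] through parts[n] hold them.
out=$(echo 'Feb-20-2021' | awk '{ n = split($1, parts, "-"); print n, parts[n] }')
echo "$out"
```

Here the date splits into three pieces, and parts[n] is the last one, so the command prints 3 2021.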
Example 5: Find and replace
The function gsub(regex, subst, str) looks for all matches of the regular expression regex in the string str, and replaces each of them with the string subst. The g in gsub is for "global". There's also a related function sub (for replacing only the first occurrence).
awk '{ gsub(/[0-9]+/,"X",$0); print }' inputfile
Output:
Colton Dominguez X Feb-X-X X Marketing No
Megan Porter X Dec-X-X X Engineering Yes
Candace Walsh X Apr-X-X X Sales No
Grady Clements X Feb-X-X X Sales Yes
Macaulay Roy X Jul-X-X X Engineering Yes
Abraham Strickland X Aug-X-X X Marketing Yes
Joelle Higgins X Sep-X-X X Engineering Yes
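To see the difference between the two functions side by side, here is a small sketch on one made-up date. When the third argument is left out, both functions operate on the whole record $0. The variable names are hypothetical.

```shell
# sub replaces only the first run of digits in the record.
first=$(echo 'Feb-20-2021' | awk '{ sub(/[0-9]+/, "X"); print }')

# gsub replaces every run of digits in the record.
all=$(echo 'Feb-20-2021' | awk '{ gsub(/[0-9]+/, "X"); print }')

echo "$first"
echo "$all"
```

The first command prints Feb-X-2021, and the second prints Feb-X-X.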
Example 6: Bigger programs
In AWK programs, we can use many constructs similar to the ones available in other languages, like if, else, while, and more (GAWK additionally supports switch). One can also define a function in an AWK program, and then call it from within an action.
awk 'BEGIN { max = -1; name = "" }
{
if (max < $5)
{
max = $5
name = $1
}
}
END { print name ": " getMaxScore() }
function getMaxScore() { return max }' inputfile
Output:
Abraham: 93
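Functions can also take parameters. The sketch below defines a hypothetical function grade that receives a score and returns a label; the pass mark of 50 and the two input records are made up for illustration.

```shell
# grade(score) is a user-defined function with one parameter.
# Adding 0 forces the argument to be treated as a number.
out=$(printf 'Ann 33\nBob 81\n' | awk '
function grade(score) { return (score + 0 >= 50) ? "pass" : "fail" }
{ print $1 ": " grade($2) }')
echo "$out"
```

Passing the value in as a parameter, rather than reading a global variable inside the function, keeps the function reusable for any field.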
A final word
It's worth noting that AWK is a Turing-complete language, which means that it can be used for implementing any algorithm. That being said, AWK is primarily useful for tasks like data filtering and manipulation.
This blog is far from being a complete tutorial, but we hope that it is effective in removing entry-level barriers for a faster and happier learning experience.
Happy learning!