Text processing in the shell

By Balthazar Rouberol. Originally published at blog.balthazar-rouberol.com.

Originally posted on my blog.


One of the things that makes the shell an invaluable tool is the amount of available text processing commands, and the ability to easily pipe them into each other to build complex text processing workflows. These commands can make it trivial to perform text and data analysis, convert data between different formats, filter lines, etc.

When working with text data, the philosophy is to break any complex problem you have into a set of smaller ones, and to solve each of them with a specialized tool.

Make each program do one thing well.

The examples in this chapter might seem a little contrived at first, but this is also by design. Each of these tools was designed to solve one small problem. They become extremely powerful when combined, however.

We will go over some of the most common and useful text processing commands the shell has to offer, and will demonstrate real-life workflows piping them together. I suggest you take a look at the man pages of these commands to see the full breadth of options at your disposal.

The example CSV (comma-separated values) file is available online. Feel free to download it yourself to test these commands.


As seen in the previous chapter, cat is used to concatenate a list of one or more files and display their content on screen.

$ cat Documents/readme
Thanks again for reading this book!
I hope you are following so far!
$ cat Documents/computers
Computers are not intelligent
They're just fast at making dumb things.
$ cat Documents/readme Documents/computers
Thanks again for reading this book!
I hope you are following so far!
Computers are not intelligent
They're just fast at making dumb things.


head prints the first n lines in a file. It can be very useful to peek into a file of unknown structure and format without burying your shell under a wall of text.

$ head -n 2 metadata.csv
mysql.galera.wsrep_cluster_size,gauge,,node,,The current number of nodes in the Galera cluster.,0,mysql,galera cluster size

If -n is unspecified, head will print the first 10 lines in its argument file or input stream.


tail is head’s counterpart. It prints the last n lines in a file.

$ tail -n 1 metadata.csv
mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries

If you want to print all lines in a file starting from the nth line (included), you can use the -n +n argument.

$ tail -n +42 metadata.csv
mysql.replication.slaves_connected,gauge,,,,Number of slaves connected to a replication master.,0,mysql,slaves connected
mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries

Our file has 43 lines, so tail -n +42 only prints the 42nd and 43rd lines in our file.

If -n is unspecified, tail will print the last 10 lines in its argument file or input stream.
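
Combining head and tail gives a simple way to extract a range of lines from the middle of a file. The sketch below does not use metadata.csv: it generates a hypothetical five-line sample file with printf, so the file name and content are assumptions made for illustration.

```shell
# Build a small sample file (hypothetical data: "line 1" to "line 5").
sample=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 > "$sample"

# Keep the first 4 lines, then the last 2 of those: lines 3 and 4.
range=$(head -n 4 "$sample" | tail -n 2)
echo "$range"

# Same range the other way around: start at line 3, keep 2 lines.
range_alt=$(tail -n +3 "$sample" | head -n 2)
echo "$range_alt"

rm -f "$sample"
```

Both pipelines print the same two lines. The second form can be preferable on large files, as head stops reading once it has printed enough lines.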

tail -f (or tail --follow) displays the last lines in a file, then prints each new line as it is written to the file. It is very useful to watch real-time activity in a log file, such as a web server log.


wc (for word count) prints either the number of bytes (when using -c), words (when using -w) or lines (when using -l) in its argument files or input stream. Use -m to count characters instead of bytes.

$ wc -l metadata.csv
43 metadata.csv
$ wc -w metadata.csv
405 metadata.csv
$ wc -c metadata.csv
5094 metadata.csv

By default, wc prints all of the above.

$ wc metadata.csv
43 405 5094 metadata.csv

Only the count will be printed out if the text data is piped in or redirected into stdin.

$ cat metadata.csv | wc
43 405 5094
$ cat metadata.csv | wc -l
43
$ wc -w < metadata.csv
405


grep is the Swiss Army knife of line filtering. It allows you to filter lines matching a given pattern.

For example, we can use grep to find all occurrences of the word mutex in our metadata.csv file.

$ grep mutex metadata.csv
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
mysql.innodb.mutex_spin_rounds,gauge,,event,second,The rate of mutex spin rounds.,0,mysql,mutex spin rounds
mysql.innodb.mutex_spin_waits,gauge,,event,second,The rate of mutex spin waits.,0,mysql,mutex spin waits

grep can either filter files passed as arguments, or a stream of text passed to its stdin. We can thus chain multiple grep commands to further filter our text. In the next example, we filter lines in our metadata.csv file that contain both the mutex and OS words.

$ grep mutex metadata.csv | grep OS
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits

Let’s go over some of the options you can pass to grep and their associated behavior.

grep -v performs inverted matching: it only keeps the lines that do not match the argument pattern.

$ grep -v gauge metadata.csv

grep -i performs case-insensitive matching. In the next example, grep -i os matches both OS and os.

$ grep -i os metadata.csv
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
mysql.innodb.os_log_fsyncs,gauge,,write,second,The rate of fsync writes to the log file.,0,mysql,log fsyncs

grep -l only lists files containing a match.

$ grep -l mysql metadata.csv

grep -c counts the number of lines matching a pattern.

$ grep -c select metadata.csv
3

grep -r recursively searches files in the given directory and all subdirectories below it.

$ grep -r are ~/Documents
/home/br/Documents/computers:Computers are not intelligent
/home/br/Documents/readme:I hope you are following so far!

grep -w only matches whole words.

$ grep follow ~/Documents/readme
I hope you are following so far!
$ grep -w follow ~/Documents/readme
$
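
These options can be combined in a single call. The following sketch, which runs on a hypothetical four-line sample file rather than on metadata.csv, counts the lines containing os as a whole word, regardless of case, and compares that to the count obtained without -w.

```shell
# Hypothetical sample data, written to a temporary file.
words=$(mktemp)
printf '%s\n' 'The OS is fast' 'os-level mutex' 'almost done' 'cosmos' > "$words"

# -i ignores case, -w only matches whole words, -c counts matching lines.
whole_word_count=$(grep -i -w -c os "$words")
echo "$whole_word_count"

# Without -w, "almost" and "cosmos" match as well.
substring_count=$(grep -i -c os "$words")
echo "$substring_count"

rm -f "$words"
```

Note that os-level still counts as a whole-word match, because - is not a word character.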


cut cuts out a portion of a file (or, as always, its input stream). cut works by defining a field delimiter (what separates two columns) with the -d option, and what column(s) should be extracted, with the -f option.

For example, the following command extracts the first column of the last 5 lines of our CSV file.

$ tail -n 5 metadata.csv | cut -d , -f 1

As we are dealing with a CSV file, we can extract each column by cutting over the , character, and extract the first column with -f 1.

We could also select both the first and second columns by using the -f 1,2 option.

$ tail -n 5 metadata.csv | cut -d , -f 1,2
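
The -f option also accepts ranges of columns. The sketch below runs on a single shortened row modeled on metadata.csv (the row variable is an assumption made for illustration) and extracts a closed range, then an open-ended one.

```shell
# A shortened, hypothetical row with 5 comma-separated columns.
row='mysql.performance.queries,gauge,,query,second'

# -f 1-2 selects columns 1 through 2.
first_two=$(echo "$row" | cut -d , -f 1-2)
echo "$first_two"

# -f 4- selects column 4 and every column after it.
from_fourth=$(echo "$row" | cut -d , -f 4-)
echo "$from_fourth"
```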


paste can merge together two different files into one multi-column file.

$ cat ingredients
eggs
milk
butter
tomatoes
$ cat prices
1$
1.99$
1.50$
2$/kg
$ paste ingredients prices
eggs 1$
milk 1.99$
butter 1.50$
tomatoes 2$/kg

By default, paste uses a tab delimiter, but you can change that using the -d option.

$ paste -d: ingredients prices
eggs:1$
milk:1.99$
butter:1.50$
tomatoes:2$/kg

Another common use of paste is to join all lines within a stream or a file using a given delimiter, using a combination of the -s and -d arguments.

$ paste -s -d, ingredients
eggs,milk,butter,tomatoes

If - is specified as an input file, stdin will be read instead.

$ cat ingredients | paste -s -d, -
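
One handy application of paste -s is to turn a column of numbers into an arithmetic expression. The sketch below is only an illustration under assumed data (the numbers 1 to 5): it joins them with + and lets the shell evaluate the result.

```shell
# Hypothetical input: one number per line.
numbers=$(printf '%s\n' 1 2 3 4 5)

# Join all lines with "+", reading from stdin thanks to the "-" argument.
expression=$(echo "$numbers" | paste -s -d+ -)
echo "$expression"

# Let the shell evaluate the resulting expression.
total=$(( $expression ))
echo "$total"
```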


sort, well, sorts argument files or input.

$ cat ingredients
eggs
milk
butter
tomatoes
$ sort ingredients
butter
eggs
milk
tomatoes

sort -r performs a reverse sort.

$ sort -r ingredients
tomatoes
milk
eggs
butter

sort -n performs a numerical sort, by sorting lines by their arithmetic value.

$ cat numbers
$ sort numbers
$ sort -n numbers
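
sort can also sort on a given column, using -t to set the field separator and -k to select the sort key. These two options are not covered above; the sketch below shows them on hypothetical ingredient,price rows, with LC_ALL=C set to keep the ordering predictable across systems.

```shell
# Hypothetical CSV data: ingredient,price
prices=$(printf '%s\n' 'milk,1.99' 'eggs,1' 'butter,1.50' 'tomatoes,2')

# -t, splits fields on commas, -k2,2 sorts on the 2nd field only,
# and -n compares that field numerically.
by_price=$(echo "$prices" | LC_ALL=C sort -t, -k2,2 -n)
echo "$by_price"

# The cheapest ingredient is the first column of the first line.
cheapest=$(echo "$by_price" | head -n 1 | cut -d, -f1)
echo "$cheapest"
```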


uniq detects or filters out adjacent identical lines in its argument file or input stream.

$ cat duplicates
and one
and one
and two
and one
and two
and one, two, three
$ uniq duplicates
and one
and two
and one
and two
and one, two, three

As uniq only filters out adjacent identical lines, we can still see some lines more than once in its output. To filter out all duplicated lines from our duplicates file, we need to sort its content first.

$ sort duplicates | uniq
and one
and one, two, three
and two

uniq -c prefixes each line with its number of occurrences.

$ sort duplicates | uniq -c
   3 and one
   1 and one, two, three
   2 and two

uniq -u only displays the unique lines within its input.

$ sort duplicates | uniq -u
and one, two, three

uniq is particularly useful in conjunction with sort, as | sort | uniq allows you to remove any duplicate line in a file or a stream.
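
The opposite of uniq -u is uniq -d, which only prints the lines that appear more than once. This option is not covered above; the sketch below recreates the content of the duplicates file in a variable to show it.

```shell
# Same content as the duplicates file used in the examples above.
duplicates=$(printf '%s\n' 'and one' 'and one' 'and two' 'and one' 'and two' 'and one, two, three')

# Sort first so that identical lines are adjacent, then keep one copy
# of each line that occurs at least twice.
repeated=$(echo "$duplicates" | LC_ALL=C sort | uniq -d)
echo "$repeated"
```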


awk is a little more than a text processing tool: it’s actually a whole programming language of its own. One thing awk is really good at is splitting files into columns, and it especially shines when these files contain a mix and match of spaces and tabs.

$ cat -t multi-columns
John Smith    Doctor^ITardis
Sarah-James Smith^I    Companion^ILondon
Rose Tyler   Companion^ILondon

Note: cat -t displays tabs as ^I.

We can see that these columns are either separated by spaces or tabs, and that they are not always separated by the same number of spaces. cut would be of no use there, because it only works on a single character delimiter. awk, however, can easily make sense of that file.

awk '{ print $n }' prints the nth column in the text.

$ cat multi-columns | awk '{ print $1 }'
John
Sarah-James
Rose
$ cat multi-columns | awk '{ print $3 }'
Doctor
Companion
Companion
$ cat multi-columns | awk '{ print $1,$2 }'
John Smith
Sarah-James Smith
Rose Tyler

There is so much more we can do with awk; however, printing columns probably accounts for 99% of my personal usage.

{ print $NF } prints the last column in the line.
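
As a small taste of what lies beyond printing columns, awk can also split on a custom separator with -F and accumulate values across lines. The following sketch sums the second column of some hypothetical item,quantity rows.

```shell
# Hypothetical CSV data: item,quantity
sales=$(printf '%s\n' 'eggs,2' 'milk,3' 'butter,5')

# -F, splits on commas; the END block runs once all lines were read.
total=$(echo "$sales" | awk -F, '{ sum += $2 } END { print sum }')
echo "$total"

# $NF is the last column of each line, whatever the number of columns.
last_columns=$(echo "$sales" | awk -F, '{ print $NF }')
echo "$last_columns"
```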


tr stands for translate: it replaces characters with others. It works either on characters or on character classes, such as lowercase, printable, spaces, alphanumeric, etc.

tr <char1> <char2> translates all occurrences of <char1> from its standard input into <char2>.

$ echo "Computers are fast" | tr a A
Computers Are fAst

tr can also translate character classes by using the [:class:] notation. The full list of available classes is described in the tr man page, but we’ll demonstrate some of them here.

[:space:] represents all types of whitespace, from a simple space to a tab or a newline.

$ echo "computers are fast" | tr '[:space:]' ','
computers,are,fast,%

All space-like characters were translated into a comma. Note that the % character at the end of the output represents the lack of a trailing newline. Indeed, that newline was translated to a comma as well.

[:lower:] represents all lowercase characters, and [:upper:] represents all uppercase characters. Converting between cases is thus made very easy.

$ echo "computers are fast" | tr '[:lower:]' '[:upper:]'
COMPUTERS ARE FAST
$ echo "COMPUTERS ARE FAST" | tr '[:upper:]' '[:lower:]'
computers are fast

tr -c SET1 SET2 will transform any character not in SET1 into the characters in SET2. The following example replaces all non-vowels with spaces.

$ echo "computers are fast" | tr -c '[aeiouy]' ' '
 o  u e   a e  a

tr -d deletes the matched characters instead of replacing them. It is conceptually equivalent to translating them into nothing.

$ echo "computers are fast" | tr -d 'aeiouy'
cmptrs r fst

tr can also replace character ranges, for example all letters between a and e, or all numbers between 1 and 8, by using the notation s-e, where s is the start character and e is the end one.

$ echo "computers are fast" | tr 'a-e' 'x'
xomputxrs xrx fxst
$ echo "5uch l337 5p34k" | tr '1-4' 'x'
5uch lxx7 5pxxk

tr -s string1 compresses any multiple occurrences of the characters in string1 into a single one. One of the most useful uses of tr -s is to replace multiple consecutive spaces by a single one.

$ echo "Computers         are       fast" | tr -s ' '
Computers are fast
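
Squeezing repeated spaces is what makes cut usable on space-aligned columns. The sketch below assumes a hypothetical line where columns are separated by a varying number of spaces (and no tabs), and extracts its second column.

```shell
# Hypothetical line with irregular spacing between columns.
line='Rose   Tyler     Companion'

# Squeeze runs of spaces into a single one, then cut on that single space.
second=$(echo "$line" | tr -s ' ' | cut -d ' ' -f 2)
echo "$second"
```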


fold wraps each input line to fit in a specified width. It can be useful to make sure an argument text fits in a small display size for example. fold -w n folds the lines at n characters.

$ cat ~/Documents/readme | fold -w 16
Thanks again for
 reading this bo
ok!
I hope you are f
ollowing so far!

fold -s will only break lines on a space character, and can be combined with -w to fold up to a given number of characters.

$ cat ~/Documents/readme | fold -w 16 -s
Thanks again
for reading
this book!
I hope you are
following so
far!


sed is a non-interactive stream editor, used to perform text transformations on its input stream, on a line-by-line basis. It can take its input from a file or its stdin and will output its result either in a file or its stdout.

It works by taking one or many optional addresses, a function and parameters. A sed command thus looks like this:

[address[,address]]function[arguments]

While sed can perform many functions, we will cover only substitution, as it is probably sed's most common use.

Substituting text

A sed substitution command looks like this:

s/pattern/replacement/[options]

Example: replacing the first instance of a word for each line in a file

$ cat hello
hello hello
hello world
$ cat hello | sed 's/hello/Hey I just met you/'
Hey I just met you hello
Hey I just met you world

We can see that only the first occurrence of hello was replaced in the first line. To replace all occurrences of hello in each line, we can use the g (for global) option.

$ cat hello | sed 's/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
Hey I just met you world

sed allows you to specify any separator other than /, which is especially useful to keep the command readable if the search or replacement pattern contains forward slashes.

$ cat hello | sed 's@hello@Hey I just met you@g'
Hey I just met you Hey I just met you
Hey I just met you world
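
An alternative separator pays off as soon as the replacement itself contains slashes. The following sketch goes slightly beyond what is covered here: it assumes a sed supporting the -E option (extended regular expressions, available in both GNU and BSD sed) and uses capture groups, referenced as \1, \2 and \3, to rewrite a hypothetical date.

```shell
# Hypothetical input: a date in year-month-day format.
iso_date='2020-01-31'

# Each parenthesized group is captured and reused in the replacement.
# Using @ as the separator avoids escaping the slashes.
slash_date=$(echo "$iso_date" | sed -E 's@([0-9]+)-([0-9]+)-([0-9]+)@\3/\2/\1@')
echo "$slash_date"
```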

By specifying an address, we can tell sed on which line or line-range to actually perform the substitution.

$ cat hello | sed '1s/hello/Hey I just met you/g'
Hey I just met you hello
hello world
$ cat hello | sed '2s/hello/Hey I just met you/g'
hello hello
Hey I just met you world

The address 1 tells sed to only replace hello by Hey I just met you on line 1. We can specify an address range with the notation <start>,<end> where <end> can either be a line number or $, meaning the last line in the file.

$ cat hello | sed '1,2s/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
Hey I just met you world
$ cat hello | sed '2,3s/hello/Hey I just met you/g'
hello hello
Hey I just met you world
$ cat hello | sed '2,$s/hello/Hey I just met you/g'
hello hello
Hey I just met you world

By default, sed displays its result in its stdout, but it can also edit the initial file in-place, with the use of the -i option.

$ sed -i '' 's/hello/Bonjour/' sed-data
$ cat sed-data
Bonjour hello
Bonjour world

Note: On Linux (GNU sed), only -i needs to be specified. On macOS, however, the BSD version of sed expects a backup file extension right after -i, so an empty string '' must be passed to edit the file without keeping a backup.
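
One way to sidestep that difference is to always ask for a backup file, attaching the extension directly to -i. To my knowledge, this form is accepted by both GNU and BSD sed. The sketch below works in a temporary directory, and the .bak extension is an arbitrary choice.

```shell
# Work on a copy of the sample data in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'hello hello' 'hello world' > "$workdir/sed-data"

# -i.bak edits the file in place and keeps the original as sed-data.bak.
sed -i.bak 's/hello/Bonjour/' "$workdir/sed-data"

edited=$(cat "$workdir/sed-data")
backup=$(cat "$workdir/sed-data.bak")
echo "$edited"

rm -r "$workdir"
```

The backup file can simply be deleted once the result has been checked.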

Filtering a CSV using grep and awk

$ grep -w gauge metadata.csv | awk -F, '{ if ($4 == "query") { print $1, "per", $5 } }'
mysql.performance.com_delete per second
mysql.performance.com_delete_multi per second
mysql.performance.com_insert per second
mysql.performance.com_insert_select per second
mysql.performance.com_replace_select per second
mysql.performance.com_select per second
mysql.performance.com_update per second
mysql.performance.com_update_multi per second
mysql.performance.questions per second
mysql.performance.slow_queries per second
mysql.performance.queries per second

This example filters the lines containing the word gauge in our metadata.csv file using grep, then uses awk to keep the lines with the string query as their 4th column, and displays the metric name (1st column) with its associated per_unit_name value (5th column).
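
The same filtering could also be expressed with awk alone. The sketch below assumes, as the sample rows suggest, that the metric type is stored in the 2nd column; the three rows it runs on are made up for illustration.

```shell
# Hypothetical rows following the column layout of metadata.csv.
csv=$(printf '%s\n' \
  'metric_a,gauge,,query,second' \
  'metric_b,count,,query,second' \
  'metric_c,gauge,,event,second')

# A pattern placed before the braces replaces the explicit if statement.
result=$(echo "$csv" | awk -F, '$2 == "gauge" && $4 == "query" { print $1, "per", $5 }')
echo "$result"
```

Unlike grep -w gauge, which matches the word anywhere in the line, this version only checks the 2nd column, so a description mentioning gauge would not cause a false positive.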

Printing the IPv4 address associated with a network interface

$ ifconfig en0 | grep inet | grep -v inet6 | awk '{ print $2 }'

ifconfig <interface name> prints details associated with the argument network interface name. For example:

    ether 19:64:92:de:20:ba
    inet6 fe80::8a3:a1cb:56ae:7c7c%en0 prefixlen 64 secured scopeid 0x7
    inet netmask 0xffffff00 broadcast
    nd6 options=201<PERFORMNUD,DAD>
    media: autoselect
    status: active

We then grep for inet, which will match 2 lines.

$ ifconfig en0 | grep inet
    inet6 fe80::8a3:a1cb:56ae:7c7c%en0 prefixlen 64 secured scopeid 0x7
    inet netmask 0xffffff00 broadcast

We then exclude the IPv6 line by using grep -v.

$ ifconfig en0 | grep inet | grep -v inet6
inet netmask 0xffffff00 broadcast

We finally use awk to get the 2nd column in that line: the IPv4 address associated with our en0 network interface.

$ ifconfig en0 | grep inet | grep -v inet6 | awk '{ print $2 }'

Extracting a value from a config file

$ grep 'editor =' ~/.gitconfig  | cut -d = -f2 | sed 's/ //g'

We look for the editor = value in the current user's git configuration file, then cut over the = sign, get the second column and remove any space around that column.

$ grep 'editor =' ~/.gitconfig
     editor = /usr/bin/vim
$ grep 'editor =' ~/.gitconfig  | cut -d'=' -f2
$ grep 'editor =' ~/.gitconfig  | cut -d'=' -f2 | sed 's/ //'
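
Removing every space works for a path, but it would mangle a value that itself contains spaces. A possible variant, sketched below on a hypothetical configuration file written to a temporary location, only trims the spaces at the start and at the end of the value.

```shell
# Hypothetical git-like configuration file.
config=$(mktemp)
printf '[core]\n     editor = /usr/bin/vim\n' > "$config"

# Anchor the substitutions with ^ and $ to trim leading and trailing spaces only.
editor=$(grep 'editor =' "$config" | cut -d = -f 2 | sed 's/^ *//; s/ *$//')
echo "$editor"

# A value containing spaces keeps its inner spaces.
with_args=$(echo '     editor = code --wait' | cut -d = -f 2 | sed 's/^ *//; s/ *$//')
echo "$with_args"

rm -f "$config"
```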

Extracting IP addresses from a log file

The following real life example looks for the message Too many connections from in a database log file (which is followed by an IP address) and displays the 10 biggest offenders.

$ grep 'Too many connections from' db.log | \
  awk '{ print $12 }' | \
  sed 's@/@@' | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -n 10 | \
  awk '{ print $2 }'

Let's break down what this pipeline of commands does. First, let's look at what a log line looks like.

$ grep "Too many connections from" db.log | head -n 1
2020-01-01 08:02:37,617 [myid:1] - WARN  [NIOServerCxn.Factory:] - Too many connections from / - max is 60

awk '{ print $12 }' then extracts the IP from the line.

$ grep "Too many connections from" db.log | awk '{ print $12 }'

sed 's@/@@' removes the leading slash from the IPs.

$ grep "Too many connections from" db.log | awk '{ print $12 }' | sed 's@/@@'

Note: As we have previously seen, we can use whatever separator we want for sed. While / is commonly used as a separator, we are currently replacing that very character, which would make the substitution expression slightly less readable:

sed 's/\///'

sort | uniq -c sorts the IPs lexicographically, and then removes duplicates while prefixing IPs with their associated number of occurrences.

$ grep 'Too many connections from' db.log | \
  awk '{ print $12 }' | \
  sed 's@/@@' | \
  sort | \
  uniq -c

sort -rn | head -n 10 sorts the lines by the number of occurrences, numerically and in the reversed order, which displays the biggest offenders first, 10 of which are displayed. The final awk { print $2 } extracts the IPs themselves.

$ grep 'Too many connections from' db.log | \
  awk '{ print $12 }' | \
  sed 's@/@@' | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -n 10 | \
  awk '{ print $2 }'
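
To experiment with this pipeline without a real database log, one can generate a few lines in the same format. In the sketch below, the IP addresses and the log content are entirely made up, and only the top 2 offenders are kept.

```shell
# Generate a hypothetical log: 10.0.0.1 appears 3 times, 10.0.0.2 twice.
log=$(mktemp)
for ip in 10.0.0.1 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.1 10.0.0.2; do
  echo "2020-01-01 08:02:37,617 [myid:1] - WARN  [NIOServerCxn.Factory:] - Too many connections from /$ip - max is 60"
done > "$log"

# Same pipeline as above, limited to the 2 biggest offenders.
top=$(grep 'Too many connections from' "$log" | \
  awk '{ print $12 }' | \
  sed 's@/@@' | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -n 2 | \
  awk '{ print $2 }')
echo "$top"

rm -f "$log"
```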

Renaming a function in a source file

Let's imagine that we are working on a code project, and we would like to rename a poorly named function (or class, variable, etc.) in a code file. We can do this by using sed -i, which performs an in-place replacement in a file.

$ cat izk/utils.py
def bool_from_str(s):
    if s.isdigit():
        return int(s) == 1
    return s.lower() in ['yes', 'true', 'y']
$ sed -i 's/def bool_from_str/def is_affirmative/' izk/utils.py
$ cat izk/utils.py
def is_affirmative(s):
    if s.isdigit():
        return int(s) == 1
    return s.lower() in ['yes', 'true', 'y']

Note: Use sed -i '' instead of sed -i on macOS, as its sed version behaves slightly differently.

However, we've only renamed this function in the file it was defined in. Any other file that imports bool_from_str will now be broken, as this function is not defined anymore. We'd need a way to rename bool_from_str everywhere it is found in our project. We can achieve just that by using grep, sed, and either for loops or xargs.

Going further: for loops and xargs

To replace all occurrences of bool_from_str in our project, we first need to recursively find them using grep -r.

$ grep -r bool_from_str .
./tests/test_utils.py:from izk.utils import bool_from_str
./tests/test_utils.py:def test_bool_from_str(s, expected):
./tests/test_utils.py:    assert bool_from_str(s) == expected
./izk/utils.py:def bool_from_str(s):
./izk/prompt.py:from .utils import bool_from_str
./izk/prompt.py:                    default = bool_from_str(os.environ[envvar])

As we are only interested in the matching files, we also need to use the -l/--files-with-matches option:

-l, --files-with-matches
        Only the names of files containing selected lines are written to standard
        output.  grep will only search a file until a match has been found, making
        searches potentially less expensive.  Pathnames are listed once per file
        searched.  If the standard input is searched, the string ``(standard input)''
        is written.

$ grep -r --files-with-matches bool_from_str .

We can then use the xargs command to perform an action on each line in the output (each file containing the bool_from_str string).

$ grep -r --files-with-matches bool_from_str . | \
  xargs -n 1 sed -i 's/bool_from_str/is_affirmative/'

-n 1 tells xargs that each line in the output should cause a separate sed command to be executed.

The following commands were then executed:

$ sed -i 's/bool_from_str/is_affirmative/' ./tests/test_utils.py
$ sed -i 's/bool_from_str/is_affirmative/' ./izk/utils.py
$ sed -i 's/bool_from_str/is_affirmative/' ./izk/prompt.py

If the command you call with xargs (sed, in our case) supports multiple arguments, you can drop the -n 1 argument and run

grep -r --files-with-matches bool_from_str . | xargs sed -i 's/bool_from_str/is_affirmative/'

which will then execute

$ sed -i 's/bool_from_str/is_affirmative/' ./tests/test_utils.py ./izk/utils.py ./izk/prompt.py

We can see that sed can take multiple arguments by looking at its synopsis, in its man page.

     sed [-Ealn] command [file ...]
     sed [-Ealn] [-e command] [-f command_file] [-i extension] [file ...]

Indeed, as we've seen in the previous chapter, file ... means that multiple arguments representing file names are accepted.

We can see that all bool_from_str occurrences have been replaced.

$ grep -r is_affirmative .
./tests/test_utils.py:from izk.utils import is_affirmative
./tests/test_utils.py:def test_is_affirmative(s, expected):
./tests/test_utils.py:    assert is_affirmative(s) == expected
./izk/utils.py:def is_affirmative(s):
./izk/prompt.py:from .utils import is_affirmative
./izk/prompt.py:                    default = is_affirmative(os.environ[envvar])
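
One caveat: xargs splits its input on whitespace by default, so a file name containing a space would be passed to sed as two separate arguments. A possible workaround, sketched below, is to have grep separate file names with a NUL character (--null) and to tell xargs to split on it (-0). The project layout is hypothetical, and sed -i.bak is used because, to my knowledge, that form works with both GNU and BSD sed.

```shell
# Hypothetical project in a temporary directory, with a space in a file name.
project=$(mktemp -d)
printf 'from utils import bool_from_str\n' > "$project/my prompt.py"
printf 'def bool_from_str(s):\n    return s == "yes"\n' > "$project/utils.py"
printf 'nothing to rename here\n' > "$project/notes.txt"

# NUL-separated file names survive spaces all the way to sed.
grep -r --null --files-with-matches bool_from_str "$project" | \
  xargs -0 sed -i.bak 's/bool_from_str/is_affirmative/'

# Remove the backups, then count the files containing each name.
rm -f "$project"/*.bak
renamed=$(grep -r --files-with-matches is_affirmative "$project" | wc -l | tr -d ' ')
leftover=$(grep -r --files-with-matches bool_from_str "$project" | wc -l | tr -d ' ')
echo "renamed: $renamed, leftover: $leftover"

rm -r "$project"
```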

As is often the case, there are multiple ways of achieving the same result. Instead of using xargs, we could have used for loops, which allow you to iterate over a list of items and perform an action on each element. These for loops have the following syntax:

for item in list; do
    command $item
done

By wrapping our grep command in $(), we cause the shell to execute it in a subshell, the output of which is then iterated over by the for loop.

$ for file in $(grep -r --files-with-matches bool_from_str .); do
  sed -i 's/bool_from_str/is_affirmative/' "$file"
done

which will execute

$ sed -i 's/bool_from_str/is_affirmative/' ./tests/test_utils.py
$ sed -i 's/bool_from_str/is_affirmative/' ./izk/utils.py
$ sed -i 's/bool_from_str/is_affirmative/' ./izk/prompt.py

I tend to find the for loop syntax clearer than xargs's. xargs can however execute the commands in parallel using its -P n option, where n is the maximum number of parallel commands to be executed at a time, which can be a performance win if your command takes time to run.
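
A for loop over $(...) splits the output of the command on any whitespace, so it shares the file-name-with-spaces caveat of plain xargs. A common alternative, sketched here on a hypothetical pair of files, is to pipe the file list into a while read loop, which processes the input one full line at a time.

```shell
# Hypothetical project in a temporary directory.
project=$(mktemp -d)
printf 'bool_from_str\n' > "$project/with space.py"
printf 'bool_from_str\n' > "$project/plain.py"

# IFS= and -r make read return each line untouched, spaces included.
grep -r --files-with-matches bool_from_str "$project" | while IFS= read -r file; do
  sed -i.bak 's/bool_from_str/is_affirmative/' "$file"
done

spaced=$(cat "$project/with space.py")
plain=$(cat "$project/plain.py")
echo "$spaced $plain"

rm -r "$project"
```

This still assumes that file names do not contain newlines, which is usually a safe bet in a code project.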


All these tools open up a world of possibilities, as they allow you to extract data and transform its format, making it possible to build entire workflows out of commands that were possibly never intended to work together. Each of these commands has a relatively small function (sort sorts, cat concatenates, grep filters, sed edits, cut cuts, etc.).

Any given task involving text can then be reduced to a pipeline of smaller tasks, each of them performing a simple action and piping its output into the next task.

For example, if we wanted to know how many unique IPs could be found in a log file, and these IPs always appeared in the same column, we could:

  • grep lines on a pattern specific to lines containing an IP address
  • locate the column the IPs appear, and extract them with awk
  • sort the list of IPs with sort
  • compute the list of unique IPs with uniq
  • count the number of lines (aka, of unique IPs) with wc -l
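
The steps above can be sketched as a single pipeline. The log format, the ' from ' pattern and the column number used below are assumptions made for the sake of the example.

```shell
# Hypothetical log: 4 request lines (3 distinct IPs) and 1 unrelated line.
log=$(mktemp)
printf 'GET /index.html from %s\n' 10.0.0.1 10.0.0.2 10.0.0.1 10.0.0.3 > "$log"
printf 'server restarted\n' >> "$log"

# grep the relevant lines, extract the IP column, deduplicate and count.
# tr removes the padding some wc implementations add around the count.
unique_ips=$(grep ' from ' "$log" | awk '{ print $4 }' | sort | uniq | wc -l | tr -d ' ')
echo "$unique_ips"

rm -f "$log"
```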

As there is a plethora of text processing tools, either available by default or installable, there are bound to be many ways to solve any given task.

The examples in this article were contrived, but I suggest you read the amazing article Command-line Tools can be 235x Faster than your Hadoop Cluster to get a sense of how useful and powerful these text processing commands really are, and what real-life problems they can solve.


2.1: Count the number of files and directories located in your home directory.

2.2: Display the content of a file in all caps.

2.3: Count how many times each word was found in a file.

2.4: Count the number of vowels present in a file. Display the result from the most common to the least.

Essential Tools and Practices for the Aspiring Software Developer is a self-published book project by Balthazar Rouberol and Etienne Brodu, ex-roommates, friends and colleagues, aiming at empowering the up and coming generation of developers. We currently are hard at work on it!

The book will help you set up a productive development environment and get acquainted with tools and practices that, along with your programming languages of choice, will go a long way in helping you grow as a software developer. It will cover subjects such as mastering the terminal, configuring and getting productive in a shell, the basics of code versioning with git, SQL basics, tools such as Make, jq and regular expressions, networking basics as well as software engineering and collaboration best practices.

If you are interested in the project, we invite you to join the mailing list!



