
Chris James

Fun with Bash

This was originally posted on my blog a while ago

I was given a fairly mundane task at work, and when this happens I try to find ways to make it interesting. I will usually find a way to learn something new to do the task, so I can always get something out of the work I do, even if it's dull.

I would quite like to get better at using the legendary tools available to me on the command line. I have never felt super confident doing text processing there, often giving up and just going to my editor instead.

I am not going to exhaustively document every command here; if you're curious about the details, consult man $command just like I did! This post just illustrates how, with a little research, you can quickly do some rad text processing.

The task

I am trying to analyse how much memory we are provisioning compared to actual usage for some of our services to see if our AWS bill can be cut down a bit.

The data

I thought it would be fun (??) to use CSV, as the data feels CSV-like. Here is a sample of the data I captured.

name, usage (mb), allocated (mb), CPU %, containers
dispatcher, 150, 512, 40, 10
assembler, 175, 512, 75, 10
matcher, 85, 512, 15, 10
user-profile, 128, 512, 40, 5
profile-search, 220, 512, 80, 10
reporter, 90, 512, 40, 10 
mailgun-listener, 90, 512, 10, 5
unsubscribe, 64, 512, 3, 5
bounce, 8, 128, 0.5, 3
legacy-reporting, 30, 512, 15, 3
content-store, 80, 256, 30, 10
legacy-alert-poller, 64, 256, 1, 1
migrator, 80, 256, 10, 5
entitlements-update, 150, 256, 70, 3

Display it nice

This is nice and easy: column -s, -t < data.csv

-t tells column to determine the number of columns the input contains and lay it out as a table, and -s specifies the set of characters to delimit by. If you don't specify -s it defaults to splitting on whitespace.

name                  usage (mb)   allocated (mb)   CPU %   containers
dispatcher            150          512              40      10
assembler             175          512              75      10
matcher               85           512              15      10
user-profile          128          512              40      5
profile-search        220          512              80      10
reporter              90           512              40      10 
mailgun-listener      90           512              10      5
unsubscribe           64           512              3       5
bounce                8            128              0.5     3
legacy-reporting      30           512              15      3
content-store         80           256              30      10
legacy-alert-poller   64           256              1       1
migrator              80           256              10      5
entitlements-update   150          256              70      3
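
-s works with any delimiter, not just commas. For instance, on a Linux or macOS box (just a throwaway example, nothing to do with the task):

column -s: -t < /etc/passwd

lines up the colon-separated fields of the password file in exactly the same way.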

Sorting by usage

cat data.csv | sort -n --field-separator=',' --key=2 | column -s, -t

name                  usage (mb)   allocated (mb)   CPU %   containers
bounce                8            128              0.5     3
legacy-reporting      30           512              15      3
legacy-alert-poller   64           256              1       1
unsubscribe           64           512              3       5
content-store         80           256              30      10
migrator              80           256              10      5
matcher               85           512              15      10
mailgun-listener      90           512              10      5
reporter              90           512              40      10 
user-profile          128          512              40      5
dispatcher            150          512              40      10
entitlements-update   150          256              70      3
assembler             175          512              75      10
profile-search        220          512              80      10

--key=2 means sort by the second column
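
The same trick works for any column. For example (not from the original task, just to illustrate), sorting by CPU % instead, which is column 4:

cat data.csv | sort -n --field-separator=',' --key=4 | column -s, -t

The header line sorts to the top again here because its key isn't a number, so under -n it effectively counts as 0.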

Using awk to figure out the memory differences

What we're really interested in is the difference between the amount of memory provisioned vs usage.

awk -F , '{print $1, $3-$2}' data.csv
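
-F , sets the field separator to a comma, and $1, $2, $3 refer to the fields on each line, so $3-$2 is allocated minus usage. To see it on a single row (just an illustration):

echo 'dispatcher, 150, 512, 40, 10' | awk -F , '{print $1, $3-$2}'

which prints dispatcher 362.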

Let's pipe that into column again

awk -F , '{print $1, $3-$2}' data.csv | column -t

name                 0
dispatcher           362
assembler            337
matcher              427
user-profile         384
profile-search       292
reporter             422
mailgun-listener     422
unsubscribe          448
bounce               120
legacy-reporting     482
content-store        176
legacy-alert-poller  192
migrator             176
entitlements-update  106

This is nice but it would be good to ignore the first line.

awk -F , '{print $1, $3-$2}' data.csv | tail -n +2

tail -n X prints the last X lines, the plus inverts it so it's the first X lines.

Sort mk 2

Now we have some memory differences, it would be handy to sort them so we can address the most inefficient configurations first.

awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2

And of course use column again to make it look pretty

awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2 | column -t

legacy-reporting     482
unsubscribe          448
matcher              427
reporter             422
mailgun-listener     422
user-profile         384
dispatcher           362
assembler            337
profile-search       292
legacy-alert-poller  192
migrator             176
content-store        176
bounce               120
entitlements-update  106
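
Not in the original post, but the same pieces can be rearranged to show usage as a percentage of what's allocated, which makes the over-provisioned services stand out regardless of their absolute size (tail goes first here so awk never divides by the header's non-numeric fields):

tail -n +2 data.csv | awk -F , '{printf "%s %.0f\n", $1, 100*$2/$3}' | sort -n --key=2 | column -t

legacy-reporting and bounce end up near the top, using only around 6% of their allocation.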

WTF

There it is! The utterly indecipherable bash command that someone reads 6 months later and scratches their head at. In fact it has been two weeks since I wrote the first draft of this, and I look at the final command and weep.

It is very easy to throw up your hands when you see a shell script that doesn't make sense, but there are things you can do.

Remember that the process will usually start small, like it did here: one command, piped into another, into another. This gives a lazy dev like me the impression that it is one complicated command, but all it really is is a series of steps to process some data. So if you're struggling, you can wind it back for yourself by deleting steps from the end and seeing what happens.
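
For example, working backwards from the final command (each line is just the previous one with the last stage chopped off):

awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2 | column -t
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2
awk -F , '{print $1, $3-$2}' data.csv | tail -n +2
awk -F , '{print $1, $3-$2}' data.csv

Run each one and the difference between the outputs tells you what the stage you removed was doing.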

If it is an important business process that needs to be understood for a long time, you're probably better off writing it in a language where you can write some automated tests around it.

But a lot of the work we do is ad-hoc and doesn't reside in "the codebase", where you can easily get into a TDD rhythm to accomplish something shiny. Often you have to do something a little boring, and sometimes the tools already on your computer can really help you out. They're so old and well established that you can find tons of documentation and tips, so dive in!

If you need more reasons to get to know the shell better, read how command line tools can be up to 235x faster than your Hadoop cluster.

Oldest comments (5)

Mathieu PATUREL

Cool command! You could remove the tail bit though with awk.

awk 'condition { command }'

By default the condition is empty, which matches every line (so the command is run on every line). You have access to different variables (in both the condition and the command, I think), one of them being NR for the line number.

So, if you set the condition to NR!=1, it'll run on every line different than 1 (which is the first one).
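
For instance, a toy example:

printf 'one\ntwo\nthree\n' | awk 'NR!=1 {print NR, $0}'

prints 2 two and 3 three, skipping line 1.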

So, your command can be shortened from:

awk -F , '{print $1, $3-$2}' data.csv | tail -n +2 | sort -r -n --key=2 | column -t

to

awk -F , 'NR!=1{print $1, $3-$2}' data.csv | sort -r -n --key=2 | column -t

:smile:

I learned a bunch otherwise, so thanks!

Chris James

Thanks for the improvements!

Ben Sinclair

I've never used column before. That's pretty cool.
As far as the WTFness of it all goes, if you split it up and assign variables with nice names it'll be easy enough to understand. There's something about shell scripts though that makes people write things as tersely as possible. I'd end up with something like that and refer back to it in my history if I needed it again soon after, but if I made it into a script I'd either comment the hell out of it or split it into parts, I think.
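
Something like this sketch, say (names invented for the example):

#!/usr/bin/env bash
# Report services by how much memory they're over-provisioned,
# biggest difference first.

data_file="data.csv"

# name + (allocated - used), skipping the header row
memory_diffs() {
  awk -F , '{print $1, $3-$2}' "$data_file" | tail -n +2
}

memory_diffs | sort -r -n --key=2 | column -t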

Bugfix:

tail -n X prints the last X lines, the plus inverts it so it's the first X lines.

if it did that, it'd be head :) The plus means take the offset from the start of the file instead of the end.
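
For example:

seq 5 | tail -n 2    # last two lines: 4 5
seq 5 | tail -n +2   # from line 2 onwards: 2 3 4 5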

Vlastimil Pospichal

data.csv

name, usage (mb), allocated (mb), CPU %, containers
dispatcher, 150, 512, 40, 10
assembler, 175, 512, 75, 10
matcher, 85, 512, 15, 10
"user, profile", 128, 512, 40, 5
profile-search, 220, 512, 80, 10

column -s, -t < data.csv

name             usage (mb)   allocated (mb)   CPU %   containers
dispatcher       150          512              40      10
assembler        175          512              75      10
matcher          85           512              15      10
"user            profile"     128              512     40           5
profile-search   220          512              80      10

Not applicable to general CSV.

Khillo81

May I suggest csvkit as a means to work with CSV data at the command line? It has several nifty tools to handle CSV files including getting rid of the column headers, splitting and merging several tables by columns, and even has ways of converting Excel xlsx tables to csv. I would recommend giving it a try if you have to deal with these types of files at the command line.
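
For example (assuming csvkit is installed):

csvsort -c 2 data.csv | csvlook

csvsort sorts on the usage column, csvlook renders the result as a readable table, and because these tools actually parse CSV, quoted fields like "user, profile" stay in one piece.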