DEV Community

Running Linux commands in parallel

Sérgio Araújo
I am a Free Software enthusiast and a (neo)?vim addict. I also like shell scripting, sed, and awk, and as you can see, I love regular expressions.

Intro

In my daily routine I like to watch TV series to improve my English. Sometimes I need to read the transcripts to figure out what people are saying, so I usually download the transcripts for the whole series.

Many sites publish those transcripts as HTML. To convert a web page to plain text I usually do:

lynx --dump URL > result.txt

Some sites inspect the "User-Agent" header to detect whether you are using a forbidden download tool. So I have the following alias:

alias lynx='lynx -display_charset=utf-8 -useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.79 Safari/537.1"'

Any site that checks this header will assume that my lynx is actually a Chrome browser.
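
One caveat: an alias only exists in the interactive shell where it was defined, so scripts and commands launched by other tools will not see it. Here is a minimal sketch, assuming bash, that keeps the same settings in an exported function instead. The names "ua" and "dump_page" are my own, not part of lynx.

```shell
#!/usr/bin/env bash
# Same settings as the alias, stored in exported names so child shells see them.
ua='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.79 Safari/537.1'

# dump_page is a hypothetical helper name, not part of lynx.
dump_page() {
    lynx -display_charset=utf-8 -useragent="$ua" -dump "$1"
}

export ua
export -f dump_page   # bash-only: makes the function visible to child bash processes

# Usage (needs network access): dump_page URL > result.txt
```

Whether you need this depends on how the download is launched; for one-off commands typed at the prompt the alias is enough.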

Filtering the content

Let's say I want to find out which episodes contain this string:

A little more, a little more.

For those cases I use Ag (The Silver Searcher), a faster alternative to grep. The command ends up being:

ag 'A little more, a little more.' -i .

The result:

friends-transcript-s10e16.txt
328:A little more, a little more.

The first line gives me the file name, the second one gives me the line number and its content.
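
If ag is not installed, plain grep can do the same search. The sketch below builds a tiny sample file (the file content is made up for the demonstration) so it can be tried offline. It uses -F so the sentence is matched as a literal string; ag treats the pattern as a regular expression, where the final "." matches any character.

```shell
# Build a tiny sample transcript so the search can be tried offline.
dir=$(mktemp -d)
printf 'Hi!\nA little more, a little more.\nBye.\n' > "$dir/friends-transcript-s10e16.txt"

# -r recurse, -n show line numbers, -i ignore case, -F literal string (no regex)
result=$(cd "$dir" && grep -rniF 'a little more, a little more.' .)
echo "$result"
# ./friends-transcript-s10e16.txt:2:A little more, a little more.
```

grep prints the file name, line number and content on a single line, while ag groups the matches under each file name.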

Downloading all the transcriptions

The real problem is getting the transcripts in the first place: how can I download all of them at once?

I have found a site where I can download every transcript with the lynx command above, but how do I pass all the different URL variations for each season and episode?

GNU Parallel comes in handy

First I set a "main url" variable:

url='https://www.springfieldspringfield.co.uk/view_episode_scripts.php\?tv-show\=friends\&episode\='

How can I pass GNU Parallel a range of 10 seasons with 24 episodes each? Simple, using the shell's brace expansion like this:

echo friends-s{01..10}e{01..24}.txt

The last season has only 18 episodes, so the range needs a special case. @samvittighed helped me with this (see the comments section):

s{01..09}e{01..24} s10e{01..18}
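
Before starting hundreds of downloads it can help to check what the expansion actually produces. This is a small sketch, assuming bash 4 or newer (older versions do not zero-pad the numbers); the array name "episodes" is only for illustration.

```shell
#!/usr/bin/env bash
# Collect the episode codes in an array so they can be counted before use.
episodes=( s{01..09}e{01..24} s10e{01..18} )

echo "${#episodes[@]}"                     # 234, i.e. 9 * 24 + 18
echo "${episodes[0]}"                      # s01e01
echo "${episodes[${#episodes[@]} - 1]}"    # s10e18
```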

And what does the download command look like?

parallel --verbose -j20 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s{01..09}e{01..24} s10e{01..18}

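Remember, the $url variable was defined above. The backslashes in it are deliberate: GNU Parallel passes the assembled command line to a shell, so the ?, = and & in the URL must arrive escaped (an unescaped & would send lynx to the background and cut the URL short). The '>' is quoted for the same reason, so that the redirection happens once per job instead of once for the whole parallel command.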

Is it possible to do the same in pure shell script?

Yes, but it is hugely slower than using GNU Parallel. It would look like this:

#!/usr/bin/env bash
# Download every Friends transcript sequentially, one lynx call per episode.
mainurl='https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=friends&episode='

clear
echo "--------------------------------------------------------"
echo "Downloading transcripts for all Friends episodes"
echo "--------------------------------------------------------"

for i in {01..10}; do
    season=s${i}
    echo

    for j in {01..24}; do
        episode=e${j}

        echo "Downloading season $i episode $j"
        lynx --dump "${mainurl}${season}${episode}" > "friends-${season}${episode}.txt"

        # Season 10 has only 18 episodes, so stop after s10e18.
        if [ "$i" = 10 ] && [ "$j" = 18 ]; then
            exit
        fi
    done
done

GNU Parallel is so much faster here because, with -j20, it runs up to 20 downloads at a time instead of fetching the files one by one.
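
For completeness, plain bash can also run jobs concurrently with "&" and "wait". The sketch below is a rough approximation under my own assumptions, not a replacement for GNU Parallel: "run_batched" and "fetch" are hypothetical helper names, and each batch waits for its slowest download before the next one starts.

```shell
#!/usr/bin/env bash
mainurl='https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=friends&episode='

# run_batched MAX CMD ARG...: run "CMD ARG" for every ARG, MAX jobs per batch.
# (run_batched and fetch are hypothetical helper names.)
run_batched() {
    local max=$1 cmd=$2 running=0 arg
    shift 2
    for arg in "$@"; do
        "$cmd" "$arg" &                  # start the job in the background
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait                         # let the whole batch finish first
            running=0
        fi
    done
    wait                                 # final, possibly partial, batch
}

# Download one episode, e.g. fetch s01e01
fetch() {
    lynx --dump "${mainurl}$1" > "friends-$1.txt"
}

# Usage (needs network access):
# run_batched 20 fetch s{01..09}e{01..24} s10e{01..18}
```

GNU Parallel keeps all 20 slots busy by starting a new job as soon as one finishes, which is one reason to prefer it over a hand-rolled loop like this.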

If you have any tips to improve this article, feel free to share them.

Cleaning all the files with vim

After downloading all files I wanted to delete the first 18 lines of each of them and also the last paragraph. So, I did:

vim *.txt
:silent argdo normal! gg18dd
:silent argdo normal! G{kdG
:silent argdo update
:qall

The first "argdo" runs a normal-mode command on each file: "gg" jumps to the first line and "18dd" deletes 18 lines. The second "argdo" jumps to the end of the file with "G", goes back one paragraph with "{", moves up one line with "k" and deletes to the end of the file with "dG". The ":silent argdo update" command writes every modified file, and the ":qall" command quits all of them.

Editing the files with sed

Getting rid of the undesired lines with sed is pretty simple:

sed -i '1,18d' *.txt
sed -i '/References/,$d' *.txt

The first command deletes the first 18 lines. The second one deletes from "References" until the end of the file.
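
Since "sed -i" overwrites the files, it is worth trying the expressions on a sample first. The sketch below uses a made-up file with 18 header lines, one transcript line and a "References" footer, and applies both deletions in a single pass without -i.

```shell
# Build a sample file: 18 header lines, one transcript line, then a References footer.
sample=$(mktemp)
{
    seq 1 18 | sed 's/^/header line /'
    echo 'Scene one'
    echo 'References'
    echo '1. https://example.com/'
} > "$sample"

# Both deletions in one pass; without -i the file itself is left untouched.
cleaned=$(sed -e '1,18d' -e '/References/,$d' "$sample")
echo "$cleaned"    # prints: Scene one
```

Once the output looks right, the same two expressions can be run with -i on the real files.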

Discussion (2)

Sam Vittighed

parallel --verbose -j20 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s{01..09}e{01..24} s10e{01..18}

Sérgio Araújo Author

Thanks a lot! I have been thinking about how much we learn as soon as we start sharing knowledge, precisely because other kind people are always willing to help. Actually, I am thinking of writing an article about this. It would be a collaborative article, to reinforce the idea. What do you think?