Intro
In my daily routine I like to watch TV series to improve my English. Sometimes I need to read the transcripts to figure out what people are saying, so I usually download the transcripts for the whole series.
Many sites publish those transcripts in HTML. To convert a web page to plain text I usually do:
lynx --dump URL > result.txt
Some sites check the "user-agent" header to detect forbidden download tools, so I have the following alias:
alias lynx='lynx -display_charset=utf-8 -useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.79 Safari/537.1"'
Any site will now assume that my lynx is actually a regular Chrome browser.
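To check that the spoofed user agent is really being sent, you can dump a page that simply echoes it back. A quick sanity check, assuming httpbin.org is reachable from your machine:
lynx --dump https://httpbin.org/user-agent
It should print the Chrome user-agent string defined in the alias.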
Filtering the content
Let's say I want to figure out what episodes have this string:
A little more, a little more.
For those cases I use ag, also known as The Silver Searcher, a faster grep-like tool. The command ends up being:
ag 'A little more, a little more.' -i .
The result:
friends-transcript-s10e16.txt
328:A little more, a little more.
The first line gives me the file name, the second one gives me the line number and its content.
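If you don't have ag installed, plain grep can do the same search, just a bit slower. An equivalent command using only standard grep flags (recursive, case-insensitive, with line numbers):
grep -rin 'A little more, a little more.' .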
Downloading all the transcripts
The real problem is getting the transcripts in the first place: how can I download all of them at once?
I have found a site where I can download all the transcripts using the lynx command above, but how do I pass all the different URL variations with season and episode?
GNU Parallel comes in handy
First I set a "main url" variable:
url='https://www.springfieldspringfield.co.uk/view_episode_scripts.php\?tv-show\=friends\&episode\='
How can I pass GNU Parallel a range of 10 seasons with 24 episodes each? Simple, using shell brace expansion like this:
echo friends-s{01..10}e{01..24}.txt
The last season has only 18 episodes, and @samvittighed helped me with this (see comments section):
s{01..09}e{01..24} s10e{01..18}
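A quick way to sanity-check the expansion is to count how many items it produces; 9 full seasons of 24 episodes plus 18 episodes of season 10 should give 234 entries:
echo s{01..09}e{01..24} s10e{01..18} | wc -w
234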
And what does the download command look like?
parallel --verbose -j20 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s{01..09}e{01..24} s10e{01..18}
Remember, the $url variable was defined above.
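If you want to inspect the exact commands GNU Parallel will run before firing off 20 downloads, the --dry-run option prints them without executing anything. A small preview limited to the first three episodes of season one:
parallel --dry-run -j20 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s01e{01..03}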
Is it possible to do the same using pure shell script?
Yes, but it is hugely slower than using GNU Parallel. It would look like this:
mainurl='https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=friends&episode='
clear
echo "--------------------------------------------------------"
echo "Downloading transcripts for all Friends episodes"
echo "--------------------------------------------------------"
for i in {01..10}; do
    season=s${i}
    echo
    for j in {01..24}; do
        episode=e${j}
        echo "Downloading season $i episode $j"
        lynx --dump "${mainurl}${season}${episode}" > "friends-transcript-${season}${episode}.txt"
        # stop after the last real episode: season 10 only has 18
        if [ "$i" == 10 ] && [ "$j" == 18 ]; then
            exit
        fi
    done
done
The speed of GNU Parallel is amazing here, because it downloads 20 files at a time instead of one by one sequentially.
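If you want to see how much the parallelism actually matters, you can time the same GNU Parallel command with one job at a time versus 20 jobs. This is just a rough benchmark sketch, restricted to season one so it finishes quickly:
time parallel -j1 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s01e{01..24}
time parallel -j20 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s01e{01..24}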
If you have any tip to improve this article feel free to share.
Cleaning all the files with vim
After downloading all files I wanted to delete the first 18 lines of each of them and also the last paragraph. So, I did:
vim *.txt
:silent argdo normal! gg18dd
:silent argdo normal! G{kdG
:silent argdo update
:qall
The first "argdo
" runs a normal command, jumping to the first line of each file with "gg
The second "argdo
" command jumps to the end of the file "G
", goes back one paragraph with "{
", jumps up one line with "k
" and delete to the end with "dG
". The ":silent argdo update
" command writes all of the files and the ":qall
" command exists from all the files.
Editing the files with sed
Getting rid of undesired lines with sed is pretty simple:
sed -i '1,18d' *.txt
sed -i '/References/,$d' *.txt
The first command deletes the first 18 lines. The second one deletes from "References" until the end of the file.
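Both deletions can also be combined into a single sed call with two -e expressions. One caveat: on macOS the BSD sed requires a (possibly empty) backup suffix right after -i, so the in-place flag looks slightly different there:
sed -i -e '1,18d' -e '/References/,$d' *.txt
sed -i '' -e '1,18d' -e '/References/,$d' *.txt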
Comments
parallel --verbose -j20 lynx --dump $url{} '>' friends-transcript-{}.txt ::: s{01..09}e{01..24} s10e{01..18}
Thanks a lot! I have been thinking about how much we learn as soon as we start sharing knowledge, exactly because other kind people are always willing to help. Actually, I am thinking of writing an article about this; it would be a collaborative article to reinforce the idea. What do you think?