Boyd Duffee
Faster tetranucleotide (k-mer) frequencies!

I saw Jennifer's post about re-writing her perl scripts in python and how she saw a 2.5 times improvement.

How could this be? My favourite language can't be that slow.
It must be programmer error.

I have an interest in Perl and Science, so time to roll up sleeves and learn me some profiling/benchmarking. What follows is my internal monologue and the notes I scribbled down during the learning process. For those that want to follow along, I've created a small repo.

Getting started

Sartre said that Hell is other people's code. My first step was to re-write it into Modern Perl and, in the process, understand what each line does. When it's written idiomatically, it's easier to refactor and I should be able to make some minor performance improvements along the way.

Assume that the original script has been tested enough. For me to be correct, I've got to produce the exact same output. I got close, except for the header line.
line 99 print OUT "\t$prefix_$j"; becomes
line 89 print $out_fh "\t$j";
Yes, the original line has a bug: Perl reads $prefix_ as a variable called $prefix_, which doesn't exist, so no prefix ever made it into the header.

Search "benchmarking tools for linux" and decide that hyperfine is good for what I'm doing. Run Jennifer's new python script against my refactored perl and find that the python is 1.26 times faster for k=3 and 1.47 times faster for k=4. For the Covid-19 sequence, these are both on the order of hundreds of milliseconds.

hyperfine --warmup 3 'perl/get_kmer_frequencies.pl Covid-19_seq.fasta 3 boyd1' 'python/get_kmer_frequencies.py -i Covid-19_seq.fasta -k 3 -p boyd2'

Ok, not bad. Better than the 2.5 times faster she reported, but that's probably down to the way that hyperfine warms the cache and separates out User time from System time.

Oh, I should just check how much I improved when I refactored. Run it against Jennifer's original perl script and ... hers was 1.1 times faster. Well, that was a bit embarrassing.

ahem I was ... aiming at improving readability, .. maintainability, y'know best practice and all that. That's my story and I'm sticking to it. ;)

For sanity's sake

Check that the output of the new script is the same as the original's, otherwise you've messed up the refactoring. I started using this test script with prove to make it quick and easy.
Saved as i.t, I run it with prove i.t for the lols.
It gets noisy when there's a problem, so I go back to running it by hand.

use Test2::V0;

# The original script is the standard; the newest get_kmer* file is the refactor under test
my $standard = 'get_kmer_frequencies.pl';
my @files = sort { -M $a <=> -M $b } glob 'get_kmer*';
ok my $latest = shift @files;
isnt $latest, $standard, 'Files to compare';

# Run both scripts on the same input with different output prefixes, then diff the results
my @args = qw'Covid-19_seq.fasta 3';
ok system('perl', $standard, @args, 'A') == 0, 'Make A_kmers.txt';
ok system('perl', $latest, @args, 'B') == 0, 'Make B_kmers.txt';

is `diff A_kmers.txt B_kmers.txt`, q{}, 'No differences in output';

done_testing();

Clever people do this from the start.
I did it after a bug I introduced messed up the output and I hadn't immediately noticed. What happened was that I changed the key separator to a character that appeared in some of the keys, which then split those keys apart. Oops.

NYTProf time

When you get serious about optimizing a program, you reach for profiling tools that can analyse where your code spends its time (or memory). In Perl, Devel::NYTProf comes highly recommended. I use it to collect data on the number of times each statement is called and how long it spends executing, so I can work out where to invest the effort in making the script faster, what gives the most bang for the buck.

Grab the profiler and run

perl -d:NYTProf get_kmer_frequencies.pl Covid-19_seq.fasta 3 boyd1

and open up the report, nytprof/index.html, with nytprofhtml --open to see

Calls   P   F   Exclusive Time  Inclusive Time  Subroutine
9653    1   1   31.6ms  31.6ms  main::rc_seq
25      2   1   28.1ms  59.7ms  main::process_it
82498   7   1   9.46ms  9.46ms  main::CORE:print (opcode)
3175    4   1   7.89ms  7.89ms  main::CORE:sort (opcode)

Sorting out the sort

Obviously, rc_seq is the big sub that needs attention, but what about that sort? Quickly looking at the sort on line 78, for my $i (sort keys %knucs), I see that there's no reason to sort those keys. Saved one sort and the script runs about the same. There's another sort inside a loop which can be hoisted out of the loop. Extracting that made it run 1.15 times faster!
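To make the idea concrete, here's a minimal sketch of hoisting a sort out of a loop. The data and variable names are made up for illustration; the real script's loops do more work.

use strict;
use warnings;

# Stand-in data: %knucs holds kmer counts, @records are sequence records
my %knucs   = (AAA => 3, ACG => 5, TTT => 2);
my @records = qw(rec1 rec2 rec3);

# Before: sort keys %knucs ran inside the outer loop, once per record.
# After: sort once, up front, and reuse the sorted list every time round.
my @sorted_kmers = sort keys %knucs;

for my $record (@records) {
    for my $kmer (@sorted_kmers) {
        print "$record\t$kmer\t$knucs{$kmer}\n";
    }
}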

Changing the header line (line 87) from a for loop to a join over a list is 1 or 2 percent faster.
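Something like this, again a toy example rather than the script's actual header code:

use strict;
use warnings;

my @kmers = qw(AAA AAC AAG AAT);    # stand-in column names

# for-loop version: one print call per column
# print "\t$_" for @kmers;

# join version: build the whole header once, print it once
print join("\t", '', @kmers), "\n";

The empty string at the front of the join gives the leading tab that each column used to print for itself.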

How do I print thee? Let me count the ways.

Messing about with printing in the inner loop didn't gain much, but changing the key separator from a tab "\t" (an interpolated string) to an underscore '_' (a string literal) made a 10% improvement. (It also introduced the bug noted above, because some of the keys already contained an underscore. Changing the separator to a colon made the bug go away.)

say is marginally slower than print, so use print inside the loop that gets called a lot to save maybe 10% on that call. Going from 32ms to 28ms is a small but nice gain for a one-line change.
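If you want to check that claim on your own machine, the core Benchmark module makes it easy. This is only a sanity-check sketch (printing to the null device so the terminal isn't the bottleneck); your numbers will vary.

use strict;
use warnings;
use feature 'say';
use Benchmark qw(cmpthese);
use File::Spec;

# Send output to the null device so we time the call, not the terminal
open my $null, '>', File::Spec->devnull or die "Can't open null device: $!";

cmpthese(-2, {
    print_nl => sub { print {$null} "ACG\t42\n" },
    say      => sub { say   {$null} "ACG\t42" },
});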

rc_seq - transforming the sequence

The rc_seq sub is an if-elsif block that splits a string into individual characters, translates ACGT into their complement (TGCA), reverses the array and joins it back into a string.

Being Perl, we can manipulate and reverse the string in-place. The change makes it shorter and more obvious (sometimes it runs faster). Actually, I ran this through the profiler and the sub now runs 5 times faster.
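Something along these lines, a sketch of the idea rather than necessarily the exact code now in the repo:

use strict;
use warnings;

# Reverse the string, then complement every base in one pass with tr///.
# No split, no array, no join.
sub rc_seq {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

print rc_seq('ACGTT'), "\n";    # prints AACGT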

process_it - collecting the frequencies

This sub does the work of splitting the sequence into kmers and counting them. The longest time spent here is incrementing the %knucs hash.

The second longest time is spent turning the sequence into an array of letters to create all the kmer substrings. Splitting isn't bad, but joining sets of letters back together is. Use the string function substr instead and speed that line up by 5 times.
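Roughly this change, shown here on a toy sequence rather than the FASTA input the script actually reads:

use strict;
use warnings;

my $seq = 'ACGTACGTGG';    # toy sequence
my $k   = 3;
my %knucs;

# Old approach (slower): split the sequence into an array of letters,
# then join each window of $k letters back into a string.
# New approach: pull each window straight out of the string with substr.
for my $i (0 .. length($seq) - $k) {
    $knucs{ substr $seq, $i, $k }++;
}

printf "%s\t%d\n", $_, $knucs{$_} for sort keys %knucs;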

The perl script is now marginally faster than the python script: over 20% faster for k=3, and 5% (±5%) faster for k=4. That's a decent improvement.

Like the end of a great song ... a Key change!

line 91

my @items = map { my $key = join ':', $_, $i; $knucs{$key} // 0 } @record_keys;

The script spends most of its time (55ms!!!), longer than anything else, on this line.

Assume that the problem isn't the map but constructing the key for the lookup. Change the key to a 2 dimensional lookup and see if that improves things.

WHEN you finally get it right (and remember the order in which the keys are constructed), line 92 is 2.5 times faster than before and the perl script is now 40% faster than the python script.

Keys are constructed/used on lines 79, 91, 135
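Here's the shape of the change with toy data. The variable names echo line 91 above, but which hash level holds the record and which holds the kmer is an assumption here; it has to match however the counts were stored.

use strict;
use warnings;

# Toy data: the same counts stored with a flat, joined key and as a 2-level hash
my %flat = ( 'AAA:rec1' => 4, 'ACG:rec1' => 7 );
my %deep = ( rec1 => { AAA => 4, ACG => 7 } );

my $i           = 'rec1';
my @record_keys = qw(AAA ACG TTT);

# Flat key: a new string has to be built for every single lookup
my @flat_items = map { $flat{ join ':', $_, $i } // 0 } @record_keys;

# Two-level hash: two cheap lookups, no string construction
my @deep_items = map { $deep{$i}{$_} // 0 } @record_keys;

print "@flat_items\n";    # 4 7 0
print "@deep_items\n";    # 4 7 0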

STOP!!!!

Know when to stop.

There are no more obvious or easy gains here. Any more work is likely to yield small returns. Go outside, have a life or at the least consult the relevant chart.

Well, after thinking a while, maybe constructing the output could be improved, but I'm moving on. I've exceeded my goal of making the perl script as fast as the python script and learned more about refactoring and profiling. A bit like how audiophiles use your music to listen to their equipment, I've used Jennifer's science to better understand my Perl and had fun doing it.

There's a niggling thought at the back of my mind, now that I feel I better understand the purpose of the script: could BioPerl do this even faster? I will leave that for another day. Oh, look, glycine has already done most of the hard work for me. Many thanks!

Lessons learned

In summary, these are reflections on the changes that I made in chronological order. This may be someone's first time considering performance, so I include basic rules of thumb I used along with the things I did not know before.

  • Modern Perl style adds a small amount of overhead, but the sanity it brings is a price worth paying.
  • Streamline a method of checking the output hasn't changed
  • Don't sort when order is not important
  • Calculate constant values outside of loops
  • Use built-in list functions over loops (join instead of for)
  • Interpolated strings are slower than string literals (prefer single quotes over double quotes)
  • say is slightly slower than print. Avoid it in heavy loops.
  • substr is faster than splitting and joining
  • Constructing a single composite hash key is slower than using a 2-level hash

The Human Genome is way too large. Grab the protein sequence for Caenorhabditis elegans. It takes about 5 minutes to run.

Run your frequent tests with the Covid-19 sequence. Repeated runs with anything larger take too long for rapid turnaround.

WARNING: hyperfine will run each program 20 to 40 times to get decent statistics. You won't want to wait around for a file that takes 5 minutes to process a single run.

I'll leave you with a couple of related references for further reading: chrisarg's work on parsing FastQ files fast, and a marketsplash tutorial on Perl code profiling tools.

In Conclusion

My corollary to Cunningham's Law:
Don't ask people how to make your code run faster;
Tell them their language is slow

It's taken a lot longer to reply to Jennifer's post than I'd anticipated, but right now I have the warm glow that comes from being able to say (until someone iterates on the above corollary)

... Python is SLOW!
