This is part 2 of a series on algorithm benchmarking and selection on x86_64 and AArch64 systems. You can find part 1 here. In the previous post we went through the algorithms we'll be benchmarking and broke down how some of them work, making predictions along the way about how they would stack up. Now it's time to put them to the test and see which comes out on top.
You may have noticed there was a gap in the numbering of the algorithms, between vol2.c and vol4.c. vol3.c is a dummy program provided to us without the volume scaling algorithm, so that the cost of everything except the scaling function can be measured and subtracted out. Alternatively, we can isolate the scaling function directly in code by including the C time library and timing just that loop. The in-code approach is less error prone, so that's how I'll be benchmarking the algorithms.
The first step is to increase the sample count in our header so the benchmark works with a dataset large enough to produce meaningful results. I cranked the sample count up to 1,600,000,000, then got to work inserting the timing code into each of the programs.
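Assuming the sample count is defined as the SAMPLES macro in the shared header (the file name vol.h below is illustrative; use whatever your header is actually called), the change is a one-liner:

```c
/* vol.h -- illustrative name for the shared header; SAMPLES is the macro
   the scaling loop in each vol*.c program iterates over */
#define SAMPLES 1600000000
```

Assuming 16-bit samples, that works out to about 3.2 GB for each of the input and output arrays, so the machine needs enough memory to hold both.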
For example, the timed section of vol0.c looks like this (clock() comes from the <time.h> header and measures CPU time used by the process, which suits this single-threaded, CPU-bound loop):
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
clock_t t;
t = clock();
for (x = 0; x < SAMPLES; x++) {
    out[x] = scale_sample(in[x], VOLUME);
}
t = clock() - t;
printf("Time elapsed: %f\n", ((double)t) / CLOCKS_PER_SEC);
I then ran each binary 20 times in a loop, piping the output through tee into a log file (|& also captures stderr):
for ((i = 0; i < 20; i++)); do ./vol0; done |& tee vol0output.log
After following these steps for all the programs, I had my results for the AArch64 system ready to compare. I did the same on the x86_64 machine, omitting the last two programs, since their SIMD code is written for AArch64 and won't run on that architecture. The results are as follows:
AArch64 Results
Per-run execution time, in seconds:

| Run | Vol0.c | Vol1.c | Vol2.c | Vol4.c | Vol5.c |
| --- | --- | --- | --- | --- | --- |
| 1 | 5.286 | 4.644 | 11.257 | 2.756 | 2.837 |
| 2 | 5.251 | 4.587 | 11.258 | 2.776 | 2.777 |
| 3 | 5.295 | 4.623 | 11.226 | 2.766 | 2.803 |
| 4 | 5.277 | 4.573 | 11.239 | 2.784 | 2.784 |
| 5 | 5.287 | 4.603 | 11.250 | 2.757 | 2.801 |
| 6 | 5.283 | 4.568 | 11.229 | 2.796 | 2.787 |
| 7 | 5.283 | 4.581 | 11.234 | 2.740 | 2.800 |
| 8 | 5.311 | 4.566 | 11.244 | 2.782 | 2.806 |
| 9 | 5.287 | 4.601 | 11.233 | 2.848 | 2.796 |
| 10 | 5.244 | 4.639 | 11.244 | 2.756 | 2.755 |
| 11 | 5.279 | 4.558 | 11.239 | 2.744 | 2.763 |
| 12 | 5.293 | 4.560 | 11.236 | 2.782 | 2.790 |
| 13 | 5.288 | 4.632 | 11.233 | 2.730 | 2.886 |
| 14 | 5.270 | 4.591 | 11.262 | 2.775 | 2.818 |
| 15 | 5.277 | 4.567 | 11.243 | 2.721 | 2.836 |
| 16 | 5.310 | 4.576 | 11.234 | 2.812 | 2.799 |
| 17 | 5.295 | 4.552 | 11.237 | 2.784 | 2.789 |
| 18 | 5.250 | 4.567 | 11.230 | 2.776 | 2.806 |
| 19 | 5.283 | 4.565 | 11.235 | 2.798 | 2.762 |
| 20 | 5.248 | 4.564 | 11.215 | 2.823 | 2.824 |
| Average | 5.279 | 4.585 | 11.238 | 2.775 | 2.800 |
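To ground the discussion that follows, here is a minimal sketch of what the three scalar variants do per sample (part 1 walks through the real code). These functions mirror the ideas in vol0.c, vol1.c, and vol2.c but are not the lab code verbatim, and the Q8 fixed-point format is my assumption:

```c
#include <stdint.h>

/* vol0-style: float multiply, converting int16 -> float -> int16 for
   every single sample */
static inline int16_t scale_naive(int16_t sample, float volume) {
    return (int16_t)(sample * volume);
}

/* vol1-style: convert the volume to a fixed-point integer once (Q8 here),
   then each sample needs only an integer multiply and a bit shift */
static inline int16_t scale_fixed(int16_t sample, int32_t volume_q8) {
    return (int16_t)((sample * volume_q8) >> 8); /* volume_q8 = volume * 256 */
}

/* vol2-style: precalculate a scaled result for every possible 16-bit
   input, so scaling becomes a single table lookup per sample */
static int16_t lookup[65536];

static void build_lookup(float volume) {
    for (int32_t s = -32768; s <= 32767; s++)
        lookup[(uint16_t)s] = (int16_t)(s * volume);
}

static inline int16_t scale_lookup(int16_t sample) {
    return lookup[(uint16_t)sample];
}
```

Note that the 64 Ki-entry table comes to 128 KiB of data, larger than a typical L1 data cache, which is worth keeping in mind when reading the numbers above.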
It looks like these results more or less confirm what we predicted: the two fastest versions are the ones that took advantage of SIMD to process multiple samples per instruction. Vol2.c was far behind in execution time, at a whopping average of 11.238 seconds per run, over double the next slowest algorithm. This shows that precalculating a table of results can actually cost compute time when the cache can't serve lookups faster than the processor's math unit can simply redo the multiplication. The naïve approach in Vol0.c, multiplying each sample by a scale factor with multiple type conversions along the way, somewhat unsurprisingly comes in second slowest, and avoiding those conversions by bit shifting in Vol1.c yields a slightly faster runtime.
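For completeness, the heart of the SIMD approach looks roughly like the sketch below. This is a hedged reconstruction of the intrinsics idea in vol5.c, not the lab file verbatim; the function name, the Q15 volume encoding, and the assumption that the sample count is a multiple of eight are mine.

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Scale eight 16-bit samples per iteration with NEON. vqrdmulhq_s16 is a
   saturating rounding doubling multiply returning the high half, so with
   the volume encoded as a Q15 constant the result is effectively
   sample * volume. Assumes `samples` is a multiple of 8 (any tail would
   need a scalar cleanup loop). */
void scale_neon(int16_t *out, const int16_t *in, size_t samples, float volume) {
    int16x8_t vol = vdupq_n_s16((int16_t)(volume * 32767.0f)); /* Q15 volume */
    for (size_t i = 0; i < samples; i += 8) {
        vst1q_s16(out + i, vqrdmulhq_s16(vld1q_s16(in + i), vol));
    }
}
```

Now, on to the x86_64 results: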
x86_64 Results
Per-run execution time, in seconds:

| Run | Vol0.c | Vol1.c | Vol2.c |
| --- | --- | --- | --- |
| 1 | 2.910 | 2.755 | 3.574 |
| 2 | 2.849 | 2.762 | 3.552 |
| 3 | 2.764 | 2.747 | 3.543 |
| 4 | 2.753 | 2.739 | 3.502 |
| 5 | 2.763 | 2.771 | 3.497 |
| 6 | 2.761 | 2.739 | 3.503 |
| 7 | 2.770 | 2.774 | 3.527 |
| 8 | 2.770 | 2.770 | 3.507 |
| 9 | 2.782 | 2.751 | 3.500 |
| 10 | 2.752 | 2.763 | 3.496 |
| 11 | 2.765 | 2.757 | 3.530 |
| 12 | 2.753 | 2.757 | 3.501 |
| 13 | 2.776 | 2.759 | 3.515 |
| 14 | 2.771 | 2.758 | 3.527 |
| 15 | 2.768 | 2.761 | 3.500 |
| 16 | 2.758 | 2.777 | 3.518 |
| 17 | 2.783 | 2.749 | 3.499 |
| 18 | 2.764 | 2.747 | 3.496 |
| 19 | 2.772 | 2.752 | 3.504 |
| 20 | 2.777 | 2.756 | 3.502 |
| Average | 2.778 | 2.757 | 3.514 |
The execution times on x86_64 tell a similar story, though with a few interesting distinctions. First, the type conversions that set Vol0.c back so much in the AArch64 benchmarks have far less impact here: Vol0.c and Vol1.c share almost exactly the same runtime, although working in a single integer type and bit shifting still shaved off a few milliseconds. Also of note is that Vol2.c doesn't incur anything like the massive penalty seen on its AArch64 counterpart. That suggests this machine's cache hierarchy can serve the table lookups quickly enough to stay much closer to the speed of the math unit.
In conclusion, this was an eye-opening exercise that confirmed what I knew about the advantages of SIMD while providing concrete numbers for just how fast it is compared to scalar processing. We also saw how important it is to know the machine you're optimizing for intimately, to account for differences like the lookup-table algorithm's behavior on the AArch64 machine versus the x86_64 one. Doing so can inform your programming decisions and help you avoid costly assumptions that, in this case, could mean more than doubling your runtime.