DEV Community

Tecca Yu


Algorithm Selection LAB 5 in x86_64 and AArch64 - part 2

Hi, this is Tecca. This post is a continuation of the last post, comparing the relative performance of the same volume-scaling algorithms on one machine across implementations for AArch64 and x86_64 systems.

In this post I will present the performance results for each algorithm mentioned in the last post, and go over some questions related to the algorithms.

We want to measure the performance of each algorithm specifically, with nothing else in the way, so that the elapsed time reflects only the algorithm itself.

To do that, we need to make a small modification to all the files.

#include <time.h>

        clock_t         start_t, end_t;

        start_t = clock();

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x] = scale_sample(in[x], VOLUME);
        }

        end_t = clock();
        printf("Time elapsed: %f\n", ((double)(end_t - start_t))/CLOCKS_PER_SEC);

The rest of the code goes below, unchanged.

The key point is to wrap only the for loop that performs the scaling of the samples, so the measurement reflects the algorithm itself as accurately as possible.

Test conditions

  • Each file was compiled with gcc -g -O2 -fopt-info-vec-all -fno-lto -fno-tree-vectorize -fno-tree-loop-vectorize.
  • SAMPLES was defined as 1,500,000,000 for each algorithm under test.
  • Each algorithm/file was run 15 times.

Example running vol0 in x86_64 systems

Benchmarking results running on x86_64 systems (times in seconds)

Run       vol0        vol1          vol2
1         2.696961    2.655414      3.370985
2         2.666739    2.649839      3.302379
3         2.671609    2.648382      3.30752
4         2.664414    2.653646      3.327133
5         2.660572    2.679785      3.285885
6         2.671767    2.639657      3.302924
7         2.685051    2.633608      3.396682
8         2.674105    2.663188      3.352611
9         2.675021    2.663653      3.331424
10        2.677584    2.651484      3.291357
11        2.655889    2.653947      3.314665
12        2.659099    2.634871      3.287128
13        2.591898    2.66246       3.283349
14        2.645027    2.662764      3.327623
15        2.582574    2.662238      3.298565
average   2.658554    2.654329067   3.318682
median    2.666739    2.653947      3.30752

In the x86_64 system the vol1 algorithm is the fastest, vol0 comes second, and vol2 is the slowest of the three. This matches what I expected in the previous post: vol2 pre-calculates all results and then looks up the answer for each input value, which costs more than computing each sample directly, while vol1 beats vol0 by a little by using fixed-point calculation, which avoids the cost of repeatedly casting between integer and floating point.

We test the vol4 and vol5 algorithms only on the AArch64 system, because they use SIMD instructions written specifically for AArch64.

Benchmarking results running on AArch64 systems (times in seconds)

Run       vol0          vol1          vol2         vol4        vol5
1         4.980222      4.313869      10.567804    3.481179    4.04965
2         4.968114      4.323427      10.560218    3.456832    4.045124
3         4.959307      4.30239       10.59247     3.460554    4.045296
4         4.973315      4.325246      10.590718    3.469896    4.028074
5         4.961954      4.31275       10.585899    3.493608    4.039965
6         4.976137      4.320063      10.532851    3.438831    4.040304
7         4.959057      4.349743      10.642468    3.482111    4.055758
8         4.960718      4.317437      10.534451    3.469382    4.150955
9         4.95485       4.336698      10.548297    3.451517    4.085376
10        4.960642      4.329521      10.552774    3.455712    4.022335
11        4.952321      4.332141      10.550225    3.384459    4.041784
12        4.990002      4.334904      10.572559    3.423148    4.058293
13        4.942643      4.326342      10.545258    3.420771    4.096189
14        4.957297      4.317604      10.548684    3.391878    4.020974
15        4.967439      4.317819      10.558838    3.433111    4.062904
average   4.964267867   4.323996933   10.5655676   3.4475326   4.056198733
median    4.960718      4.323427      10.558838    3.455712    4.045296

Looking at the results from running vol0, vol1, vol2, vol4, and vol5:

In the previous post, we assumed the algorithms that use SIMD instructions (vol4 and vol5) would perform faster than the others. We can observe that vol4 and vol5 definitely outperform the rest. The two come fairly close to each other, with the inline assembly version slightly ahead; I believe this is because inline assembly and compiler intrinsics generate almost equally fast code.

vol1 still has a better performance than vol0 in AArch64 as well.

The vol2 algorithm became significantly slower than the other algorithms. This result likely means the CPU spends much of its time on memory reads when looking up the pre-calculated values.

QUESTIONS SPECIFIC TO ALGORITHMS

Q: Why is this code needed?

        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl + out[x]) % 1000;
        }

        printf("Result: %d\n", ttl);

This code folds all of the scaled samples into a single checksum (the running sum of the outputs, modulo 1000) and prints it. Because the output values are actually used, the compiler cannot optimize the scaling loop away, and the printed value lets us verify that every implementation produces the same results.

Q: What does this next block do? Why?

ifeq ($(shell echo | gcc -E -dM - | grep -c aarch64),1)
        BINARIES:=vol0 vol1 vol2 vol3 vol4 vol5
else
        BINARIES:=vol0 vol1 vol2 vol3
endif

This block in the Makefile detects which architecture the user is building on (by checking whether gcc's predefined macros mention aarch64) and sets "BINARIES" accordingly.
Since vol4.c and vol5.c are designed to run only on AArch64, it is important to remove them from BINARIES when the user is not on an AArch64 system, so that they are not built and do not cause errors.

Q: What is the purpose of the cast to uint16_t in the next line?

precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);

The loop variable x is a signed value covering every possible int16_t sample. The cast to uint16_t turns negative values into indices in the upper half of the 0–65535 range (for example, -1 becomes 65535), so every possible sample maps to a valid, non-negative slot in the lookup table.

Q: What's the point of this dummy program? How does it help with benchmarking?

The dummy program (vol3.c) does everything except the scaling itself: it creates the samples and counts the results, but applies no algorithm. This makes it a baseline processing time against which the other files, each containing a different algorithm, can be compared.

Q: what is the purpose of these next two lines?

in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

These lines prepare the int16_t pointers: in_cursor and out_cursor point at the start of the in and out arrays respectively, and limit points just past the end of the input (in + SAMPLES) so the processing loop knows when to stop.

Q: Are the results usable? Are they accurate?

printf("Result: %d\n", ttl);

The results for vol4 and vol5 are the same, but when comparing them to the naïve implementation the results vary by a lot, so I'm not too sure about the accuracy; the differences may come from rounding in the fixed-point SIMD math.

Q: Why is the increment below 8 instead of 16 or some other value?

Q: Why is this line not needed in the inline assembler version of this program?

 in_cursor += 8;
 out_cursor += 8;

The cursors are incremented by 8 because each SIMD operation processes eight samples at once: the 128-bit vector registers hold eight 16-bit lanes, and since in_cursor and out_cursor are int16_t pointers, += 8 advances them by exactly one full vector (16 bytes).

In the inline assembler version, the load and store instructions most likely use post-indexed addressing, which updates the pointer register automatically as part of the instruction. In the C version we have to advance the cursors explicitly.

Q: what does it mean to "duplicate" values in the next line?

__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

dup takes the scalar scaling factor vol_int and copies ("duplicates") it into each of the eight 16-bit lanes of the 128-bit register v1, so that a single vector multiply can scale eight samples by the same factor at once.

Conclusion

In the end, measuring the performance of each algorithm tested the assumptions made in the part-1 post of this lab.
As expected, the algorithms that use SIMD instructions (vol4 and vol5 in this case) outperform the others, since they are best at processing multiple data elements simultaneously.
