Imaculate

Posted on

# The tale of the triathlete and C++ performance

Let's solve the problem of the triathlete who wants to get a rough idea of their training efforts using C++. Given that running is twice as hard as biking and swimming is four times as hard as running, she wants the weighted sum of all her activities' provided as a list where an activity is made of distance and type. There is enough clarity to the problem so lets get to it.

### V1

The most straight forward solution looks like the function below:

``````double calculateTriathleteEffortV1(vector<int> distances, vector<string> types)
{
double result = 0.0;
for (int i = 0; i < distances.size(); i++)
{
int distance = distances[i];
string type = types[i];
if (type == "swim")
{
result += (distance * 8);
}
else if (type == "run")
{
result += (distance * 2);
}
else if (type == "bike")
{
result += distance;
}
}
return result;
}
``````

It sums up the activities' distances weighted by respective factors. It works just fine and runs in an average of 0.44 microseconds per run over 10 million runs. If we have to calculate this for billions of athletes, it will not scale.

### V2: Bit Manipulation

An easy way to optimize this is by using replacing multiplication with bit shift operations. This is possible because the factors are multiples of 2 and bit shifts are generally faster than arithmetic operations.

``````double calculateTriathleteEffortV2(vector<int> distances, vector<string> types)
{
double result = 0.0;
for (int i = 0; i < distances.size(); i++)
{
int distance = distances[i];
string type = types[i];
if (type == "swim")
{
result += (distance << 3);
}
else if (type == "run")
{
result += (distance << 1);
}
else if (type == "bike")
{
result += distance;
}
}
return result;
}
``````

The V2 function above runs in 0.47 microseconds. Surprisingly this 'optimization' has made things worse. The explanation lies somewhere in the assembly output. The compile command below saves the intermediate files including the assembly file which we should examine.

``````g++ -save-temps -fverbose-asm test_athlete.cpp -o test_athlete.exe
``````

The instructions from V1 and V2 are practically the same so it looks like the compiler does a better job at this optimization than us.

`````` # test_athlete.cpp:29:             result += (distance * 8);
movl    -64(%rbp), %eax  # distance, tmp112
sall    \$3, %eax     #, _7
# test_athlete.cpp:29:             result += (distance * 8);
cvtsi2sd    %eax, %xmm0  # _7, _8
movsd   -56(%rbp), %xmm1     # result, tmp114
addsd   %xmm1, %xmm0     # tmp114, tmp113
movsd   %xmm0, -56(%rbp)     # tmp113, result
jmp .L9  #

.
.
.

# test_athlete.cpp:52:             result += (distance << 3);
movl    -64(%rbp), %eax  # distance, tmp112
sall    \$3, %eax     #, _7
# test_athlete.cpp:52:             result += (distance << 3);
cvtsi2sd    %eax, %xmm0  # _7, _8
movsd   -56(%rbp), %xmm1     # result, tmp114
addsd   %xmm1, %xmm0     # tmp114, tmp113
movsd   %xmm0, -56(%rbp)     # tmp113, result
jmp .L16     #
``````

We are therefore better off with multiplication so we have to look for the gains elsewhere.

### V3: Branches

V2 branches between activity types using if-else-if statements where each if statement is processed in the arbitrary order we have set (swim -> run -> bike) even though each case is independent of the others. Using switch case statements allows the compiler to generate jump tables which are faster. We will attempt that and to do that we have to change the data type of activity type to `enum` because strings can't be used in switch-case statements.

``````typedef enum
{
RUN,
BIKE,
SWIM,
} Type;

.
.
.

double calculateTriathleteEffortV3(vector<int> distances, vector<Type> types)
{
double result = 0.0;
for (int i = 0; i < distances.size(); i++)
{
int distance = distances[i];
Type type = types[i];
switch (type)
{
case SWIM:
result += (distance * 8);
break;
case RUN:
result += (distance * 2);
break;
case BIKE:
result += distance;
break;
}
}
return result;
}
``````

This not only eliminates expensive if-else-if jumps also but string comparison of activity types and the results do speak for themselves. V3 runs in an average of 0.27 microseconds per run, almost twice as fast!
This is definitely an optimization we want to keep. Can we do better? Probably, lets try eliminating jumps altogether.

### V3: Branches

We can eliminate branches by using a look up table to translate activity types to multiplication factors.

``````typedef enum
{
RUN,
BIKE,
SWIM,
} Type;

int factors[] = {2, 1, 8};

.
.
.

double calculateTriathleteEffortV4(vector<int> distances, vector<Type> types)
{
double  result = 0;
for (int i = 0; i < distances.size(); i++)
{
int distance = distances[i];
Type type = types[i];
result += (distance * factors[type]);

}
return result;
}
``````

V4 runs in an average of 0.25 microseconds per run, marginally better than V3. The cost of jumps is replaced by array access which is only borderline faster. Although marginal, it is still faster, we'll therefore carry it forward as we continue to look for more performance opportunities. We have gone as far as we can to optimize the algorithm, lets try tweaking the data structures.

### V5: Constant vector references

We didn't give a lot of thought into the function API, it accepts vectors which are passed by value even though they are only read in the function. We can correct that by
a) Passing the vectors by reference; this should shave some time that is spent copying all elements.
b) Marking the vectors as constant to ensure the function doesn't change the original vectors.

``````double calculateTriathleteEffortV5(const vector<int>& distances, const vector<Type>& types)
{
double  result = 0;
for (int i = 0; i < distances.size(); i++)
{
int distance = distances[i];
Type type = types[i];
switch (type)
{
case SWIM:
result += (distance << 3);
break;
case RUN:
result += (distance << 1);
break;
case BIKE:
result += distance;
break;
}
}
return result;
}
``````

V5 runs in an average of 0.025 microseconds per run, almost 10x faster than V4!!! It might be hard to hard to top that but lets try changing the parameters from vectors to static arrays.

### V6: Array parameters

There is no need to pass dynamically sized vectors when the size never changes. Vector elements reside in the heap which makes access slower. Alternatively accessing static arrays elements is faster because they are on the stack. Arrays are passed to functions as pointers therefore we need not worry about copying on passing by value.

``````double calculateTriathleteEffortV6(int distances[], Type types[], int size)
{
double result = 0;
for (int i = 0; i < size; i++)
{
int distance = distances[i];
Type type = types[i];
switch (type)
{
case SWIM:
result += (distance << 3);
break;
case RUN:
result += (distance << 1);
break;
case BIKE:
result += distance;
break;
}
}
return result;
}
``````

V6 above runs in an average of 0.012 microseconds per run, almost twice as fast V5 and significantly faster than V1. We can probably do better but we'll stop here. The table summarizes the gains (or losses) each version brought,

Function Time (us) Delta from previous (%) Delta from V1 (%)
V1 0.44
V2 0.47 -6.38 -6.38
V3 0.27 74.07 62.96
V4 0.25 8 76
V5 0.025 900 1660
V6 0.012 108.33 3567

We have witnessed a speedup of almost 3567% in our function, applying very practical tweaks some of which didn't work. It is therefore very important to measure if attempted optimizations are working. The journey to performance doesn't end here, there is probably more ways to make it run faster, and a lot more ways to reduce the memory footprint. Benchmarks and profiling should guide these decisions. In the meantime, our triathletes will definitely appreciate the runtime improvements.