
Discussion on: Branchless programming. Does it really matter?

Pierre Gradot

Hello

> In the above program, until the if statement is evaluated, the program follows a single control path. After evaluating the condition "a<b", the program splits into two branches based on the result.

> The above translates to machine code as follows.

The assembly you show is for -O0. If you compile with -O1 instead, you will get:

```asm
main:
        mov     r0, #10
        bx      lr
```
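
For context, here is a minimal source that produces exactly that output. This is my reconstruction, not the article's exact snippet, and the values 5 and 10 are assumptions chosen to match `mov r0, #10`:

```c
int main(void) {
    int a = 5, b = 10;   /* hypothetical values; the article's exact code may differ */
    int max;

    if (a < b)           /* the branch the article analyzes */
        max = b;
    else
        max = a;

    return max;          /* at -O1, GCC folds all of this at compile time: the branch is gone */
}
```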

We NEVER want to analyze code compiled with -O0 when it comes to optimization. Why? Because at -O0 the compiler doesn't even try to optimize, so the generated code tells us nothing about what a real build looks like.

eg. 1) Maximum of 2 numbers

Looking at the various implementations, version 1 works very well: see godbolt.org/z/sj6xzM

Much better than the ugly version 2.

I am not familiar enough with the cost of each opcode to say whether version 3 is faster than version 1.
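
For readers who don't open the link, the shapes in play are roughly these. This is my reconstruction; the article's exact code may differ:

```c
/* Version 1: the straightforward form. Compilers recognize this
 * pattern and typically emit a branchless max themselves. */
int max_plain(int a, int b) {
    return (a > b) ? a : b;
}

/* A typical "branchless" bit-trick rewrite (the shape of versions 2/3) */
int max_tricky(int a, int b) {
    int diff = a - b;                  /* beware: can overflow for extreme inputs */
    return a - (diff & (diff >> 31));  /* keeps a when diff >= 0, yields b otherwise */
}
```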

eg. 3) Accessing array element

This time, the so-called "optimized version" is clearly less optimized! See godbolt.org/z/h3dhT7

The modulo operator calls `__aeabi_idivmod`, and we all know that division costs a lot!

We can also see that there is no conditional in the first version...
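
The two shapes, as I read them. Again a reconstruction: I'm assuming a runtime array size `n`, which is what forces the library call; the article's actual code may differ:

```c
/* "Plain" version: wrap the index with a comparison
 * (assumes 0 <= i < 2*n, as in a typical ring-buffer access) */
int get_plain(const int *arr, int i, int n) {
    if (i >= n)
        i -= n;            /* ARM can predicate this instruction: no actual branch needed */
    return arr[i];
}

/* "Branchless" version: wrap with modulo */
int get_mod(const int *arr, int i, int n) {
    return arr[i % n];     /* on cores without hardware divide, this calls __aeabi_idivmod */
}
```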

eg. 2) Absolute Value

See godbolt.org/z/d13jco

Again, there is no conditional in the first version. The second version doesn't seem to be faster, but the source code is clearly less readable.
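
The shapes here, reconstructed once more:

```c
/* Plain version: compilers recognize this as an absolute value */
int abs_plain(int x) {
    return (x < 0) ? -x : x;
}

/* The bit-trick rewrite (the shape of the article's second version) */
int abs_tricky(int x) {
    int mask = x >> 31;        /* all ones if x is negative, zero otherwise */
    return (x + mask) ^ mask;  /* two's-complement negation when mask is -1 */
}
```

Note that both versions trigger signed overflow (hence undefined behavior) for INT_MIN, one more reason to keep the readable form and let the compiler pick the instructions.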

This leads me to your conclusion:

> Modern compilers have built-in optimizers which can recognize some patterns and replace them with branch-less counterparts. But these are limited, so it's always better to write optimized code.

This was true 10 or 15 years ago.

I believe this is wrong today.

Modern compilers will optimize code much better than you. People spend their lives (often because it's their job) writing optimizers. They know many more tricks than you.

You have to understand how optimizers in compilers work: they look for common patterns. When they find one, they optimize your code. Take the max function, for instance: the "normal" version will be recognized for sure and optimized aggressively. With the other versions, compilers don't recognize anything. You try to be smart, but in the end you fail because you don't let the compiler do its job properly.

Does this mean we shouldn't care how we write code?

Absolutely not.

But we have to understand where we can be smart and where the compiler will be smarter.

We should write clear, compiler-friendly code. We should choose the correct containers (e.g., linked list vs. array list). We should find a good architecture. We should choose the right algorithm (e.g., not all sorting algorithms are the same).

Then we let the compiler work.

If performance is not good, then we use a profiler to see where we must optimize. We never optimize ahead of time with code that pretends to be "smart". We must understand why a particular piece of code, identified by the profiler, is slow. We optimize that piece only. Then we profile again.

Remember: premature optimization is the root of all evil.

Pierre Gradot

I tried changing the compiler to "x86-64 clang (trunk)": the results are quite different for some snippets.

This confirms something important: don't assume, measure, try, measure again.
