abhinav the builder

Posted on Oct 15, 2020

From Source to Binaries: The journey of a C++ program

#cpp #programming #compilers

If you couldn't already tell, I love Bjarne, and by extension, I love C++. In this article, I go over how a C++ compiler turns a program into a binary, and why I love C++. I'm writing this partly to understand how C++ works myself; documenting things helps me understand them.

There are only two kinds of languages: the ones people complain about and the ones nobody uses.
― Bjarne Stroustrup, The C++ Programming Language

I got pretty inspired by HaoranWang's CRUST and wanted to write my own compiler for C. I'll probably stick to Rust for that. Also, shoutout to ShivyC, that's where I got this idea from.

Let's get started with the compilation pipeline!

What you see above is the compilation flow taken from NerdyElectronics.com.

For the purpose of this article, we will use a simple addition problem with predefined values.

//a.cpp program
#include <iostream>
using namespace std;

int main()
{
    int firstNumber = 2, secondNumber = 4, sumOfTwoNumbers;

    // sum of the two numbers is stored in variable sumOfTwoNumbers
    sumOfTwoNumbers = firstNumber + secondNumber;

    // Prints the sum
    cout << firstNumber << " + " << secondNumber << " = " << sumOfTwoNumbers;

    return 0;
}

If you go back to the diagram, you can see we are currently at the preprocessing stage. Let's have a quick look at the translation unit. A translation unit is what the compiler proper actually receives: your source file after the preprocessor has pasted in the included header files and expanded the macros.

You can dump your translation unit using the following command:

g++ <filename>.cpp -E

The dump looks something like this

# 1 "a.cpp"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "a.cpp"
# 1 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\iostream" 1 3
# 36 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\iostream" 3

# 37 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\iostream" 3

# 1 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\mingw32\\bits\\c++config.h" 1 3
# 236 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\mingw32\\bits\\c++config.h" 3

# 236 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\mingw32\\bits\\c++config.h" 3
namespace std
{
  typedef unsigned int size_t;
  typedef int ptrdiff_t;

It's too damn long to post here (because it literally pastes in the iostream header and everything it includes, which runs to thousands of lines of code), but run it on your system if you're curious.
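If you want to see macro expansion in something shorter than the iostream dump, here is a minimal sketch. The file name macro_demo.cpp and the names SQUARE, UNSAFE_SQUARE, squareOfSum and unsafeSquareOfSum are made up for illustration; they are not part of the a.cpp example. Running g++ with -E on a file like this should show the macro calls replaced by their bodies, as noted in the comments.

```cpp
// macro_demo.cpp (hypothetical file, not part of the original example)
#include <cassert>

// Function-like macro: the preprocessor substitutes it as plain text.
#define SQUARE(x) ((x) * (x))

// Same idea without the protective parentheses, to show why they matter.
#define UNSAFE_SQUARE(x) (x * x)

int squareOfSum(int a, int b)
{
    // After preprocessing this reads: return ((a + b) * (a + b));
    return SQUARE(a + b);
}

int unsafeSquareOfSum(int a, int b)
{
    // After preprocessing this reads: return (a + b * a + b);
    return UNSAFE_SQUARE(a + b);
}

// Sanity check using the numbers from a.cpp (2 and 4).
void checkMacroExpansion()
{
    assert(squareOfSum(2, 4) == 36);       // (2 + 4) * (2 + 4)
    assert(unsafeSquareOfSum(2, 4) == 14); // 2 + 4 * 2 + 4
}
```

The compiler never sees SQUARE at all. By the time it starts parsing, only the substituted text is left, which is why the unparenthesized version quietly computes something else.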

Assembly Code

Detouring for a bit: there's this rumor on the grapevine that the closer you are to the hardware, the faster your program will be. While there is a modicum of truth to this, "slower" languages like Python are often slow because they're interpreted, or are memory hogs due to dynamic typing. There are plenty of Python-to-C/C++ compilers, and plenty of projects that help you make Python faster. Don't, for the love of God, develop something in a certain language just because it is "closer to the hardware".

Anyway, now run this on the a.cpp file we had

g++ a.cpp -S

Now you'll have a file a.s with something like this

    .file   "a.cpp"
    .text
    .section .rdata,"dr"
__ZStL19piecewise_construct:
    .space 1
.lcomm __ZStL8__ioinit,1,1
    .def    ___main;    .scl    2;  .type   32; .endef
LC0:
    .ascii " + \0"
LC1:
    .ascii " = \0"
    .text
    .globl  _main
    .def    _main;  .scl    2;  .type   32; .endef
_main:
LFB1502:
    .cfi_startproc
    leal    4(%esp), %ecx
    .cfi_def_cfa 1, 0
    andl    $-16, %esp
    pushl   -4(%ecx)
    pushl   %ebp
    .cfi_escape 0x10,0x5,0x2,0x75,0
    movl    %esp, %ebp
    pushl   %ecx
    .cfi_escape 0xf,0x3,0x75,0x7c,0x6
    subl    $36, %esp
    call    ___main
    movl    $2, -12(%ebp)
    movl    $4, -16(%ebp)
    movl    -12(%ebp), %edx
    movl    -16(%ebp), %eax
    addl    %edx, %eax
    movl    %eax, -20(%ebp)
    movl    -12(%ebp), %eax
    movl    %eax, (%esp)
    movl    $__ZSt4cout, %ecx
    call    __ZNSolsEi
    subl    $4, %esp
    movl    $LC0, 4(%esp)
    movl    %eax, (%esp)
    call    __ZStlsISt11char_

Again, too damn long, try it out on your own system! The assembly is generated for your target architecture. Find out what your system's architecture is as an exercise! Each architecture has a different instruction set that its processor understands, so your compiler splits the job into two steps:

Create an Abstract Syntax Tree
Generate architecture dependent instructions

Let's go over what that means.
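For the second step, it helps to know which architecture your compiler is targeting. The assembly listing above uses 32-bit registers like %esp and %ebp, which points to a 32-bit x86 target. Here is a minimal sketch for checking your own target from inside a program. It assumes the predefined macros that GCC, Clang and MSVC usually provide; the function names targetArchitecture and pointerBits are mine, and the list of architectures is far from complete.

```cpp
#include <cassert>
#include <climits>
#include <cstring>

// Returns a label for the architecture this code was compiled for.
// The macros are predefined by the compiler, not by any header:
// __x86_64__, __i386__, __aarch64__, __arm__ on GCC/Clang and
// _M_X64, _M_IX86, _M_ARM64, _M_ARM on MSVC.
const char* targetArchitecture()
{
#if defined(__x86_64__) || defined(_M_X64)
    return "x86-64";
#elif defined(__i386__) || defined(_M_IX86)
    return "x86 (32-bit)";
#elif defined(__aarch64__) || defined(_M_ARM64)
    return "ARM64";
#elif defined(__arm__) || defined(_M_ARM)
    return "ARM (32-bit)";
#else
    return "unknown";
#endif
}

// Width of a pointer in bits on the target, e.g. 32 or 64.
int pointerBits()
{
    return static_cast<int>(sizeof(void*) * CHAR_BIT);
}
```

Print both values from main and compare them with the registers you see in your own a.s file. The same a.cpp will produce different instructions on each of these targets.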

Abstract Syntax Trees

An abstract syntax tree (AST) is, well, abstracted away from the target architecture. However, that's not where the "abstract" part of the term comes from. According to Wikipedia, abstract refers to the fact that "it does not refer to every detail appearing in the real syntax, but rather just structural or content related details". ASTs are generated after syntax analysis, and an AST can be built for any program that parses successfully. For our code, this is what the AST looks like

Here's how to do it yourself

g++ -fdump-tree-all-graph a.cpp -o a
dot -Tpng a.cpp.013t.cfg.dot -o a.png

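This is built using GraphViz, so install it for your command line. You can also copy-paste the contents of a.cpp.013t.cfg.dot into any online GraphViz visualizer. The number in the file name (013t here) depends on your GCC version, so check which .dot files were actually generated. Strictly speaking, the .cfg dump is the control flow graph GCC builds from its intermediate representation rather than the raw AST, but it still gives a good picture of how the compiler sees the program.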
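To get a feel for what "just structural details" means, here is a toy AST for the expression firstNumber + secondNumber from a.cpp. This is only a sketch under my own assumptions: the Expr type and the makeNumber, makeAdd, evaluate and toString helpers are invented for illustration, and GCC's real internal representation is far richer than this.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// A node is either a number literal or the addition of two sub-expressions.
// Whitespace, semicolons and parentheses from the source are not stored.
struct Expr {
    enum class Kind { Number, Add };
    Kind kind = Kind::Number;
    int value = 0;                // meaningful when kind == Number
    std::unique_ptr<Expr> left;   // meaningful when kind == Add
    std::unique_ptr<Expr> right;  // meaningful when kind == Add
};

std::unique_ptr<Expr> makeNumber(int value)
{
    std::unique_ptr<Expr> node(new Expr());
    node->kind = Expr::Kind::Number;
    node->value = value;
    return node;
}

std::unique_ptr<Expr> makeAdd(std::unique_ptr<Expr> left, std::unique_ptr<Expr> right)
{
    std::unique_ptr<Expr> node(new Expr());
    node->kind = Expr::Kind::Add;
    node->left = std::move(left);
    node->right = std::move(right);
    return node;
}

// Walks the tree bottom-up and computes the value of the expression.
int evaluate(const Expr& node)
{
    if (node.kind == Expr::Kind::Number) {
        return node.value;
    }
    assert(node.left && node.right); // an Add node needs both children
    return evaluate(*node.left) + evaluate(*node.right);
}

// Prints the tree back as text, adding parentheses around every addition.
std::string toString(const Expr& node)
{
    if (node.kind == Expr::Kind::Number) {
        return std::to_string(node.value);
    }
    assert(node.left && node.right);
    return "(" + toString(*node.left) + " + " + toString(*node.right) + ")";
}
```

Building makeAdd(makeNumber(2), makeNumber(4)) and passing it to evaluate gives 6, the same sum a.cpp prints. Nothing in the tree mentions registers or instructions, which is the architecture-independent part.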

Object File and Linking

An object file holds object code, which is essentially machine code (or some intermediate code). It is the "object" of the compiling process, as you can see in this classic article. The reason I didn't use that fancy "Phases of Compiling Process" chart is that it kind of abstracts away the real process of compilation. In due time, we will talk about that too. Before we go ahead, create your object (.o) file using this:

g++ a.cpp -c

Now, let's look at linking, which happens after you create your object files. The object files are linked together to create a single executable file. To show this, let me divide the program into a header and a main CPP file.

//a.h
#include <stdio.h>
void printLinker()
{
    printf("Hello World");
}

Now, let's call that from another file, a.cpp

#include "a.h"
int main()
{
    printLinker();
    return 0;
}

Finally, to show the linking, let's create another source file

//We will name this a2.cpp
void printLinker();
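That single line in a2.cpp is a declaration: it promises the compiler that printLinker exists somewhere, and leaves it to the linker to find the definition. Here is a single-file sketch of the same idea. The names linkerGreeting and callGreeting are made up, and I return the text instead of printing it so that it is easy to check.

```cpp
#include <cassert>
#include <string>

// Declaration only, like the line in a2.cpp: the name, the parameters and
// the return type, but no body.
std::string linkerGreeting();

// This caller compiles fine with only the declaration in sight.
// The compiler emits a call to a symbol it has not seen defined yet.
std::string callGreeting()
{
    std::string text = linkerGreeting();
    assert(!text.empty());
    return text;
}

// The definition. In a multi-file program this could live in a different
// .cpp file, and the linker would connect the call above to it.
std::string linkerGreeting()
{
    return "Hello World";
}
```

If you delete the definition at the bottom and build a program that calls callGreeting, each file still compiles, and the build only fails at the link step with an undefined reference error.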

Compiling and running a.cpp on its own would give me a Hello World, as expected. But we need to see the linking, right?

g++ a.h -c
g++ a.cpp -c
g++ a2.cpp -c

Now we have two object files (a.o and a2.o) and a.h.gch, a precompiled header (if this is not found, the compiler falls back to the plain header). Let's link!

g++ a.o a2.o -o a2.exe

Nice, you see how we just passed in the two object files and linked them? Now we just need to run a2.exe, and it should print Hello World.

./a2.exe
Hello World

Perfect. If you want to see what lies inside these files, use the nm tool.

nm a.o

You get the following

00000000 b .bss
00000000 d .data
00000000 r .eh_frame
         U ___main
00000015 T _main
         U _printf
00000000 r .rdata
00000000 r .rdata$zzz
00000000 t .text
00000000 T __Z7print_av

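You can do the same for the other object file! The listing is well sectioned: T marks a global symbol defined in the text (code) section, U marks an undefined symbol that the linker still has to resolve (like _printf), and the lowercase letters mark local section symbols. The mangled name __Z7print_av decodes to print_a(), so this listing seems to come from a run where the function had that name; with the code above you would see printLinker in its place.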
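Names such as __Z7print_av in the nm output are mangled C++ symbols. Here is a small sketch for decoding them yourself. It assumes GCC or Clang, because it relies on abi::__cxa_demangle from <cxxabi.h>, which is not standard C++. The helper names demangle and stripLeadingUnderscore are mine, and the underscore stripping is a rough heuristic for the extra prefix that 32-bit MinGW puts on symbols.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Turns a mangled symbol such as _Z7print_av into readable C++.
// Returns the input unchanged when it is not a valid mangled name.
std::string demangle(const std::string& mangled)
{
    int status = 0;
    // __cxa_demangle returns a malloc'ed buffer, so it is released with free.
    std::unique_ptr<char, void (*)(void*)> text(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        std::free);
    if (status == 0 && text) {
        return std::string(text.get());
    }
    return mangled;
}

// Heuristic: 32-bit MinGW shows symbols with one extra leading underscore
// (__Z7print_av instead of _Z7print_av), so drop it when there are two.
std::string stripLeadingUnderscore(const std::string& symbol)
{
    if (symbol.size() > 1 && symbol[0] == '_' && symbol[1] == '_') {
        return symbol.substr(1);
    }
    return symbol;
}

// Decodes the symbol taken from the nm listing.
void checkSymbolFromListing()
{
    assert(demangle(stripLeadingUnderscore("__Z7print_av")) == "print_a()");
}
```

The 7 in the symbol is just the length of the name print_a, and the trailing v says the function takes no arguments. If you have binutils installed, the c++filt tool does the same job from the command line.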

Whew, that was a lot, but that's how a compiler compiles. Let me just conclude real quick with the four stages we went through:

Preprocessing
Compilation
Assembly
Linking

That's about it, folks! See you around in part 2, which I will update here.
