From Source to Binaries: The journey of a C++ program

Abhinav Srivastava ・5 min read

If you couldn't already tell, I love Bjarne, and by extension, I love C++. In this article, I go over how a C++ program gets compiled to a binary, and why I love C++. This is partly for my own benefit: writing documentation helps me understand how C++ works.

There are only two kinds of languages: the ones people complain about and the ones nobody uses.
― Bjarne Stroustrup, The C++ Programming Language

I got pretty inspired by HaoranWang's CRUST and thus wanted to write my own compiler for C. I'll probably stick to Rust. Also, shoutout to ShivyC, that's where I got this idea from.

Let's get started with the compilation pipeline!

[Image: compilation flow diagram]

What you see above is the compilation flow taken from NerdyElectronics.com.

For the purpose of this article, we will use a simple addition problem with predefined values.

//a.cpp program
#include <iostream>
using namespace std;

int main()
{
    int firstNumber = 2, secondNumber = 4, sumOfTwoNumbers;

    // sum of two numbers is stored in variable sumOfTwoNumbers
    sumOfTwoNumbers = firstNumber + secondNumber;

    // Prints sum 
    cout << firstNumber << " + " <<  secondNumber << " = " << sumOfTwoNumbers;     

    return 0;
}

If you go back to the diagram, you can see we are presently at the preprocessing stage. Let's have a quick look at the translation unit. A translation unit is the input the compiler proper works on: your source file after the preprocessor has included the header files and expanded the macros.

You can get your translation unit dump using the following command:

g++ <filename>.cpp -E

The dump looks something like this

# 1 "a.cpp"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "a.cpp"
# 1 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\iostream" 1 3
# 36 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\iostream" 3

# 37 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\iostream" 3

# 1 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\mingw32\\bits\\c++config.h" 1 3
# 236 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\mingw32\\bits\\c++config.h" 3

# 236 "c:\\mingw\\lib\\gcc\\mingw32\\8.2.0\\include\\c++\\mingw32\\bits\\c++config.h" 3
namespace std
{
  typedef unsigned int size_t;
  typedef int ptrdiff_t;


It's too damn long to post here (because it literally pulls in the whole iostream header and everything that header includes, which typically comes to tens of thousands of lines), but run it on your system if you're curious.
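A smaller experiment makes the preprocessor's work easier to see. The file below is a hypothetical example (the file, macro and function names are mine, not part of a.cpp) with no #include at all, so running it through g++ -E should give a dump of only a few lines, with the macros replaced by their expansions.

```cpp
// macros.cpp: hypothetical example, not part of the article's a.cpp

// Object-like macro: every later BUFFER_SIZE token is replaced by 16
#define BUFFER_SIZE 16

// Function-like macro: the argument is pasted in as text, hence the parentheses
#define SQUARE(x) ((x) * (x))

// After preprocessing, the return statement reads: return ((n) * (n)) + 16;
int squarePlusBuffer(int n)
{
    return SQUARE(n) + BUFFER_SIZE;
}
```

Comments are stripped at this stage too, so none of them should show up in the dump.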

Assembly Code

Detouring for a bit: word on the grapevine is that the closer you are to the hardware, the faster you will be. While there is a modicum of truth to this, "slower" languages like Python are often slow because they're interpreted, or are memory hogs due to dynamic typing. There are plenty of Python-to-C/C++ compilers, and plenty of projects that help you do Python "faster". Don't, for the love of God, develop something in a certain language just because it is "closer to the hardware".

Anyway, now run this on the a.cpp file we had

g++ a.cpp -S

Now you'll have something like this

    .file   "a.cpp"
    .text
    .section .rdata,"dr"
__ZStL19piecewise_construct:
    .space 1
.lcomm __ZStL8__ioinit,1,1
    .def    ___main;    .scl    2;  .type   32; .endef
LC0:
    .ascii " + \0"
LC1:
    .ascii " = \0"
    .text
    .globl  _main
    .def    _main;  .scl    2;  .type   32; .endef
_main:
LFB1502:
    .cfi_startproc
    leal    4(%esp), %ecx
    .cfi_def_cfa 1, 0
    andl    $-16, %esp
    pushl   -4(%ecx)
    pushl   %ebp
    .cfi_escape 0x10,0x5,0x2,0x75,0
    movl    %esp, %ebp
    pushl   %ecx
    .cfi_escape 0xf,0x3,0x75,0x7c,0x6
    subl    $36, %esp
    call    ___main
    movl    $2, -12(%ebp)
    movl    $4, -16(%ebp)
    movl    -12(%ebp), %edx
    movl    -16(%ebp), %eax
    addl    %edx, %eax
    movl    %eax, -20(%ebp)
    movl    -12(%ebp), %eax
    movl    %eax, (%esp)
    movl    $__ZSt4cout, %ecx
    call    __ZNSolsEi
    subl    $4, %esp
    movl    $LC0, 4(%esp)
    movl    %eax, (%esp)
    call    __ZStlsISt11char_
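To connect the listing back to the source: the two movl instructions with $2 and $4 store firstNumber and secondNumber in stack slots, addl adds them, and the movl right after it stores the result in the slot for sumOfTwoNumbers. A minimal sketch, assuming you want a shorter listing to read: the hypothetical function below (addTwoNumbers and add.cpp are my names, not from the article) contains only that arithmetic, so compiling it with -S leaves out all the iostream calls.

```cpp
// add.cpp: hypothetical file containing only the arithmetic from main()
int addTwoNumbers(int firstNumber, int secondNumber)
{
    // Without optimization, expect both values to be loaded, one add,
    // and a store to a stack slot, much like in the listing above
    int sumOfTwoNumbers = firstNumber + secondNumber;
    return sumOfTwoNumbers;
}
```

Adding -O2 to the command makes a nice follow-up experiment, since optimized output is usually a lot shorter.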

Again, the full thing is too damn long, so try it out on your own system! The output is generated for your target architecture; finding out what your system's architecture is makes a good exercise. Each architecture has a different instruction set that its processor understands, so your compiler splits the work into two steps:

  1. Create an Abstract Syntax Tree
  2. Generate architecture dependent instructions

Let's go over what that means.

Abstract Syntax Trees

An Abstract Syntax Tree (AST) is, well, abstract from the target architecture. However, that's not where the "abstract" part of the term comes from. According to Wikipedia, abstract refers to the fact that "it does not refer to every detail appearing in the real syntax, but rather just structural or content related details". ASTs are generated after syntax analysis, and any program that parses correctly can be turned into one. For our code, this is what the AST looks like:
[Image: graph generated from a.cpp]

Here's how to do it yourself

g++ -fdump-tree-all-graph a.cpp -o a
dot -Tpng a.cpp.013t.cfg.dot -o a.png

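The image is rendered with GraphViz, so install it for your command line first. You can also paste the contents of a.cpp.013t.cfg.dot into any online GraphViz visualizer. Two caveats: the number in the dump file name (013t here) can differ between GCC versions, and the .cfg dump is strictly the control-flow graph that GCC builds from its tree representation, which is a later stage than the raw AST.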
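To make the idea of a tree concrete, here is a minimal sketch of an AST for the expression firstNumber + secondNumber with the values 2 and 4 filled in. Node, leaf, binary, evaluate and evaluateSampleTree are hypothetical names for illustration; GCC's internal tree nodes are far richer than this.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical node type: an operator or a literal, plus up to two children
struct Node {
    std::string label;            // "+" for addition, or the digits of a literal
    std::unique_ptr<Node> left;   // empty for literals
    std::unique_ptr<Node> right;  // empty for literals
};

// Builds a literal node such as "2"
std::unique_ptr<Node> leaf(const std::string& text)
{
    std::unique_ptr<Node> node(new Node());
    node->label = text;
    return node;
}

// Builds an operator node that owns its two operands
std::unique_ptr<Node> binary(const std::string& op,
                             std::unique_ptr<Node> left,
                             std::unique_ptr<Node> right)
{
    std::unique_ptr<Node> node(new Node());
    node->label = op;
    node->left = std::move(left);
    node->right = std::move(right);
    return node;
}

// Walks the tree; this sketch only understands "+" and integer literals
int evaluate(const Node& node)
{
    if (node.label == "+") {
        return evaluate(*node.left) + evaluate(*node.right);
    }
    return std::stoi(node.label);
}

// The tree for "2 + 4": a "+" node with two literal children
int evaluateSampleTree()
{
    std::unique_ptr<Node> tree = binary("+", leaf("2"), leaf("4"));
    return evaluate(*tree);
}
```

Notice what is missing: no semicolons, no whitespace, no parentheses. That is the "abstract" part; the shape of the tree already encodes them.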

Object File and Linking

An object file contains object code, which is essentially machine code (or some intermediate code). It is the "object" of the compiling process, as you can see in this classic article. The reason I didn't use that fancy "Phases of Compiling Process" chart is that it kind of abstracts away the real process of compilation. In due time, we will talk about that too. Before we go ahead, create your object (.o) file using this:

g++ a.cpp -c

Now, let's look at linking, which you do after you create your object files. The object files are linked together to create a single executable file. For this, let me divide the program into a header and a main CPP file.

//a.h
#include <stdio.h>
void printLinker()
{
    printf("Hello World");
}

Now, let's call that from another file, a.cpp

#include "a.h"
int main()
{
    printLinker();
    return 0;
}

Finally, to show the linking, let's create another source file

//We will name this a2.cpp
void printLinker();

Compiling and running a.cpp would give me a Hello World, as expected. But we need to see the linking, right?

g++ a.h -c
g++ a.cpp -c
g++ a2.cpp -c

Now, we are back to having .o object files and a .gch (a precompiled header; if it is not found, the compiler looks for the plain header). Let's link!

g++ a.o a2.o -o a2.exe

Nice, you see how we just passed in the two object files and linked them? Now we just need to run a2.exe, which should print Hello World.

./a2.exe
Hello World
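In a more conventional layout, the header carries only a declaration and the definition lives in its own source file, which gives the linker something real to resolve. A minimal sketch, assuming hypothetical functions linkerMessage and greet (my names, not the article's); everything is shown in one listing so that it compiles as is, with comments marking the file each part would live in.

```cpp
#include <string>

// ---- would live in a.h: declaration only, no function body ----
std::string linkerMessage();

// ---- would live in a.cpp: the call compiles against the declaration,
// ---- leaving an undefined symbol in a.o for the linker to fill in
std::string greet()
{
    return linkerMessage() + "!";
}

// ---- would live in a2.cpp: the one definition the linker resolves to ----
std::string linkerMessage()
{
    return "Hello World";
}
```

Split across three files like this, a.o would carry the symbol as undefined and a2.o would carry the definition, and linking the two is what joins them up.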

Perfect. If you want to see what lies inside these object files, you can use the nm tool.

nm a.o

You get the following

00000000 b .bss
00000000 d .data
00000000 r .eh_frame
         U ___main
00000015 T _main
         U _printf
00000000 r .rdata
00000000 r .rdata$zzz
00000000 t .text
00000000 T __Z7print_av

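You can do the same for the other object file! The output is well sectioned: each line is an address, a type letter and a symbol name. T marks a symbol defined in the code (.text) section, while U marks an undefined symbol, such as _printf, that the linker still has to resolve.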
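The odd-looking names in the nm output are mangled C++ symbols: __Z7print_av decodes to a function print_a taking no arguments, while _main and _printf keep plain C-style names. A minimal sketch of why mangling exists, using hypothetical functions (describe, describe_c and callBoth are my names): overloads share a name in the source, so the compiler encodes the parameter types into each symbol, and extern "C" switches that off.

```cpp
// Two overloads: same source name, different parameter types
int describe(int) { return 1; }
int describe(double) { return 2; }

// extern "C" asks for an unmangled, C-style symbol name
extern "C" int describe_c(int value) { return value; }

// Overload resolution happens at compile time, so each call
// below refers to one specific mangled symbol
int callBoth()
{
    return describe(10) + describe(2.5);
}
```

With GCC's usual mangling scheme, running nm on the object file for this sketch should list the overloads as something like _Z8describei and _Z8described (with one more leading underscore on 32-bit MinGW), next to a plain describe_c.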

Whew, that was a lot, but that's how a compiler compiles. Let me just conclude real quick with the stages we went through:

  • Preprocessing
  • Compilation
  • Assembly
  • Linking

That's about it, folks! See you around in part 2, which I will update here.
