The C Compilation Pipeline - Basics

Debdut Chakraborty ・11 min read

Preface

C is one of the most popular and arguably the most powerful programming languages out there. Most tutorials and books only cover what the C syntax looks like, the standard library, and a few example projects. That's it. As a result, not many C programmers know how writing something like int main(){ return 0; } gives you a file that can be run as a program. And here's the important thing: to write meaningful, efficient programs, a programmer must know not just the syntax and the standard library, but also how a text file (source code) gets converted into an unreadable file that's executable. Through this series, I intend to explain as much of this topic as possible. I don't want to spend any more time on the specifics at the top of this article, so I'll leave that for the end. This article gives you the basic idea of how a C source file is translated, leaving the specifics for later chapters.

The compilation flow/pipeline

Consider the following C program,

#include <stdio.h>
int main(void){
    printf("Hello world\n");
    return 0;
}

This is like the first chapter of any book or tutorial. Assuming you're not using an IDE, you generally compile this program with something like the following command:

gcc main.c -o main

While this process takes less than a second, behind the curtains the program goes through four specific stages before you get the executable main. You can think of the C compilation pipeline as a railway line with 4 stations. To get to the final station, you have to pass through the first 3. If the train fails to reach or cross any of those stations, you won't get to the final one. Here are the stations (steps) of our pipeline:

  1. Preprocessing
  2. Compiling
  3. Assembling
  4. Linking

Take a closer look at step 2. That step's named "compiling", but we also call the whole process "compilation", don't we? I even named this series "Compilation Pipeline". Well, when we say we're "compiling" a C program, we usually mean "building" it. Among those 4 steps, the compilation step is the most important, which is why the whole process is named after it. Four components are responsible for these four steps. Their names are pretty predictable too. They are, respectively,

  1. Preprocessor
  2. Compiler
  3. Assembler
  4. Linker

Before moving forward, a couple of things to note:

  1. C programs are cross-platform, not portable

    Being cross-platform means the source code can be compiled on different platforms with minimal changes (if any). The resulting object files (or binary files) are not executable on other platforms. Portable, on the other hand, means you can just copy the final product over to a different platform and it'll just work. For example, Java bytecode is portable. Python programs are portable too (while the Python interpreter isn't).

  2. Platforms don't just mean host operating system

    When I mention "platform", I'm not talking about just the underlying operating system. A platform consists of a hardware part and a software part. The hardware part is the architecture we're building our project for: it can be AMD64, IA-32, or ARM. The software part is the operating system, or the kind of operating system, we're running. A C program depends on both; this is not an 'either this or that' situation. This will be clearer in a moment.
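As a rough illustration of the two parts of a platform, compilers predefine macros that describe the architecture and the operating system being built for. The sketch below assumes the macros commonly predefined by GCC, Clang and MSVC. The function names target_arch and target_os are made up for this example, and the lists are far from exhaustive.

```c
/* Hypothetical helper: names the hardware part of the platform.
   The branch is chosen at compile time from predefined macros. */
const char *target_arch(void)
{
#if defined(__x86_64__) || defined(_M_X64)
    return "AMD64";
#elif defined(__i386__) || defined(_M_IX86)
    return "IA-32";
#elif defined(__aarch64__) || defined(__arm__)
    return "ARM";
#else
    return "unknown";
#endif
}

/* Hypothetical helper: names the software part of the platform. */
const char *target_os(void)
{
#if defined(__linux__)
    return "Linux";
#elif defined(_WIN32)
    return "Windows";
#elif defined(__APPLE__)
    return "Apple";
#else
    return "unknown";
#endif
}
```

Printing the two strings from main with printf shows which branches the compiler picked for the platform you built for. The same source gives different answers on different platforms, which is the sense in which a C program depends on both parts.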

My host

As there are going to be examples throughout this series, I should let you know about the system I'm running and testing them on. My platform consists of the following

  • AMD 64 bit processor.
  • A 64 bit GNU/Linux Operating System.

As for the compiler, I'll be using gcc because it's installed by default, there're so many resources available online for this compiler, and this one's available for almost all the platforms.

If you don't have the same platform, don't worry. Every platform has its own C compiler.

Building the project skeleton

What's a better way of teaching something other than through examples? Here we'll build an example project, and go through the compilation tunnel to see what's happening for ourselves.

The commands used here were run on a Linux-based system. They should just work on any *nix system (like macOS). If you're using Windows, you'll have to adapt them.

First, just create a directory named c-pipeline

mkdir -p ~/c-pipeline/src && cd ~/c-pipeline/src

Here I've created the directory in $HOME, but you can create it anywhere else. I've also created an src directory one level down. Now we need three files: two source files and one header file.

  • Header file: Header files are files with a .h extension. By convention, these files don't contain any executable C instructions. They only contain function declarations or data type definitions (like structs or enums).
  • Source files: Source files are ones that contain executable contents, or the program logic. These end with a .c extension.
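To make the header/source split concrete, here is a minimal sketch of what the two kinds of file typically hold. The file names shapes.h and shapes.c, and every identifier in them, are hypothetical and exist only for illustration. The include guard is a common convention, not something this article's project uses. Both files are shown in one block so the sketch compiles on its own.

```c
/* ---- shapes.h (hypothetical header): declarations and types only ---- */
#ifndef SHAPES_H
#define SHAPES_H

enum shape_kind { SHAPE_SQUARE, SHAPE_RECT };

struct shape {
    enum shape_kind kind;
    int width;
    int height;
};

/* Declaration only: tells the compiler that the function exists. */
int shape_area(const struct shape *s);

#endif /* SHAPES_H */

/* ---- shapes.c (hypothetical source): the executable logic ---- */
/* In a real project this file would begin with: #include "shapes.h" */
int shape_area(const struct shape *s)
{
    return s->width * s->height;
}
```

Only the second half produces machine instructions. The first half just describes names and types to the compiler, which is why headers are included by source files instead of being compiled on their own.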

While building a C project, keep in mind that we only compile source files, not headers. So you won't see gcc something.h -o something happening anywhere. If you're doing something like that, change that. While the extensions don't matter much in the *nix world, they're there to help us understand what is what. Okay, enough theory. Let's write our programs down: copy-paste the following into a file and name it operations.h.

int add(int, int); // Adds two integers
int sub(int, int); // Subtracts two integers
int mul(int, int); // Multiplies two integers
int div(int, int); // Divides two integers

These function declarations are pretty self-explanatory, yet I explained them in the comments. Next, we'll create a file named main.c, this file will contain our main function so it only makes sense naming it main.c. Paste the following in there

#include <stdio.h>
#include "operations.h"
int main(void){
    printf("Adding 10 and 20: %d\n", add(10, 20));
    printf("Subtracting 10 from 20: %d\n", sub(20, 10));
    printf("Multiplying 10 and 20: %d\n", mul(10, 20));
    printf("Dividing 20 by 10: %d\n", div(20, 10));
    return 0;
}

Next, create another source file named operations.c

int add(int a, int b){
    return a+b;
}
int sub(int a, int b){
    return a-b;
}
int mul(int a, int b){
    return a*b;
}
int div(int a, int b){
    return a/b;
}

So basically, our program performs addition, subtraction, multiplication, and division on 10 and 20, in that order. It's a very simple and basic program. Normally we'd just run gcc and get the executable binary without being concerned about any of the details. Now we'll pull back the curtain and see what actually happens.

Let's go through the components step by step.

  1. Preprocessor:

    The first component in the pipeline is the preprocessor. The preprocessor processes the preprocessing directives such as #define, #include, etc. The preprocessor also strips out all the comments in a C file. One thing to keep in mind is that the preprocessor has no knowledge of C grammar. All it does is prepare the translation units.

    A translation unit is what the compiler actually works on: one big chunk of C code generated by the preprocessor from a source file and everything it includes.
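The claim that the preprocessor has no knowledge of C grammar can be seen with a function-like macro. This is a small sketch, and the macro and function names are made up for illustration. The preprocessor pastes the argument text in verbatim, and only the compiler later decides what the resulting expression means.

```c
/* Text substitution only: the argument is pasted in as-is. */
#define NAIVE_SQUARE(x) x * x

/* Parentheses make the pasted text behave like a single expression. */
#define SAFE_SQUARE(x) ((x) * (x))

/* Expands to: 1 + 2 * 1 + 2, which the compiler evaluates as 5. */
int naive_result(void)
{
    return NAIVE_SQUARE(1 + 2);
}

/* Expands to: ((1 + 2) * (1 + 2)), which the compiler evaluates as 9. */
int safe_result(void)
{
    return SAFE_SQUARE(1 + 2);
}
```

Running a file like this through gcc -E shows the expanded text directly, with both macro definitions gone from the output.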

    In short, the preprocessor resolves the preprocessing directives and strips out the comments. In our example, with main.c, the preprocessor will simply copy and paste all the contents of stdio.h and operations.h into it. That's what the #include directive does. Let's test this out a bit differently. Create a separate file and name it dummy.c. In this file, put the following contents

    #include "operations.h"
    // This is a random line.
    Note this is not a comment.
    

    We can make our compiler stop after the preprocessing step by using the -E option for gcc. Let's do that

    gcc -E dummy.c -o dummy.i
    

    Why .i? We'll see in a moment. For now, note that we didn't see an error, even though the third line in our dummy.c file is a random English sentence, not a valid C statement. This is because, as I previously stated, the preprocessor isn't aware of C grammar. Next, look at the contents of the file. It should look like the following

    # 1 "dummy.c"
    # 1 "<built-in>"
    # 1 "<command-line>"
    # 31 "<command-line>"
    # 1 "/usr/include/stdc-predef.h" 1 3 4
    # 32 "<command-line>" 2
    # 1 "dummy.c"
    # 1 "operations.h" 1
    int add(int, int);
    int sub(int, int);
    int mul(int, int);
    int div(int, int);
    # 2 "dummy.c" 2
    
    Note this is not a comment.
    

    Ignore the lines starting with '#' for now. First, notice that all the comments are gone, as I said previously. Also, everything inside operations.h is now inside this file. While our dummy.c isn't a valid C program, if it were, this dummy.i would have been our translation unit or compilation unit.

    The .i extension indicates that our file has already gone through the preprocessor. If we pass a file with a .i extension to gcc, it'll skip the preprocessing stage. As #include is not part of C grammar, renaming the file dummy.c to dummy.i and trying to compile it with gcc dummy.i will give you the following error

    dummy.i:1:1: error: stray ‘#’ in program
        1 | #include "operations.h"
          | ^
    dummy.i:1:10: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before string constant
        1 | #include "operations.h"
          |          ^~~~~~~~~~~~~~
    

    Now that we have the basic idea of what a preprocessor is or what it does, it's time to get our main files processed (preprocessed?). The following command will take care of that

    for c in main.c operations.c; do gcc -E "$c" -o "${c%.c}.i"; done
    
  2. Compiler:

    So we have our translation units ready. Now it's time for compilation. The second and most important component of this pipeline is the compiler. A compiler takes in a translation unit or a valid C code and translates that (compiles) into host architecture-specific assembly language. While assembly is pretty near the hardware, it's still somewhat human-readable.

    Note I said "architecture" specific, not platform-specific. That's because the assembly language itself is defined by the underlying architecture, not by the host operating system (the operating system does influence details such as calling conventions, but that's a topic for later). Before moving forward, note that we only compile the source files. This is important. Although we have .i files now, generally we let the compiler do the preprocessing automatically. So keep in mind that we never do something like gcc something.h.

    There is an option for gcc that lets us only complete the compilation stage. It's -S (capital 'S'). The following command compiles our source files,

    gcc -S main.i operations.i
    

    Now you should have two files, named main.s and operations.s. These are the assembly files, the results of the compilation step. Here are the contents of operations.s

        .file   "operations.c"
        .text
        .globl  add
        .type   add, @function
    add:
    .LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    %edi, -4(%rbp)
        movl    %esi, -8(%rbp)
        movl    -4(%rbp), %edx
        movl    -8(%rbp), %eax
        addl    %edx, %eax
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
        ...
        ...
        ...
    .LFE0:
        .size   add, .-add
        .globl  sub
        .type   sub, @function
    .LFE3:
        .size   div, .-div
        .ident  "GCC: (GNU) 10.2.0"
        .section    .note.GNU-stack,"",@progbits
    

    As you may have noticed, I had to trim it down to keep the line count low. I'm not an assembly language expert, so I can't walk you through these files. Nor is that the topic of this article.

  3. Assembler:

    Now we have our compiled assembly code, so what's next? Next, we need to convert this to something that the computer actually understands, which for sure isn't what we have in the .s files. The third component in this pipeline is the assembler, which unsurprisingly "assembles" the assembly source code into some form of an object file. What's an object file? An object file is a file that contains machine code generated from the assembly source code. As you might've already guessed, we can make gcc stop after the assembly step too. We do that using the -c flag.

    gcc -c main.s operations.s
    

    Here two object files are created, main.o and operations.o. These are called relocatable object files, or intermediate object files. They aren't executable yet: references to symbols defined elsewhere (such as the call to add in main.o) are still unresolved, and final addresses haven't been assigned.

    Why are these object files called Relocatable Object files? That's a discussion for later. For now, just know that it's called that.

    Note that although we've compiled and assembled both of the files together here, that's not what you should be doing in the future. In an actual project we won't use -E, -S and -c separately; we'll just use the -c flag on each of the source files. The -c flag runs the first three stages and generates the respective object file. We compile each source file individually. So if we were to do that in this example project, we'd have compiled it like so

    gcc -c main.c
    gcc -c operations.c
    
  4. Linker:

    Finally, we're here: the final step/component in the pipeline. So what is a linker? A linker is a component that links multiple relocatable object files and generates one executable binary (object) file. To understand this better, we need to step back a little. Remember that in our project we created 3 files in total: the header file that contains the function declarations, the file that contains the main function, and the file that contains the function definitions (add, sub, mul & div).

    Now jump back to just before the compilation step, when we had our preprocessed files ready. Inspect them, starting with main.i, and jump to the very end. The first 700 or so lines came from stdio.h (in my case). After that, we have the contents of operations.h. But they're ONLY function declarations, right? What does add do? main doesn't know. What does sub do? main doesn't know. Understand that we only need to declare a function before using it. Not just functions, but everything declarable. That's why we didn't get any error until now. But what if we were to build from just main.c? Try it. You'll see something like the following

    [debdut@pc compilation-pipeline]$ gcc main.c
    /usr/bin/ld: /tmp/ccgMfUHf.o: in function `main':
    main.c:(.text+0xf): undefined reference to `add'
    /usr/bin/ld: main.c:(.text+0x31): undefined reference to `sub'
    /usr/bin/ld: main.c:(.text+0x53): undefined reference to `mul'
    collect2: error: ld returned 1 exit status
    

    "undefined reference" means that while those functions are declared, the compiler, or to be specific, the linker, couldn't find their definitions. They're defined inside operations.c. This is why we've been preprocessing, compiling, and assembling that file alongside main.c. (div is missing from the errors most likely because the C standard library, which gcc links in by default, happens to provide a function with the same name.) A linker simply links these multiple object files (.o files), filling in the gaps (matching declarations with definitions), and finally generates an executable binary file. Now, gcc uses the ld linker; you can see /usr/bin/ld on the first line of the output above. So let's try using that for linking our object files.

    ld main.o operations.o
    

    Here's what I got,

    ld: warning: cannot find entry symbol _start; defaulting to 0000000000401000
    ld: main.o: in function `main':
    main.c:(.text+0x22): undefined reference to `printf'
    ld: main.c:(.text+0x44): undefined reference to `printf'
    ld: main.c:(.text+0x66): undefined reference to `printf'
    ld: main.c:(.text+0x88): undefined reference to `printf'
    

    This is because while we do have the definitions of our functions in operations.o, we don't have one for printf, which lives in the C standard library. You might see some other errors too. This is why I recommend using gcc for this: it passes the standard library and the startup code to the linker for you. Just pass in the object file names, and it'll link them and generate an executable binary file.

    gcc main.o operations.o
    

    Now you should see a file named a.out in your current directory. Execute it with ./a.out. Also, note that your program doesn't require the main function to pass the first 3 steps in the pipeline. Only the linker needs it, to mark where your program should start executing.
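The point about declarations and definitions can be sketched in a single file. This is only an illustration: add_three and call_count are hypothetical names that don't exist in the article's project, and the definitions at the bottom stand in for what operations.c would provide in a separate object file.

```c
/* Declarations: enough for the compiler to type-check the code below. */
int add(int, int);       /* same declaration as in operations.h */
extern int call_count;   /* hypothetical variable, declared but not defined yet */

/* This function compiles using the declarations alone. If the definitions
   lived in another file, this file's object code would simply record add
   and call_count as references for the linker to resolve. */
int add_three(int a, int b, int c)
{
    call_count += 2;     /* add is called twice below */
    return add(add(a, b), c);
}

/* Definitions: placed here only so the sketch links as a single file.
   Remove them and the compile step still succeeds, but the link step
   reports "undefined reference". */
int call_count = 0;

int add(int a, int b)
{
    return a + b;
}
```

Compiling this with gcc -c succeeds with or without the last two definitions. Only the final link step cares whether they can be found somewhere.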


That concludes this article, "C compilation pipeline - Basics". We've written and built a small, quite useless C program. But being small and useless gave us the opportunity to inspect each of the stages in the compilation pipeline with clarity. I hope you now have a better understanding of how a program is translated from pure text to an executable binary. In the coming chapters, I'll go in depth on each of those components to give you a better understanding of the process. If you'd like to see that happen, leave a comment down below.
