JayZho

Posted on May 7, 2022 • Edited on May 11, 2022

A poor introduction to compilers.(still updating)

#compiler #programming

What is a compiler

The job of a compiler is to turn source code that's written by humans:

int i = 0;
i = 10 + 20;

into machine code(a bunch of 0s and 1s) that's understood by computers in order to run the program: (say, 8-bit instructions)

01110010
01010111
...

Usually, the compilation process involves converting the source code into intermediate code called Assembly Code first, then further turning the assembly code into machine code.
The second step is relatively straight forward as it's mostly just a 1-to-1 replacement from assembly instruction to its binary representation.
This blog is largely about how we go from source code to assembly code.

From Source Code to Assembly

The big picture of this process is as shown below:

Step1: Lexical Analysis

-- turn the source code into a bunch of tokens.

In order to discuss lexical analysis, we should first have a look at what a programming language formally is.

Very much like natural languages(e.g. English), a programming language is just a set of strings(called "tokens"), where each string/token is a sequence of symbols allowed in this programming language.
For example, in English, the set of allowed symbols are roughly the 26 alphabets plus the standard English punctuation. Therefore,

How are you?

is a valid English sentence, while

H@w are you?

isn't, since "@" isn't a valid symbol defined in the language.

A programming language normally consists of 5 kinds of tokens:

Identifiers: variable names, function names are both identifiers.
Keywords: reserved words such as "if", "while" and "return".
Operators: +, -, <, >,
Separators: brackets, comma and semicolon.
Literals: source representation of a value

The job of this lexical analysis step is to scan through the source code and produce a set of tokens. For example, for a program that only has one line of code:

int a = 0;

The list of tokens produced after lexical analysis would be something like:

[
  Keyword("int"),
  Identifier("a"),
  Operator("="),
  Literal("0"),
  Separator(";")
]

Step2: Syntax Analysis

-- given a list of tokens, build a tree to represent the structure of this program.

In order for a sentence to make sense, it must be both **syntactically **and **semantically **correct. The syntax of a program deals with the program structure, while the semantics of a program deals with the program logic.
For example, the following sentence would be considered syntactically incorrect in general:

Tom likes.

since the verb "like" is missing its object.
A sentence could also be syntactically correct while being semantically incorrect. The sentence

Tom sees sounds.

obviously doesn't make sense, but it's syntactically correct as it consists of the subject, the verb and the object of a sentence.

The syntax rules of a language are called "Grammar", which defines what the structure of a language could be.

DEV Community

A poor introduction to compilers.(still updating)

What is a compiler

From Source Code to Assembly

Step1: Lexical Analysis

-- turn the source code into a bunch of tokens.

Step2: Syntax Analysis

-- given a list of tokens, build a tree to represent the structure of this program.

Top comments (0)

Read next

My Experience with the OSS.GG Hackathon: Diving into Open Source with Twenty, Unkey, and Formbricks

Brisa Framework Presentation - October 2024

Sealed Interfaces vs. Sealed Classes in Kotlin: When and Why to Use Each

Best Websites to Find Remote Jobs in 2025