In this series, we are going to build a compiler. I assume you have limited knowledge of how compilers work. Building one is a good exercise, and a fun one too. So you have come to the right place if you want to create your own programming language, or just write another compiler for an existing one.
Where to start?
First step: define your language if you want to create a new one (if you are implementing an existing language, say another C compiler like GCC, you may skip this step).
Second step: choose the language in which you will implement your compiler. Choose one you are somewhat comfortable with. It helps a lot if your language of choice has pattern matching. [My choice: Rust]
Third step: start the implementation.
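To show why pattern matching helps, here is a minimal sketch in Rust. The `Token` enum and the `describe` function are hypothetical names made up for this illustration; they are not part of any library. A `match` lets us handle every kind of token in one place, pull out the data it carries, and have the compiler warn us if we forget a case.

```rust
/// A hypothetical set of token kinds, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Ident(String),
    Number(i64),
    Plus,
    Semicolon,
}

/// Produce a human-readable description of a token.
/// Each arm destructures one variant; the compiler rejects the
/// match if any variant is left unhandled.
pub fn describe(token: &Token) -> String {
    match token {
        Token::Ident(name) => format!("identifier '{}'", name),
        // A guard narrows an arm without adding a new variant.
        Token::Number(n) if *n < 0 => format!("negative number {}", n),
        Token::Number(n) => format!("number {}", n),
        Token::Plus => "operator '+'".to_string(),
        Token::Semicolon => "semicolon".to_string(),
    }
}
```

Calling `describe(&Token::Ident("my_variable".to_string()))` returns the string `identifier 'my_variable'`. In a language without pattern matching, the same logic usually turns into a chain of type checks and casts.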
How does a compiler work?
First, a compiler reads your file and converts the code you have written into tokens. A token is the smallest meaningful unit of a program: a comma, a semicolon, an identifier like "my_variable", a keyword like "if", or an operator like "+". This step is called lexing.
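Here is a rough sketch of what a lexer for such tokens could look like in Rust. Everything in it (the `Token` variants, the `lex` function, the choice of supported characters) is an assumption made for illustration, not the lexer this series will end up with.

```rust
/// Token kinds for a toy language (hypothetical, for illustration).
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Ident(String),
    Number(i64),
    If,
    Plus,
    Comma,
    Semicolon,
}

/// Convert source text into a flat list of tokens.
/// Returns an error message on the first character it does not understand.
pub fn lex(src: &str) -> Result<Vec<Token>, String> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match c {
            // Whitespace only separates tokens; it produces none.
            _ if c.is_whitespace() => i += 1,
            '+' => { tokens.push(Token::Plus); i += 1; }
            ',' => { tokens.push(Token::Comma); i += 1; }
            ';' => { tokens.push(Token::Semicolon); i += 1; }
            // A run of digits becomes one Number token.
            _ if c.is_ascii_digit() => {
                let start = i;
                while i < chars.len() && chars[i].is_ascii_digit() { i += 1; }
                let text: String = chars[start..i].iter().collect();
                let n = text.parse::<i64>().map_err(|e| e.to_string())?;
                tokens.push(Token::Number(n));
            }
            // A run of letters, digits and '_' is a keyword or an identifier.
            _ if c.is_alphabetic() || c == '_' => {
                let start = i;
                while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                    i += 1;
                }
                let word: String = chars[start..i].iter().collect();
                let token = if word == "if" { Token::If } else { Token::Ident(word) };
                tokens.push(token);
            }
            other => return Err(format!("unexpected character '{}'", other)),
        }
    }
    Ok(tokens)
}
```

With this sketch, `lex("if my_variable + 42;")` yields the tokens `If`, `Ident("my_variable")`, `Plus`, `Number(42)` and `Semicolon`. Note that the lexer does not care whether the tokens make sense in that order; that is the parser's job.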
Then it builds a data structure called the AST (Abstract Syntax Tree) from the tokens, representing the structure of your program. This step is called parsing.
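As an illustration, here is a tiny parser sketch in Rust. It assumes a made-up grammar where an expression is just operands (numbers or variables) joined by `+`. The `Token` and `Expr` types and the `parse` and `parse_operand` functions are hypothetical names chosen for this example.

```rust
/// Tokens produced by an earlier lexing step (hypothetical, trimmed down).
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Number(i64),
    Ident(String),
    Plus,
}

/// AST nodes for the toy grammar: operand ('+' operand)*
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Number(i64),
    Variable(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Build an AST from a token slice, or report what went wrong.
pub fn parse(tokens: &[Token]) -> Result<Expr, String> {
    let mut pos = 0;
    let mut left = parse_operand(tokens, &mut pos)?;
    // Fold each "+ operand" into the tree, so that
    // a + b + c parses as (a + b) + c.
    while pos < tokens.len() {
        match &tokens[pos] {
            Token::Plus => {
                pos += 1;
                let right = parse_operand(tokens, &mut pos)?;
                left = Expr::Add(Box::new(left), Box::new(right));
            }
            other => return Err(format!("expected '+', found {:?}", other)),
        }
    }
    Ok(left)
}

/// Parse a single number or variable and advance the position.
fn parse_operand(tokens: &[Token], pos: &mut usize) -> Result<Expr, String> {
    let expr = match tokens.get(*pos) {
        Some(Token::Number(n)) => Expr::Number(*n),
        Some(Token::Ident(name)) => Expr::Variable(name.clone()),
        Some(other) => return Err(format!("expected an operand, found {:?}", other)),
        None => return Err("expected an operand, found end of input".to_string()),
    };
    *pos += 1;
    Ok(expr)
}
```

For the tokens of `1 + x`, `parse` returns `Add(Number(1), Variable("x"))`. A lone `+`, or two numbers in a row, is rejected with an error, which is how syntax errors get reported.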
Note: After parsing, some compilers check for unused variables, unreachable code, or bugs such as references to an undeclared function or variable. This is called static analysis.
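One such check can be sketched in a few lines of Rust. The example below assumes a made-up AST with `let` and `print` statements (the `Stmt` and `Expr` types and the `undeclared_variables` function are hypothetical) and reports every variable that is used before it is declared. Real compilers do much more, but the idea is the same: walk the tree and collect problems without running the program.

```rust
use std::collections::HashSet;

/// Expressions of a toy language (hypothetical, for illustration).
pub enum Expr {
    Number(i64),
    Variable(String),
    Add(Box<Expr>, Box<Expr>),
}

/// Statements of the same toy language.
pub enum Stmt {
    /// let <name> = <expr>
    Let(String, Expr),
    /// print <expr>
    Print(Expr),
}

/// Return the names of variables that are used before being declared,
/// in the order they are encountered.
pub fn undeclared_variables(program: &[Stmt]) -> Vec<String> {
    let mut declared: HashSet<&str> = HashSet::new();
    let mut errors = Vec::new();
    for stmt in program {
        match stmt {
            Stmt::Let(name, value) => {
                // Check the right-hand side first: `let x = x` is an error.
                check_expr(value, &declared, &mut errors);
                declared.insert(name.as_str());
            }
            Stmt::Print(value) => check_expr(value, &declared, &mut errors),
        }
    }
    errors
}

/// Recursively visit an expression and record unknown variable names.
fn check_expr(expr: &Expr, declared: &HashSet<&str>, errors: &mut Vec<String>) {
    match expr {
        Expr::Number(_) => {}
        Expr::Variable(name) => {
            if !declared.contains(name.as_str()) {
                errors.push(name.clone());
            }
        }
        Expr::Add(left, right) => {
            check_expr(left, declared, errors);
            check_expr(right, declared, errors);
        }
    }
}
```

For a program equivalent to `let x = 1` followed by `print x + y`, the function returns a list containing only `y`.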
After the AST is created, we convert each node of the tree into code that can be executed. In jargon, we traverse the AST to translate it into some low-level code.
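To make the traversal concrete, here is a minimal sketch in Rust. It assumes a toy AST of numbers, additions and multiplications, and an invented instruction set for a stack machine (`Push`, `Add`, `Mul`). Both the `Expr` and `Instr` types and the `codegen` function are hypothetical. The function visits the children of a node first and emits the node's own instruction afterwards.

```rust
/// A toy AST (hypothetical, for illustration).
pub enum Expr {
    Number(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Instructions for an imaginary stack machine.
#[derive(Debug, Clone, PartialEq)]
pub enum Instr {
    /// Push a constant onto the stack.
    Push(i64),
    /// Pop two values, push their sum.
    Add,
    /// Pop two values, push their product.
    Mul,
}

/// Walk the tree and append the instructions for `expr` to `out`.
/// Operands are emitted before the operator (post-order traversal).
pub fn codegen(expr: &Expr, out: &mut Vec<Instr>) {
    match expr {
        Expr::Number(n) => out.push(Instr::Push(*n)),
        Expr::Add(left, right) => {
            codegen(left, out);
            codegen(right, out);
            out.push(Instr::Add);
        }
        Expr::Mul(left, right) => {
            codegen(left, out);
            codegen(right, out);
            out.push(Instr::Mul);
        }
    }
}
```

For the tree of `(1 + 2) * 3`, this produces `Push(1)`, `Push(2)`, `Add`, `Push(3)`, `Mul`. The nesting of the tree has been flattened into a linear sequence that a simple machine can execute one step at a time.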
What is 'low level code'?
Many of you might have used Java or Python. Their compilers (both languages have a compiler and an interpreter, but I will get into that later) translate the code you typed into a stream of tokens, then into an AST, and finally into bytecode. This bytecode is then executed by a 'virtual machine'. The VM (Virtual Machine) is basically a program that understands the bytecode produced from the AST and acts on it. To make it clearer, here is an example. Say someone wrote this program:
print("hello world")
The Python compiler will see this and turn it into an AST, and then into bytecode, which is run by the Python interpreter. The interpreter will see the bytecode and say 'Okay, I have to print something on the screen, let me do it.'
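The idea of a VM can be sketched in Rust in a few dozen lines. The instruction set below (`Push`, `Add`, `Print`) is invented for this example and is not Python's real bytecode; the `Value` and `Instr` types and the `run` function are hypothetical names. The VM is just a loop that looks at one instruction at a time and does what it says.

```rust
/// Values the toy VM can hold on its stack (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Str(String),
}

/// A made-up bytecode; NOT the real Python instruction set.
#[derive(Debug, Clone, PartialEq)]
pub enum Instr {
    /// Push a constant onto the stack.
    Push(Value),
    /// Pop two integers, push their sum.
    Add,
    /// Pop one value and print it.
    Print,
}

/// Execute the bytecode. Printed lines are also returned so the
/// caller can inspect what the program wrote.
pub fn run(code: &[Instr]) -> Result<Vec<String>, String> {
    let mut stack: Vec<Value> = Vec::new();
    let mut output = Vec::new();
    for instr in code {
        match instr {
            Instr::Push(value) => stack.push(value.clone()),
            Instr::Add => {
                let right = stack.pop().ok_or("stack underflow")?;
                let left = stack.pop().ok_or("stack underflow")?;
                match (left, right) {
                    (Value::Int(a), Value::Int(b)) => {
                        let sum = a.checked_add(b).ok_or("integer overflow")?;
                        stack.push(Value::Int(sum));
                    }
                    _ => return Err("Add expects two integers".to_string()),
                }
            }
            Instr::Print => {
                let line = match stack.pop().ok_or("stack underflow")? {
                    Value::Int(n) => n.to_string(),
                    Value::Str(s) => s,
                };
                println!("{}", line);
                output.push(line);
            }
        }
    }
    Ok(output)
}
```

In this toy bytecode, the program above would be two instructions: `Push(Str("hello world"))` followed by `Print`. Running them prints `hello world`. The same two instructions work on any machine for which the `run` function has been compiled.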
This mechanism makes the code platform independent: as long as a VM exists for the platform, the programmer (usually) does not have to worry about whether their code will run on Windows, Ubuntu/Fedora/Arch, Mac, or some other operating system (I use Arch btw).
However, the AST can also be translated into machine code or assembly, as C, C++, and Rust compilers do.
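As a rough sketch of this second route, the Rust function below turns a tiny AST into lines of x86-64 assembly text (Intel syntax). The `Expr` type and the `emit_asm` function are hypothetical, and the output is only a fragment: a real compiler would also emit an entry point and section directives, and would use registers far more cleverly than pushing everything onto the stack.

```rust
/// A toy AST (hypothetical). Numbers are i32 so they fit
/// in the immediate operand of an x86-64 `push`.
pub enum Expr {
    Number(i32),
    Add(Box<Expr>, Box<Expr>),
}

/// Append x86-64 (Intel syntax) instructions that leave the value
/// of `expr` on top of the machine stack.
pub fn emit_asm(expr: &Expr, out: &mut Vec<String>) {
    match expr {
        Expr::Number(n) => out.push(format!("    push {}", n)),
        Expr::Add(left, right) => {
            // Evaluate both operands; each leaves its result on the stack.
            emit_asm(left, out);
            emit_asm(right, out);
            // Pop the right operand, then the left, add, push the sum.
            out.push("    pop rbx".to_string());
            out.push("    pop rax".to_string());
            out.push("    add rax, rbx".to_string());
            out.push("    push rax".to_string());
        }
    }
}
```

For the tree of `1 + 2`, this emits `push 1`, `push 2`, `pop rbx`, `pop rax`, `add rax, rbx`, `push rax`. Unlike bytecode, this output is tied to one processor family, which is the price paid for running without a VM.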
Now that you know how code is translated into an executable form, you should also know that the process of translating the AST into bytecode or machine code is called code generation, or 'codegen'.
Any suggestions are welcome. I will be implementing the compiler in the upcoming posts.
I have already made a compiler based on LLVM and will add its code in later posts. The lexer, parser and LLVM driver are ready: Llama-lang