Compilers, and how to write one !

#compilers #rust #llvm

Hello and welcome to a new blog post!

Today we will talk about compilers and how you can write one in Rust.

Programming Languages

There are two types of programming languages: compiled and interpreted. Programs written in compiled languages need to be compiled before they can be run as a binary on a computer, while programs written in interpreted languages are run directly on a computer using just-in-time compilation (JIT). Examples of compiled languages include C, C++, Java, C#, Go, etc., while examples of interpreted languages include Python, Ruby, JavaScript, etc.

Compilers

A compiler is a program that takes a human-readable programming language and converts it to machine code that can be understood by the computer. As we know, computers can only understand zeros and ones, so how does a compiler actually work?

LLVM

You may have heard of LLVM, which is a tool that simplifies the process of building compilers. It is written in C++ and provides a collection of modular and reusable compiler and toolchain technologies. LLVM serves as the middle layer of a complete compiler system, taking intermediate representation (IR) code from a compiler and emitting optimized IR. This new IR can then be converted and linked into machine-dependent assembly language code for a target platform. While LLVM is a powerful and widely-used tool in the compiler world, in this article, we will focus on the fundamentals of compilers and provide examples of how to write parts of a compiler in Rust.

The Compiler

A simple compiler can be broken down into three main parts:

A Parser (or Lexer)
A Transformer
A Code Generator

The Parser

The parser is the part of the compiler that tokenizes the code and converts it into a list of tokens.

Tokenization, also known as lexical analysis, is the process of breaking down a program into a list of tokens. Each token is a small, single unit of code that the compiler can understand. Let's take a look at a simple Rust program:

fn main() {
    println!("Hello, world!");
}

In this Rust program, we have a function named main, and inside the function, there is a println! macro that prints the string "Hello, world!" to the console. After tokenization, the program will be broken down into a list of tokens. For example, fn is a function keyword, main is the name of the function, and the parentheses are the opening and closing parentheses of the function. Similarly, the opening and closing curly braces, println is the println macro, and the parentheses are the opening and closing parentheses of the macro.

After tokenizing the code, the parser converts the list of tokens into a tree of nodes called an Abstract Syntax Tree (AST). The AST is an abstract representation of the code. This step is called syntactic analysis.

The AST for the program above will look something like this:

AST {
    kind: "Program",
    children: [
        AST {
            kind: "Function",
            children: [
                AST {
                    kind: "Identifier",
                    value: "main",
                    children: []
                },
                AST {
                    kind: "Block",
                    children: [
                        AST {
                            kind: "Println",
                            children: [
                                AST {
                                    kind: "String",
                                    value: "Hello, world!",
                                    children: []
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

The Transformer

After parsing the code, we have an AST. However, the AST is not easily understandable by the computer, and it is also a deeply nested structure. Therefore, we need to transform the AST into a

more readable structure called a Syntax Tree.

The Transformer can be broken down into two main parts or functions:

Traversers
Visitors

The Traversers are functions that traverse the AST in a depth-first manner. For example, in the AST mentioned above, the Traverser will apply a function to the root node (Program node) and then apply the function to its children, and so on, until it reaches the last child.

A Traverser function can be implemented using the Visitor pattern, which is a pattern that allows you to implement a function that works on a node and its children and then apply the function to the node and its children.

Visitors are usually objects with methods that accept a node and return a value. Let's implement a simple Transformer that takes an AST and transforms it into a syntax tree by passing it to a Traverser.

use std::collections::HashMap;

struct Visitor {
    nodes: HashMap<String, String>
}

struct Traverser {
    visitor: Visitor
}

struct Node {
    kind: String,
    children: Vec<Node>
}

impl Visitor {
    fn new() -> Visitor {
        Visitor {
            nodes: HashMap::new()
        }
    }

    fn visit(&mut self, node: &Node) {
        self.nodes.insert(node.kind.clone(), node.kind.clone());
        for child in &node.children {
            self.visit(child);
        }
    }
}

impl Traverser {
    fn new() -> Traverser {
        Traverser {
            visitor: Visitor::new()
        }
    }

    fn traverse(&mut self, node: &Node) {
        self.visitor.visit(node);
    }
}

struct SyntaxTree {
   kind: String,
   body: Vec<Node>
}

fn Transformer(AST: &AST) -> SyntaxTree {
    let mut syntax_tree = SyntaxTree {
        kind: "Program".to_string(),
        body: Vec::new()
    };
    let mut traverser = Traverser::new();
    traverser.traverse(AST);
    syntax_tree.body = traverser.visitor.nodes.clone();
    syntax_tree
}

The Code Generator

The Code Generator is the part of the compiler that generates machine code from the syntax tree. The code generator calls itself recursively to generate the code from the syntax tree into a single string, which represents the machine code.

Here is an example of how you can implement the code generator function in Rust:

fn generate_code(syntax_tree: &SyntaxTree) -> String {
    let mut code = String::new();
    match syntax_tree.kind.as_str() {
        "Program" => {
            for child in &syntax_tree.body {
                code.push_str(&generate_code(child));
            }
        },
        "Function" => {
            code.push_str("fn ");
            code.push_str(&syntax_tree.body[0].kind);
            code.push_str("(");
            for child in &syntax_tree.body[1].body {
                code.push_str(&child.kind);
                code.push_str(", ");
            }
            code.pop();
            code.pop();
            code.push_str(") {\n");
            for child in &syntax_tree.body[2].body {
                code.push_str(&generate_code(child));
            }
            code.push_str("}\n");
        },
        "Block" => {
            for child in &syntax_tree.body {
                code.push_str(&generate_code(child));
            }
        },
        "Println" => {
            code.push_str("println!(");
            code.push_str(&syntax_tree.body[0].kind);
            code.push_str(");\n");
        },
        "String" => {
            code.push_str("\"");
            code.push_str

(&syntax_tree.body[0].kind);
            code.push_str("\"");
        },
        "Identifier" => {
            code.push_str(&syntax_tree.body[0].kind);
        },
        _ => {
            code.push_str("<unknown>");
        }
    }
    code
}

After implementing the parser, the transformer, and the code generator, you can use composition to tie everything together, and voila, you have your own compiler.

Conclusion

Compilers are essential tools in programming, and while you don't necessarily need to know how to write a compiler, learning how compilers work can help you understand your tools better and improve your overall programming skills.