
Simmypeet
Crafting a Compiler in Rust: Lexical Analysis

Let's create a project for the compiler, named pernix-lang. The folder pernix-lang is the root of a Cargo workspace, and I'll separate each compiler phase into its own library crate.

The Pernix Project Crate

Before going into writing the compiler, let's create a crate called pernix_project.

cargo new --lib compiler/pernix_project

Add the newly created crate's path to the Cargo.toml workspace file.

[workspace]
members = [
    "compiler/pernix_project",
]

The workspace should look something like this:

├── Cargo.toml
├── compiler/pernix_project
│   ├── Cargo.toml
│   └── src
│       └── lib.rs

This crate contains all the code related to the language's project, such as source files, directories, targets, and libraries.

Let's write code for the pernix_project crate. I'll create a new module inside the crate named source_code. This module contains the SourceCode and SourcePosition structs, which model the source-file input.

  • The SourceCode struct manages the content of a source file and provides utility functions, such as retrieving a line of the file by its line number.
  • The SourcePosition struct consists of two fields: the line and column numbers.

Full Code Listing of source_code.rs

use std::ops::Range;

/// Represents a source code file
pub struct SourceCode {
    source_code: String,
    source_name: String,
    new_line_ranges: Vec<Range<usize>>,
}

/// Represents a particular position in the source code, consisting of column
/// and line
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct SourcePosition {
    pub line: usize,
    pub column: usize,
}
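The constructor and the line accessor are elided from the listing above. Here's a minimal sketch of how they might work (an assumption on my part, not necessarily the repository's exact code): the constructor precomputes the byte range of every line so that line becomes a cheap slice lookup.

```rust
use std::ops::Range;

/// Sketch of `SourceCode` with its elided methods filled in.
#[allow(dead_code)]
pub struct SourceCode {
    source_code: String,
    source_name: String,
    new_line_ranges: Vec<Range<usize>>,
}

impl SourceCode {
    pub fn new(source_code: String, source_name: String) -> Self {
        // Precompute the byte range of every line, excluding the trailing
        // newline, so `line` never has to rescan the string.
        let mut new_line_ranges = Vec::new();
        let mut line_start = 0;
        for (index, character) in source_code.char_indices() {
            if character == '\n' {
                new_line_ranges.push(line_start..index);
                line_start = index + 1;
            }
        }
        // The final line might not end with a newline.
        if line_start < source_code.len() {
            new_line_ranges.push(line_start..source_code.len());
        }
        Self { source_code, source_name, new_line_ranges }
    }

    /// Returns the content of the 1-based `line` number, without the newline.
    pub fn line(&self, line: usize) -> Option<&str> {
        let range = self.new_line_ranges.get(line.checked_sub(1)?)?;
        Some(&self.source_code[range.clone()])
    }
}
```

With the ranges precomputed up front, every line lookup is O(1) plus a slice.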

Here's a snippet showing the SourceCode struct in action.

#[test]
fn test_line() {
    let source_code = SourceCode::new(
        "Hello, world!\n".to_string()
            + "My Name Is Pernix\n"
            + "I am a programming language\n"
            + "I am written in Rust\n"
            + "\n",
        "test".to_string(),
    );

    assert_eq!(source_code.line(1).unwrap(), "Hello, world!");
    assert_eq!(source_code.line(2).unwrap(), "My Name Is Pernix");
    assert_eq!(source_code.line(3).unwrap(), "I am a programming language");
    assert_eq!(source_code.line(4).unwrap(), "I am written in Rust");
    assert_eq!(source_code.line(5).unwrap(), "");
}

The Pernix Lexer Crate

Before writing the actual code, let's discuss Lexical Analysis in more depth. Generally speaking, Lexical Analysis is the phase where the source code is fed to a lexer (or scanner) and converted into a stream of tokens. When working with a lexer, you will likely stumble across these terms: lexeme and token.

  • A lexeme is a sequence of characters matched by a well-defined pattern or rule.
  • A token is the syntactic category that classifies a lexeme.

For example, the following code would produce a token stream like this:

int main() {
    return 0;
}
TokenKind    Lexeme
----------   ------
IDENTIFIER   int
IDENTIFIER   main
PUNCTUATOR   (
PUNCTUATOR   )
PUNCTUATOR   {
KEYWORD      return
LITERAL      0
PUNCTUATOR   ;
PUNCTUATOR   }

Now that we've understood the basics of the Lexical Analysis phase, let's create a library crate that contains the lexer for the compiler, pernix_lexer.

cargo new --lib compiler/pernix_lexer
[workspace]
members = [
    "compiler/pernix_project",
    "compiler/pernix_lexer",
]

Let's create a new module in this crate named token. This module contains the code related to the lexical token.

Now, let's write an enum with all token-kind values.

/// Enumeration containing all patterns of token
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Identifier,
    Space,
    Comment,
    Keyword(Keyword),
    LiteralConstant,
    Punctuator(char),
    EndOfFile,
}

There are seven kinds of tokens defined here; let's discuss each of them one by one.

  • An Identifier is a string of characters used to give a name to a particular symbol or declaration.
  • A Space represents a run of whitespace characters in the source file.
  • A Comment represents text in the source file that doesn't alter the program's behavior but serves as an annotation for the programmer.
  • A Keyword is a string similar to an identifier, but one the language reserves because it has a special meaning that commands particular program behavior.
  • A Literal Constant is any hard-coded value in the source file, including numbers, boolean values, and strings.
  • A Punctuator is a token with syntactic and semantic meaning to the compiler, but the exact significance depends on the context.
  • An EndOfFile signifies the end of the token stream.
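The Keyword(Keyword) variant wraps a separate Keyword enum that isn't listed above. Since the language only reserves return so far, a minimal version might look like this (an assumption, not necessarily the repository's exact definition):

```rust
/// Reserved keywords of the language; only `return` exists so far.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Keyword {
    Return,
}
```

Deriving Copy and Eq keeps the enum cheap to compare when the lexer later checks a lexeme against the keyword table.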

Now let's write a Token struct, the data structure that the lexer will produce.

use std::ops::Range;

/// Represents a single token word; contains the kind of the token and its lexeme
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token<'a> {
    token_kind: TokenKind,
    position_range: Range<SourcePosition>,
    lexeme: &'a str,
}

This struct has only three fields: the kind of the token, a string slice representing the lexeme, and the range of positions in the source file that the token spans.

Error Handling

Having a sophisticated error-handling system is crucial when writing a compiler; it should collect helpful information and report a useful message to the user.

Let's create a new module called error. This module contains code related to the error-handling system.

Next, create an enum called Error, containing all possible Lexical Error types.

/// An enumeration containing all lexical errors
pub enum Error<'a> {
    InvalidCharacter {
        position: SourcePosition,
        character: char,
        source_refernece: &'a SourceCode,
    },
    UnterminatedMultilineComment {
        multiline_comment_position: SourcePosition,
        source_refernece: &'a SourceCode,
    },
}

Each enum variant stores all the information necessary to report the error message.

Next, implement an associated function that prints out the error message based on the stored information.

Full Code Listing of error.rs

impl<'a> Error<'a> {
    /// Prints the error message to the standard output
    pub fn print_error(&self) { ... }
}

The ellipses here signify that the code is lengthy and is not included here. However, the complete code listing is available in the GitHub repository linked at the beginning.
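To give a flavor of what print_error might produce for the InvalidCharacter case, here is a hypothetical, stripped-down formatter. The function name and output format are my own invention; the repository's code differs.

```rust
/// Formats a hypothetical diagnostic for an invalid character, pointing a
/// caret at the offending column of the source line.
fn format_invalid_character(
    source_name: &str,
    line_text: &str,
    line: usize,
    column: usize,
    character: char,
) -> String {
    let mut message = String::new();
    message.push_str(&format!("error: invalid character `{character}`\n"));
    message.push_str(&format!(" --> {source_name}:{line}:{column}\n"));
    message.push_str(&format!("  | {line_text}\n"));
    // The caret points at the 1-based column of the bad character.
    message.push_str(&format!("  | {}^\n", " ".repeat(column - 1)));
    message
}
```

Keeping the formatting logic in one place means every error variant can be reported in the same consistent style.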

Writing the Lexer

Now, back to writing the lexer. I will write the lexer's code in the lib.rs file. The Lexer struct consumes characters and emits one token at a time -- kind of like an iterator.

Pseudocode for using the lexer:

let mut lexer = Lexer::new(&source_code);
let first_token = lexer.lex();
let second_token = lexer.lex();

Let's write a Lexer struct.

/// Represents a lexer in lexical analysis which transforms source code into a
/// token stream
pub struct Lexer<'a> {
    source_code: &'a SourceCode,
    chars: Peekable<CharIndices<'a>>,
    current_position: SourcePosition,
}

It contains a reference to the input source code, an iterator using Peekable<CharIndices>, and a variable for tracking the current lexing position.

Before writing the lexing code, we need a helper function that eats one character at a time, keeps track of the current source position, and returns the character together with its UTF-8 byte index, which is later used to slice the source code when constructing the lexeme.

impl<'a> Lexer<'a> {
    // Returns the current character and its byte index, then moves to the
    // next character.
    fn next(&mut self) -> Option<(usize, char)> {
        let next = self.chars.next();

        // keep the current source position in sync
        if let Some((_, character)) = next {
            if character == '\n' {
                self.current_position.line += 1;
                self.current_position.column = 1;
            } else {
                self.current_position.column += 1;
            }
        }

        next
    }
}
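The pseudocode later in this post also calls a self.peek() helper that isn't shown. A plausible implementation over Peekable<CharIndices> (an assumption, with the struct trimmed down so the sketch compiles on its own) looks at the next character without consuming it:

```rust
use std::iter::Peekable;
use std::str::CharIndices;

// Trimmed-down stand-in for the crate's `Lexer`; the real struct also holds
// the `SourceCode` reference and the current `SourcePosition`.
struct Lexer<'a> {
    chars: Peekable<CharIndices<'a>>,
}

impl<'a> Lexer<'a> {
    fn new(source: &'a str) -> Self {
        Self { chars: source.char_indices().peekable() }
    }

    /// Looks at the next character without consuming it; the lexing branches
    /// use this to decide where a token ends.
    fn peek(&mut self) -> Option<char> {
        self.chars.peek().map(|&(_, character)| character)
    }
}
```

Because peek never advances the iterator, calling it repeatedly keeps returning the same character until next is called.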

Next, we need a hash table that stores every reserved keyword in the language. I'll use the lazy_static crate from crates.io for this case.

lazy_static! {
    static ref KEYWORD: HashMap<&'static str, Keyword> = {
        let mut map = HashMap::new();
        map.insert("return", Keyword::Return);
        map
    };
}
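As a side note, if you'd rather not pull in the lazy_static dependency, the same table can be built with the standard library's std::sync::OnceLock (stable since Rust 1.70). This is a design alternative, not what the article's code uses:

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Keyword {
    Return,
}

/// Returns the keyword table, building it on first access.
fn keyword_table() -> &'static HashMap<&'static str, Keyword> {
    static KEYWORD: OnceLock<HashMap<&'static str, Keyword>> = OnceLock::new();
    KEYWORD.get_or_init(|| {
        let mut map = HashMap::new();
        map.insert("return", Keyword::Return);
        map
    })
}
```

Either way, the lexer only needs a single lookup per identifier to decide whether it is a keyword.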

After that, we have everything required for writing the lex function. As discussed earlier, it moves through the characters and returns a token.

Full Code Listing of lib.rs

impl<'a> Lexer<'a> {
    /// Lexes the current word; returns the corresponding token and moves to
    /// the next token.
    pub fn lex(&mut self) -> Result<Token<'a>, Error<'a>> { ... }
}

The actual implementation is lengthy, but I'll briefly explain how the function is implemented.

First, the function eats the first character by calling the next function. If next returns None, the lex function exits immediately and returns the EndOfFile token. Otherwise, the first character gives the function a clue about what kind of token this will be; it is matched against the following if-else branches to determine the token kind.

Python is used here as pseudocode, for brevity.

  • If the first character is whitespace, the function eats all consecutive whitespace characters that follow, if any. The token is categorized as TokenKind::Space.
if first_char.is_whitespace():   
    while self.peek().is_whitespace():
        self.next()
    return SPACE    
  • If the first character is a forward slash, the function checks whether the next character is also a forward slash. If it is, a comment is found: the lexer skips all following characters until a newline is reached, and the token is classified as TokenKind::Comment. Otherwise, the lexer classifies the single forward slash as TokenKind::Punctuator.
else if first_char == '/':
    if self.peek() == '/':
        while self.peek() != '\n' and self.peek() != END_OF_FILE:
            self.next()
        # eat the '\n'
        self.next()
        return COMMENT
    else:
        return PUNCTUATOR
  • If the first character is alphabetic or an underscore, the function eats all following alphabetic or underscore characters. Next, the function checks whether the string that has just been eaten is in the keyword hash table. If it is, the token is TokenKind::Keyword; otherwise, it is TokenKind::Identifier.
else if first_char.is_alphabet_or_underscore():
    string = first_char
    while self.peek().is_alphabet_or_underscore():
        string += self.next()

    if string in keyword_table:
        return KEYWORD
    else:
        return IDENTIFIER
  • This one is the simplest; if the first character is punctuation, simply categorize the token as TokenKind::Punctuator.
  • If the first character is a numeric digit, the function eats all following numeric digits -- similar to the identifier case. The token is classified as TokenKind::LiteralConstant.
else if first_char.is_digit():
    string = first_char
    while self.peek().is_digit():
        string += self.next()

    return LITERAL_CONSTANT

Finally, the last else branch returns the Err variant, signifying that an invalid character was found.

// this character can't be categorized under any token kinds
else {
    return Err(Error::InvalidCharacter {
        position: first_position,
        character: first_char,
        source_refernece: self.source_code,
    });
}
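Putting the branches together, here is a self-contained Rust sketch of the whole dispatch. Positions, lifetimes, and the full Token struct are omitted for brevity, so this is a simplification of the idea rather than the repository's actual lex function:

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, PartialEq)]
enum TokenKind {
    Identifier,
    Space,
    Comment,
    Keyword,
    LiteralConstant,
    Punctuator(char),
    EndOfFile,
}

/// Lexes one token from the character stream; `Err` carries the invalid
/// character, mirroring `Error::InvalidCharacter`.
fn lex(chars: &mut Peekable<Chars<'_>>) -> Result<TokenKind, char> {
    let first = match chars.next() {
        Some(character) => character,
        None => return Ok(TokenKind::EndOfFile),
    };

    if first.is_whitespace() {
        // eat all consecutive whitespace
        while chars.peek().map_or(false, |c| c.is_whitespace()) {
            chars.next();
        }
        Ok(TokenKind::Space)
    } else if first == '/' {
        if chars.peek() == Some(&'/') {
            // line comment: skip until (and including) the newline
            while let Some(&c) = chars.peek() {
                chars.next();
                if c == '\n' {
                    break;
                }
            }
            Ok(TokenKind::Comment)
        } else {
            Ok(TokenKind::Punctuator('/'))
        }
    } else if first.is_alphabetic() || first == '_' {
        let mut word = first.to_string();
        while chars.peek().map_or(false, |c| c.is_alphanumeric() || *c == '_') {
            word.push(chars.next().unwrap());
        }
        // only `return` is reserved so far
        if word == "return" {
            Ok(TokenKind::Keyword)
        } else {
            Ok(TokenKind::Identifier)
        }
    } else if first.is_ascii_digit() {
        while chars.peek().map_or(false, |c| c.is_ascii_digit()) {
            chars.next();
        }
        Ok(TokenKind::LiteralConstant)
    } else if first.is_ascii_punctuation() {
        Ok(TokenKind::Punctuator(first))
    } else {
        Err(first)
    }
}
```

Lexing "return 0;" with this sketch yields Keyword, Space, LiteralConstant, Punctuator(';'), and finally EndOfFile.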

Summary

That's it; we've essentially created a simple lexer in Rust, which can tokenize the source code into a stream of tokens. You can find all the code listings in the GitHub repository.
