Simmypeet

Crafting a Compiler in Rust: Lexical Analysis

Let's create a project for the compiler. It'll be named pernix-lang. The pernix-lang folder holds the Cargo workspace, and I'll separate each compiler phase into its own library crate.

The Pernix Project Crate

Before going into writing the compiler, let's create a crate called pernix_project.

cargo new --lib compiler/pernix_project

Add the newly created crate's path to the workspace's Cargo.toml file.

[workspace]
members = [
    "compiler/pernix_project",
]

The workspace should look something like this:

├── Cargo.toml
├── compiler/pernix_project
│   ├── Cargo.toml
│   └── src
│       └── lib.rs

This crate contains all the code related to the language's project
such as source files, directories, targets, and libraries.

Let's write code for the pernix_project crate. I'll create a new module inside the crate named source_code. This module contains the SourceCode and SourcePosition structs, which model source file input.

  • The SourceCode struct manages the content of a source file. It provides utility functions such as retrieving the source file's content by line number.
  • The SourcePosition struct consists of two fields: the line and column numbers.

Full Code Listing of source_code.rs

use std::ops::Range;

/// Represents a source code
pub struct SourceCode {
    source_code: String,
    source_name: String,
    new_line_ranges: Vec<Range<usize>>,
}

/// Represents a particular position in the source code, consisting of column
/// and line
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct SourcePosition {
    pub line: usize,
    pub column: usize,
}

Here's a snippet showing the SourceCode struct in action.

#[test]
fn test_line() {
    let source_code = SourceCode::new(
        "Hello, world!\n".to_string()
            + "My Name Is Pernix\n"
            + "I am a programming language\n"
            + "I am written in Rust\n"
            + "\n",
        "test".to_string(),
    );

    assert_eq!(source_code.line(1).unwrap(), "Hello, world!");
    assert_eq!(source_code.line(2).unwrap(), "My Name Is Pernix");
    assert_eq!(source_code.line(3).unwrap(), "I am a programming language");
    assert_eq!(source_code.line(4).unwrap(), "I am written in Rust");
    assert_eq!(source_code.line(5).unwrap(), "");
}
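For reference, here is a minimal sketch of how new and line might be implemented on top of the fields above: new precomputes the byte range of every line so that line lookups are cheap. This is only an illustration; the actual implementation in the repository may differ.

impl SourceCode {
    /// Creates a new `SourceCode`, precomputing the byte range of every line
    pub fn new(source_code: String, source_name: String) -> Self {
        let mut new_line_ranges = Vec::new();
        let mut line_start = 0;

        for (index, character) in source_code.char_indices() {
            if character == '\n' {
                new_line_ranges.push(line_start..index);
                line_start = index + 1;
            }
        }

        // the final line might not be terminated by a newline
        new_line_ranges.push(line_start..source_code.len());

        Self {
            source_code,
            source_name,
            new_line_ranges,
        }
    }

    /// Returns the content of the given line (1-based), without the trailing newline
    pub fn line(&self, line_number: usize) -> Option<&str> {
        let range = self.new_line_ranges.get(line_number.checked_sub(1)?)?;
        Some(&self.source_code[range.clone()])
    }
}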

The Pernix Lexer Crate

Before doing the actual code, let's discuss more in-depth Lexical Analysis. Generally speaking, Lexical Analysis is the phase where the source code is fed to a lexer or scanner and converted into a stream of tokens. When working with a lexer, you will likely stumble across these terms: lexeme and token.

  • The lexeme is a term used to describe a sequence of characters matched by a well-defined pattern or rule.
  • The token is a term used to describe the syntactic category that a lexeme belongs to.

For example, the following code would produce a token stream like this:

int main() {
    return 0;
}

TokenKind    Lexeme
----------   ------
IDENTIFIER   int
IDENTIFIER   main
PUNCTUATOR   (
PUNCTUATOR   )
PUNCTUATOR   {
KEYWORD      return
LITERAL      0
PUNCTUATOR   ;
PUNCTUATOR   }
Now that we've understood the basics of the Lexical Analysis phase, let's create a library crate that contains the lexer for the compiler, pernix_lexer.

cargo new --lib compiler/pernix_lexer
[workspace]
members = [
    "compiler/pernix_project",
    "compiler/pernix_lexer",
]

Let's create a new module in this crate named token. This module contains the code related to the lexical token.

Now, let's write an enum with all token-kind values.

/// Enumeration containing all patterns of token
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
    Identifier,
    Space,
    Comment,
    Keyword(Keyword),
    LiteralConstant,
    Punctuator(char),
    EndOfFile,
}

There are seven kinds of tokens defined here; let's discuss each of them one by one.

  • An Identifier is a string of characters used to give a name to a particular symbol or declaration.
  • A Space represents a run of whitespace characters in the source file, such as spaces, tabs, and newlines.
  • A Comment represents a text in the source file that doesn't alter the program's behavior but is meant to be an annotation for the programmer.
  • A Keyword is a string similar to an identifier, but one the language reserves because it has a special meaning that directs particular program behavior; a sketch of the Keyword enum itself follows this list.
  • A Literal Constant is any hard-coded value in the source file, including numbers, booleans, and strings.
  • A Punctuator is a token with syntactic and semantic meaning to the compiler, but the exact significance depends on the context.
  • An EndOfFile signifies the end of the token stream.
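The Keyword variant above wraps a separate Keyword enum that isn't shown in the listing. A minimal sketch might look like the following; only return is reserved at this point, matching the keyword table used later in the post:

/// Enumeration containing all reserved keywords of the language
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Keyword {
    Return,
}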

Now let's write a Token struct, the data structure that the lexer will produce.

use std::ops::Range;

use pernix_project::source_code::SourcePosition;

/// Represents a single token word; containing the kind of the token and its lexeme
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token<'a> {
    token_kind: TokenKind,
    position_range: Range<SourcePosition>,
    lexeme: &'a str,
}

This struct has only three fields: the kind of the token, a string slice representing the lexeme, and the range of positions in the source file that the token spans.
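Since these fields are private, code outside the token module will need accessors to read them. Here is a minimal sketch with hypothetical getter names; the actual API in the repository may differ:

impl<'a> Token<'a> {
    /// Returns the kind of this token
    pub fn token_kind(&self) -> &TokenKind {
        &self.token_kind
    }

    /// Returns the range of source positions this token spans
    pub fn position_range(&self) -> &Range<SourcePosition> {
        &self.position_range
    }

    /// Returns the slice of source text that makes up this token
    pub fn lexeme(&self) -> &'a str {
        self.lexeme
    }
}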

Error Handling

Having a solid error-handling system is crucial when writing a compiler. Errors should carry enough information to report a helpful message to the user.

Let's create a new module called error. This module contains code related to the error-handling system.

Next, create an enum called Error, containing all possible Lexical Error types.

use pernix_project::source_code::{SourceCode, SourcePosition};

/// An enumeration containing all lexical errors
pub enum Error<'a> {
    InvalidCharacter {
        position: SourcePosition,
        character: char,
        source_refernece: &'a SourceCode,
    },
    UnterminatedMultilineComment {
        multiline_comment_position: SourcePosition,
        source_refernece: &'a SourceCode,
    },
}

Each enum variant stores all the information about the error that is needed to report the error message.

Next, implement an associated function that prints out the error message based on the stored information.

Full Code Listing of error.rs

impl<'a> Error<'a> {
    /// Prints the error message to the standard output
    pub fn print_error(&self) { ... }
}

The ellipses here signify that the code is lengthy and will not be included here. However, the complete code listing is provided at the beginning as a link to the GitHub repository.
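As an illustration only, print_error might pattern-match on the variant and use the stored SourceCode reference to show where the error occurred. The sketch below assumes the line method shown earlier; the actual formatting in the repository is more elaborate:

impl<'a> Error<'a> {
    /// Prints the error message to the standard output
    pub fn print_error(&self) {
        match self {
            Error::InvalidCharacter {
                position,
                character,
                source_refernece,
            } => {
                println!(
                    "error: invalid character `{}` at line {}, column {}",
                    character, position.line, position.column
                );
                if let Some(line) = source_refernece.line(position.line) {
                    println!("  | {}", line);
                }
            }
            Error::UnterminatedMultilineComment {
                multiline_comment_position,
                source_refernece,
            } => {
                println!(
                    "error: unterminated multiline comment starting at line {}, column {}",
                    multiline_comment_position.line, multiline_comment_position.column
                );
                if let Some(line) = source_refernece.line(multiline_comment_position.line) {
                    println!("  | {}", line);
                }
            }
        }
    }
}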

Writing the Lexer

Now, back to writing the lexer. I will write the lexer's code in the lib.rs file. The Lexer struct consumes the characters and emits one token at a time -- kind of like an iterator.

Pseudocode of using the lexer:

let mut lexer = Lexer::new(&source_code);
let first_token = lexer.lex();
let second_token = lexer.lex();

Let's write a Lexer struct.

use std::iter::Peekable;
use std::str::CharIndices;

use pernix_project::source_code::{SourceCode, SourcePosition};

/// Represents a lexer in lexical analysis which transforms source code into a
/// token stream
pub struct Lexer<'a> {
    source_code: &'a SourceCode,
    chars: Peekable<CharIndices<'a>>,
    current_position: SourcePosition,
}

It contains a reference to the input source code, an iterator using Peekable<CharIndices>, and a variable for tracking the current lexing position.
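The new constructor used in the pseudocode above isn't shown in the post. A minimal sketch might look like this, assuming a hypothetical source_content accessor on SourceCode that exposes the underlying string; the real crate may expose it differently:

impl<'a> Lexer<'a> {
    /// Creates a new lexer over the given source code, starting at line 1,
    /// column 1
    pub fn new(source_code: &'a SourceCode) -> Self {
        Self {
            source_code,
            // `source_content()` is a hypothetical accessor returning `&str`
            chars: source_code.source_content().char_indices().peekable(),
            current_position: SourcePosition { line: 1, column: 1 },
        }
    }
}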

Before writing the lexing code itself, we need a helper function that eats one character at a time, keeps track of the current source code position, and returns the character value along with its UTF-8 byte position, which is later used to slice the source code when constructing the lexeme.

impl<'a> Lexer<'a> {
    // returns the current character and its byte index, then moves to the
    // next character while keeping `current_position` up to date
    fn next(&mut self) -> Option<(usize, char)> {
        let next = self.chars.next();

        // advance the tracked source position
        if let Some((_, character)) = next {
            if character == '\n' {
                self.current_position.line += 1;
                self.current_position.column = 1;
            } else {
                self.current_position.column += 1;
            }
        }

        next
    }
}
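The pseudocode later in this post also relies on a peek helper that looks at the next character without consuming it. A minimal sketch built on top of the Peekable iterator:

impl<'a> Lexer<'a> {
    // returns the next character without consuming it, if there is one
    fn peek(&mut self) -> Option<char> {
        self.chars.peek().map(|(_, character)| *character)
    }
}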

Next, we need a hash table that stores every reserved keyword in the language. I'll use the lazy_static crate from crates.io for this case.

use std::collections::HashMap;

use lazy_static::lazy_static;

lazy_static! {
    static ref KEYWORD: HashMap<&'static str, Keyword> = {
        let mut map = HashMap::new();
        map.insert("return", Keyword::Return);
        map
    };
}
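With the table in place, classifying a word boils down to a single hash-map lookup. A sketch of how lex presumably uses it, where word is a hypothetical variable holding the identifier-like string just consumed:

// `word` is the identifier-like lexeme the lexer just consumed
let token_kind = match KEYWORD.get(word) {
    Some(keyword) => TokenKind::Keyword(keyword.clone()),
    None => TokenKind::Identifier,
};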

With that, we have everything required to write the lex function. As discussed earlier, it consumes characters and returns the next token.

Full Code Listing of lib.rs

impl<'a> Lexer<'a> {
    /// Lexes the current word; returns the corresponding token and moves to
    /// the next token.
    pub fn lex(&mut self) -> Result<Token<'a>, Error<'a>> { ... }
}

The actual implementation is lengthy, but I'll briefly explain how it works.

First, the function eats the first character by calling the next helper. If that returns None, lex exits immediately and returns the EndOfFile token; otherwise, it proceeds. From this first character, the function has a clue of what kind of token is coming. The character is then matched against the following if-else branches to determine the token kind.
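Sketched in Rust, the overall dispatch might look roughly like the skeleton below. The branch bodies and token construction are elided with todo!(), and the actual lex in the repository differs in detail:

pub fn lex(&mut self) -> Result<Token<'a>, Error<'a>> {
    let first_position = self.current_position;

    // eat the first character; an exhausted stream means the end of the file
    // (`first_index` is the byte offset later used to slice out the lexeme)
    let (first_index, first_char) = match self.next() {
        Some(pair) => pair,
        None => return Ok(/* construct the EndOfFile token */ todo!()),
    };

    if first_char.is_whitespace() {
        // consume the whitespace run -> TokenKind::Space
        todo!()
    } else if first_char == '/' {
        // a comment or a single '/' punctuator
        todo!()
    } else if first_char.is_alphabetic() || first_char == '_' {
        // a keyword or an identifier
        todo!()
    } else if first_char.is_ascii_digit() {
        // a literal constant
        todo!()
    } else if first_char.is_ascii_punctuation() {
        // any other punctuation -> TokenKind::Punctuator
        todo!()
    } else {
        // no rule matches this character
        Err(Error::InvalidCharacter {
            position: first_position,
            character: first_char,
            source_refernece: self.source_code,
        })
    }
}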

Python-like pseudocode is used in the walkthrough below for brevity and conciseness.

  • If the first character is whitespace, the function eats all consecutive whitespace characters that follow, if any. The token is categorized as TokenKind::Space.
if first_char.is_whitespace():   
    while self.peek().is_whitespace():
        self.next()
    return SPACE    
  • If the first character is a forward slash, the function checks whether the next character is a forward slash too. If it is, a comment has been found: the lexer skips all following characters until a newline is located, and the token is classified as TokenKind::Comment. Otherwise, the lexer classifies the single forward slash as TokenKind::Punctuator.
else if first_char == '/':
    if self.peek() == '/':
        while self.peek() != '\n' and self.peek() != END_OF_FILE:
            self.next()
        # eat the '\n'
        self.next()
        return COMMENT
    else:
        return PUNCTUATOR
  • If the first character is a letter or an underscore, the function eats all the following letters and underscores. Next, the function checks whether the string that has just been eaten is in the keyword hash table. If it is, the token is TokenKind::Keyword; otherwise, it is TokenKind::Identifier.
else if first_char.is_alphabet_or_underscore():
    string = first_char
    while self.peek().is_alphabet_or_underscore():
        string += self.next()

    if string in keyword_table:
        return KEYWORD
    else:
        return IDENTIFIER
  • This one is the simplest; if the first character is punctuation, simply categorize the token as TokenKind::Punctuator.
  • If the first character is a numeric digit, the function eats all the next numeric digits -- works similarly to the identifier. The token is classified as TokenKind::LiteralConstant.
else if first_char.is_digit():
    string = first_char
    while self.peek().is_digit():
        string += self.next()

    return LITERAL_CONSTANT

Finally, the last else branch returns the Err variant, signifying that an invalid character was found.

// this character can't be categorized under any token kinds
else {
    return Err(Error::InvalidCharacter {
        position: first_position,
        character: first_char,
        source_refernece: self.source_code,
    });
}
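To tie everything together, here is a small example of driving the lexer until it reaches the end of the file. It assumes the token_kind accessor sketched earlier; the repository's actual API may differ:

fn tokenize<'a>(source_code: &'a SourceCode) -> Vec<Token<'a>> {
    let mut lexer = Lexer::new(source_code);
    let mut tokens = Vec::new();

    loop {
        match lexer.lex() {
            Ok(token) => {
                let reached_end = *token.token_kind() == TokenKind::EndOfFile;
                tokens.push(token);

                if reached_end {
                    break;
                }
            }
            Err(error) => {
                // report the error and keep lexing the rest of the file
                error.print_error();
            }
        }
    }

    tokens
}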

Summary

That's it; we've essentially created a simple lexer in Rust, which can tokenize the source code into a stream of tokens. You can find all the code listings in the GitHub repository.
