Let's create a project for the compiler. It'll be named `pernix-lang`. Inside the `pernix-lang` folder lives the Cargo workspace. I'll separate each compiler phase into its own, distinct library crate.
The Pernix Project Crate
Before going into writing the compiler, let's create a crate called `pernix_project`.
cargo new --lib compiler/pernix_project
Add the newly created crate's path to the `Cargo.toml` workspace file.
[workspace]
members = [
"compiler/pernix_project",
]
The workspace should look something like this:
├── Cargo.toml
├── compiler/pernix_project
│   ├── Cargo.toml
│   └── src
│       └── lib.rs
This crate contains all the code related to the language's project, such as source files, directories, targets, and libraries.
Let's write code for the `pernix_project` crate. I'll create a new module inside the crate named `source_code`. This module contains the `SourceCode` and `SourcePosition` structs; these structs deal with source-file input.
- The `SourceCode` struct manages the content of a source file. It provides utility functions, such as inspecting the source file's content by line number.
- The `SourcePosition` struct consists of two fields: the line and column numbers.
Full Code Listing of source_code.rs
use std::ops::Range;

/// Represents a source code
pub struct SourceCode {
    /// The full content of the source file
    source_code: String,
    /// The name of the source file
    source_name: String,
    /// Pre-computed ranges used to look up lines by number
    new_line_ranges: Vec<Range<usize>>,
}
/// Represents a particular position in the source code, consisting of column
/// and line
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct SourcePosition {
pub line: usize,
pub column: usize,
}
Here's a snippet of the `SourceCode` struct in action.
#[test]
fn test_line() {
let source_code = SourceCode::new(
"Hello, world!\n".to_string()
+ "My Name Is Pernix\n"
+ "I am a programming language\n"
+ "I am written in Rust\n"
+ "\n",
"test".to_string(),
);
assert_eq!(source_code.line(1).unwrap(), "Hello, world!");
assert_eq!(source_code.line(2).unwrap(), "My Name Is Pernix");
assert_eq!(source_code.line(3).unwrap(), "I am a programming language");
assert_eq!(source_code.line(4).unwrap(), "I am written in Rust");
assert_eq!(source_code.line(5).unwrap(), "");
}
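The `new` constructor and the `line` method used in this test aren't shown above. Below is a minimal sketch of how they could work together with the `new_line_ranges` field; it's an assumption consistent with the test, not the repository's exact implementation.

```rust
impl SourceCode {
    /// Creates a new `SourceCode`, pre-computing the byte range of every line
    /// so that `line` can slice the content cheaply later.
    pub fn new(source_code: String, source_name: String) -> Self {
        let mut new_line_ranges = Vec::new();
        let mut line_start = 0;

        for (index, character) in source_code.char_indices() {
            if character == '\n' {
                // the range covers the line's content, excluding the newline
                new_line_ranges.push(line_start..index);
                line_start = index + 1;
            }
        }

        // account for a final line that isn't terminated by a newline
        if line_start < source_code.len() {
            new_line_ranges.push(line_start..source_code.len());
        }

        Self {
            source_code,
            source_name,
            new_line_ranges,
        }
    }

    /// Returns the content of the given 1-based line number, without the
    /// trailing newline, or `None` if the line number is out of range.
    pub fn line(&self, line_number: usize) -> Option<&str> {
        let range = self.new_line_ranges.get(line_number.checked_sub(1)?)?;
        Some(&self.source_code[range.clone()])
    }
}
```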
The Pernix Lexer Crate
Before writing the actual code, let's discuss Lexical Analysis in more depth. Generally speaking, Lexical Analysis is the phase where the source code is fed to a lexer (or scanner) and converted into a stream of tokens. When working with a lexer, you will likely stumble across these terms: lexeme and token.
- The lexeme is a term used to describe a sequence of characters matched by a well-defined pattern or rule.
- The token is a term used to describe the syntactic category that a lexeme belongs to.
For example, the following code would produce a token stream like this:
int main() {
return 0;
}
| TokenKind  | Lexeme |
|------------|--------|
| IDENTIFIER | int    |
| IDENTIFIER | main   |
| PUNCTUATOR | (      |
| PUNCTUATOR | )      |
| PUNCTUATOR | {      |
| KEYWORD    | return |
| LITERAL    | 0      |
| PUNCTUATOR | ;      |
| PUNCTUATOR | }      |
Now that we understand the basics of the Lexical Analysis phase, let's create a library crate that contains the lexer for the compiler: `pernix_lexer`.
cargo new --lib compiler/pernix_lexer
[workspace]
members = [
"compiler/pernix_project",
"compiler/pernix_lexer",
]
Let's create a new module in this crate named `token`. This module contains the code related to the lexical token.
Now, let's write an enum with all token-kind values.
/// Enumeration containing all patterns of token
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind {
Identifier,
Space,
Comment,
Keyword(Keyword),
LiteralConstant,
Punctuator(char),
EndOfFile,
}
There are seven kinds of tokens defined here; let's discuss each of them one by one.
- An Identifier is a string of characters used to give a name to a particular symbol or declaration.
- A Space represents a run of whitespace characters in the source file.
- A Comment represents text in the source file that doesn't alter the program's behavior but is meant as an annotation for the programmer.
- A Keyword is a string, similar to an identifier, but one that the language reserves because it has a special meaning that commands particular program behavior.
- A Literal Constant is any hard-coded value in the source file, including numbers, boolean values, and strings.
- A Punctuator is a token with syntactic and semantic meaning to the compiler, but the exact significance depends on the context.
- An EndOfFile signifies the end of the token stream.
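The `Keyword` type wrapped by `TokenKind::Keyword` is a separate enum. Based on the keyword table introduced later (which only registers `return` at this stage), a minimal version could look like the sketch below; more variants would be added as the language grows.

```rust
/// Enumeration of the language's reserved keywords.
///
/// The derives mirror those on `TokenKind`, since `TokenKind::Keyword`
/// embeds this type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Keyword {
    Return,
}
```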
Now let's write a `Token` struct, the data structure that the lexer will produce.
/// Represents a single token word, containing the kind of the token and its lexeme
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Token<'a> {
    token_kind: TokenKind,
    position_range: Range<SourcePosition>,
    lexeme: &'a str,
}
This struct has only three fields: the kind of the token, a string slice representing the lexeme, and the range of positions in the source file that makes up the token.
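Since the fields are private, the crate would presumably expose read-only accessors so that later compiler phases can inspect a token. A small sketch of such accessors (an assumption, not shown in the article) could be:

```rust
impl<'a> Token<'a> {
    /// Returns the kind of this token.
    pub fn token_kind(&self) -> &TokenKind {
        &self.token_kind
    }

    /// Returns the range of source positions this token spans.
    pub fn position_range(&self) -> &Range<SourcePosition> {
        &self.position_range
    }

    /// Returns the slice of the source code that makes up this token.
    pub fn lexeme(&self) -> &'a str {
        self.lexeme
    }
}
```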
Error Handling
Having a sophisticated error-handling system is crucial when writing a compiler. It should gather helpful information and report a useful message to the user.
Let's create a new module called `error`. This module contains code related to the error-handling system.
Next, create an enum called `Error`, containing all possible lexical error types.
/// An enumeration containing all lexical errors
pub enum Error<'a> {
    InvalidCharacter {
        position: SourcePosition,
        character: char,
        source_reference: &'a SourceCode,
    },
    UnterminatedMultilineComment {
        multiline_comment_position: SourcePosition,
        source_reference: &'a SourceCode,
    },
}
Each enum variant stores all the necessary information about the error, which can be used to report the error message.
Next, implement an associated function that prints out the error message based on the stored information.
Full Code Listing of error.rs
impl<'a> Error<'a> {
/// Prints the error message to the standard output
pub fn print_error(&self) { ... }
}
The ellipsis here signifies that the code is lengthy and will not be included here. However, the complete code listing is provided at the beginning as a link to the GitHub repository.
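As a rough illustration of the idea only, a simplified `print_error` could match on the variants and print the position together with the offending source line, using the `SourceCode::line` method from earlier; the repository's version is more elaborate.

```rust
impl<'a> Error<'a> {
    /// Prints the error message to the standard output (simplified sketch)
    pub fn print_error(&self) {
        match self {
            Error::InvalidCharacter {
                position,
                character,
                source_reference,
            } => {
                println!(
                    "error: invalid character `{}` at line {}, column {}",
                    character, position.line, position.column
                );
                if let Some(line) = source_reference.line(position.line) {
                    println!("  | {}", line);
                }
            }
            Error::UnterminatedMultilineComment {
                multiline_comment_position,
                source_reference,
            } => {
                println!(
                    "error: unterminated multi-line comment starting at line {}, column {}",
                    multiline_comment_position.line, multiline_comment_position.column
                );
                if let Some(line) =
                    source_reference.line(multiline_comment_position.line)
                {
                    println!("  | {}", line);
                }
            }
        }
    }
}
```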
Writing the Lexer
Now, back to writing the lexer. I will write the lexer's code in the `lib.rs` file. The `Lexer` struct consumes the characters and emits one token at a time -- kind of like an iterator.
Pseudocode for using the lexer:
let mut lexer = Lexer::new(&source_code);
let first_token = lexer.lex();
let second_token = lexer.lex();
Let's write the `Lexer` struct.
/// Represents a lexer in lexical analysis which transforms source code into a
/// token stream
pub struct Lexer<'a> {
source_code: &'a SourceCode,
chars: Peekable<CharIndices<'a>>,
current_position: SourcePosition,
}
It contains a reference to the input source code, a `Peekable<CharIndices>` iterator over the characters, and a variable for tracking the current lexing position.
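The usage sketch above calls `Lexer::new`, which isn't listed here. A minimal constructor could look like the following; note that the `content()` accessor on `SourceCode` is assumed for the sake of the example (the real crate may expose the underlying string differently).

```rust
impl<'a> Lexer<'a> {
    /// Creates a new lexer over the given source code, starting at line 1,
    /// column 1.
    pub fn new(source_code: &'a SourceCode) -> Self {
        Self {
            source_code,
            // `content()` is a hypothetical accessor returning `&str`
            chars: source_code.content().char_indices().peekable(),
            current_position: SourcePosition { line: 1, column: 1 },
        }
    }
}
```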
Before writing the lexing code, we need a helper function that eats one character at a time, keeps track of the current source code position, and returns the character value along with its UTF-8 byte position, which is later used to slice the source code when constructing the lexeme.
impl<'a> Lexer<'a> {
    // Returns the current character and its byte index, and moves on to the
    // next character.
    fn next(&mut self) -> Option<(usize, char)> {
        let next = self.chars.next();

        // advance the tracked source position
        if let Some((_, character)) = next {
            if character == '\n' {
                self.current_position.line += 1;
                self.current_position.column = 1;
            } else {
                self.current_position.column += 1;
            }
        }

        next
    }
}
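The pseudocode in the walkthrough below also relies on a `peek` helper that looks at the upcoming character without consuming it. With the `Peekable` iterator already stored in the struct, a sketch of it could be as simple as:

```rust
impl<'a> Lexer<'a> {
    // Returns the upcoming character without consuming it, or `None` at the
    // end of the input.
    fn peek(&mut self) -> Option<char> {
        self.chars.peek().map(|(_, character)| *character)
    }
}
```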
Next, we need a hash table that stores every reserved keyword in the language. I'll use the `lazy_static` crate from crates.io for this.
use std::collections::HashMap;

use lazy_static::lazy_static;

lazy_static! {
    static ref KEYWORD: HashMap<&'static str, Keyword> = {
        let mut map = HashMap::new();
        map.insert("return", Keyword::Return);
        map
    };
}
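Once the lexer has eaten an identifier-like word, it can consult this table to decide between a keyword and a plain identifier. A small hypothetical helper illustrating the lookup:

```rust
/// Classifies a lexed word as either a reserved keyword or an identifier.
fn classify_word(word: &str) -> TokenKind {
    match KEYWORD.get(word) {
        Some(keyword) => TokenKind::Keyword(keyword.clone()),
        None => TokenKind::Identifier,
    }
}
```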
After that, we have everything required to write the `lex` function. As discussed earlier, it moves through the characters and returns the next token.
Full Code Listing of lib.rs
impl<'a> Lexer<'a> {
/// Lexes the current word; returns the corresponding token and moves to
/// the next token.
pub fn lex(&mut self) -> Result<Token<'a>, Error<'a>> { ... }
}
The actual implementation is lengthy, but I'll briefly explain how the function is implemented.
First, the function eats the first character by calling the `next` helper and moves forward. If it returns `None`, the `lex` function exits immediately and returns the `EndOfFile` token. Otherwise, it proceeds. From the given first character, the function now has a clue of what kind of token this will be. The character is then matched against the following if-else branches to determine the token kind.
Python-like pseudocode is used here for brevity and conciseness.
- If the first character is whitespace, the function eats all the following consecutive whitespace characters, if there are any. The token is categorized as `TokenKind::Space`.
if first_char.is_whitespace():
    while self.peek().is_whitespace():
        self.next()
    return SPACE
- If the first character is a forward slash, the function checks whether the next character is a forward slash too. If it is, a comment has been found: the lexer skips all the following characters until a newline is reached, and the token is classified as `TokenKind::Comment`. Otherwise, the lexer classifies the single forward slash as `TokenKind::Punctuator`.
else if first_char == '/':
    if self.peek() == '/':
        while self.peek() != '\n' and self.peek() != END_OF_FILE:
            self.next()
        # eat the '\n'
        self.next()
        return COMMENT
    else:
        return PUNCTUATOR
- If the first character is a letter or an underscore, the function eats all the following letters and underscores. Next, the function checks whether the string that has just been eaten is in the keyword hash table. If it is, the token is `TokenKind::Keyword`; otherwise, it is `TokenKind::Identifier`.
else if first_char.is_alphabet_or_underscore():
    string = first_char
    while self.peek().is_alphabet_or_underscore():
        string += self.next()
    if string in keyword_table:
        return KEYWORD
    else:
        return IDENTIFIER
- This one is the simplest: if the first character is a punctuation character, simply categorize the token as `TokenKind::Punctuator`.
- If the first character is a numeric digit, the function eats all the following numeric digits -- this works similarly to the identifier case. The token is classified as `TokenKind::LiteralConstant`.
else if first_char.is_digit():
    string = first_char
    while self.peek().is_digit():
        string += self.next()
    return LITERAL_CONSTANT
Finally, the last else branch returns the `Err` variant, signifying that an invalid character was found.
// this character can't be categorized under any token kinds
else {
    return Err(Error::InvalidCharacter {
        position: first_position,
        character: first_char,
        source_reference: self.source_code,
    });
}
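To tie everything together, a small driver could repeatedly call `lex` until it reaches the `EndOfFile` token. The sketch below assumes the `token_kind()` accessor mentioned earlier; it's an illustration rather than part of the original code.

```rust
/// Tokenizes an entire source file into a vector of tokens, stopping after
/// the `EndOfFile` token (or bailing out on the first lexical error).
fn tokenize<'a>(source_code: &'a SourceCode) -> Result<Vec<Token<'a>>, Error<'a>> {
    let mut lexer = Lexer::new(source_code);
    let mut tokens = Vec::new();

    loop {
        let token = lexer.lex()?;
        let is_end_of_file = matches!(token.token_kind(), TokenKind::EndOfFile);
        tokens.push(token);

        if is_end_of_file {
            break;
        }
    }

    Ok(tokens)
}
```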
Summary
That's it; we've essentially created a simple lexer in Rust, which can tokenize the source code into a stream of tokens. You can find all the code listings in the GitHub repository.