Crafting a Compiler in Rust: Syntactic Analysis

In this post, I will introduce you to the basics of Syntactic Analysis. Moreover, we'll write a basic parser that takes the token stream and converts it into Abstract Syntax Trees.


Update on the Lexer

Before going into writing the parser, let me inform you about changes I've made to the Lexer🌟.

  • Now, the Lexer supports two more literal constant categories: Float and Boolean.

  • Literal suffixes are now supported. For example, in the code let number = 0i64, the literal 0i64 has a suffix of i64 and a value of 0.

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LiteralConstantType<'a> {
    Integer {
        value: &'a str,
        literal_suffix: Option<&'a str>,
    },
    Float {
        value: &'a str,
        literal_suffix: Option<&'a str>,
    },
    Boolean(bool),
}

/// Enumeration containing all patterns of token
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TokenKind<'a> {
    Identifier,
    Space,
    Comment,
    Keyword(Keyword),
    LiteralConstant(LiteralConstantType<'a>),
    Punctuator(char),
    EndOfFile,
}
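
To make the new token shape concrete, here is a small, hedged sketch of how downstream code might inspect a lexed token kind for an integer literal and its suffix. The describe_literal function is purely illustrative and not part of the crate:

/// Illustrative only: prints a description of an integer literal token kind.
fn describe_literal(kind: &TokenKind) {
    if let TokenKind::LiteralConstant(LiteralConstantType::Integer {
        value,
        literal_suffix,
    }) = kind
    {
        match literal_suffix {
            Some(suffix) => println!("integer literal `{value}` with suffix `{suffix}`"),
            None => println!("integer literal `{value}` with no suffix"),
        }
    }
}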

The full code changes to the pernix_lexer module are available.


Syntactic Analysis Basics

As mentioned in the introduction, Syntactic Analysis picks up where Lexical Analysis left off. But what does it do?

It takes the token stream produced in the earlier phase and turns it into a meaningful data structure called an Abstract Syntax Tree (AST). Additionally, this is the phase where all syntactic errors are caught.

The Abstract Syntax Tree

The Abstract Syntax Tree is a tree representation of the code fed to the compiler. It contains only the crucial information of the code and abstracts away the syntactic details of the language.

Take the following code snippets as an example:

function add(a, b) {
    return a + b;
}

def add(a, b):
    return a + b

The snippets above are JavaScript and Python; both functions behave precisely the same. If both code snippets are parsed, they will produce an AST similar to this:

[Image: The Abstract Syntax Tree representation of the add function]

In JSON representation

{
  "function": {
    "parameters": [
      "a",
      "b"
    ],
    "statements": [
      {
        "type": "ReturnStatement",
        "expression": {
          "type": "BinaryOperatorExpression",
          "left": {
            "type": "VariableExpression",
            "variable": "a"
          },
          "operator": "+",
          "right": {
            "type": "VariableExpression",
            "variable": "b"
          }
        }
      }
    ]
  }
}

The AST produced contains all the crucial information about the function.

  • Its parameter names and count.
  • The statements inside the function body: a single return statement, which returns the expression a + b.

Even though JavaScript and Python have different syntaxes, the ASTs produced here would be similar or identical, since an AST doesn't capture the language's syntactic details. That's why it's called an Abstract Syntax Tree.


The Goal

Before diving into writing the parser, it's necessary to scope the features of the parser. What features will be included? What does the code that will be parsed look like?

Therefore, I made a reference snippet of code to guide us when writing the parser:

using Simmypeet.Math;

namespace Simmypeet {
    namespace Math {
        int32 Add(int32 a, int32 b) {
            return a + b;
        }

        int32 Subtract(int32 a, int32 b) {
            return a - b;
        }
    }

    int32 Fibonacci(int32 n) {
        if (n == 0) {
            return 0;
        }
        else if (n == 1) {
            return 1;
        } 
        else {
            return Add(Fibonacci(n - 1), Fibonacci(n - 2));
        }
    }
}

The code above is quite similar to C# 😁. It also tells us a lot about what features we have to implement.

  • If-else statement
  • Namespace
  • Using declaration (bringing the namespace scope)
  • Function call
  • Arithmetic operation
  • Comparison
  • Function declaration

The ultimate goal is to implement a parser that correctly parses the above code. Now, it's time to write some code and get our hands dirty!


The pernix_parser Crate

Let's start by creating a new crate named pernix_parser.

cargo new --lib compiler/pernix_parser

Then, add the new crate to the workspace's Cargo.toml so it builds with the rest of the project:

[workspace]
members = [
    "compiler/pernix_project",
    "compiler/pernix_lexer",
    "compiler/pernix_parser",
]

Before implementing the actual parser, we first need to write the Abstract Syntax Tree data structures.

The abstract_syntax_tree Module

We'll create a new module called abstract_syntax_tree in the pernix_parser crate. This module holds all data structures relating to the Abstract Syntax Tree.

When producing an AST, it would be beneficial to record the range of source code that each node was parsed from. That way, in later phases, when errors occur, the error messages can point to the position in the source code that produced them.

Therefore, in pernix_parser/abstract_syntax_tree.rs, we'll write a wrapper struct consisting of two fields: the abstract syntax tree value and its position range.

/// A wrapper around a value that also contains the position of the value in the
/// source code
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PositiionWrapper<T> {
    pub position: Range<SourcePosition>,
    pub value: T,
}

Next, let's write an enum that holds all the values of binary operators.

/// List of all available binary operators
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BinaryOperator {
    Add,
    Subtract,
    Asterisk,
    Slash,
    Percent,
    LessThan,
    GreaterThan,
    LessThanEqual,
    GreaterThanEqual,
    Equal,
    NotEqual,
    And,
    Or,
}

Next, write an enum that contains all the values of unary operators.

/// List of all available unary operators
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnaryOperator {
    Plus,
    Minus,
    LogicalNot,
}

Next, include another enum with all the ways to refer to a type.

/// An enumeration containing all kinds of types referenced in the program
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Type {
    Identifier(String),
    Array {
        element_type: Box<Type>,
        size: usize,
    },
}

For example, int32[5], an array of five int32s, would produce a Type value similar to this:

Type::Array {
    element_type: Box::new(Type::Identifier("int32".to_string())),
    size: 5
}
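
As a quick, hedged illustration of how such a value can be consumed, here is a tiny helper that walks a Type and renders it back into source-like text. The render_type function is hypothetical and exists only to show the recursion over Box<Type>:

/// Illustrative only: renders a `Type` back into source-like text, e.g.
/// `int32[5]` for the array example above.
fn render_type(ty: &Type) -> String {
    match ty {
        Type::Identifier(name) => name.clone(),
        Type::Array { element_type, size } => {
            format!("{}[{}]", render_type(element_type), size)
        }
    }
}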

Next, let's create a submodule inside the abstract_syntax_tree module called expression. This submodule contains all the code related to the expression Abstract Syntax Tree.

An expression is a piece of code that can be evaluated to yield a value. For example, in the statement x = 4 + 3, the 4 + 3 part is an expression.

Now, let's write an enum containing all the forms of an expression.

/// Enumeration containing all possible expressions
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Expression<'a> {
    /// Represents an expression of the form `left operator right`
    BinaryExpression {
        left: Box<PositiionWrapper<Expression<'a>>>,
        operator: PositiionWrapper<BinaryOperator>,
        right: Box<PositiionWrapper<Expression<'a>>>,
    },

    /// Represents an expression of the form `operator operand`
    UnaryExpression {
        operator: PositiionWrapper<UnaryOperator>,
        operand: Box<PositiionWrapper<Expression<'a>>>,
    },

    /// Represents an expression of the form `literal`
    LiteralExpression(LiteralConstantType<'a>),

    /// Represents an expression of the form `identifier`
    IdentifierExpression {
        identifier: PositiionWrapper<String>,
    },

    /// Represents an expression of the form `function_name(arguments)`
    FunctionCallExpression {
        function_name: PositiionWrapper<String>,
        arguments: Vec<PositiionWrapper<Expression<'a>>>,
    },
}
  • The BinaryExpression variant consists of two operand expressions separated by one binary operator, such as 4 + 5.
  • The UnaryExpression variant consists of an operand expression and a unary operator, for example, !true.
  • The LiteralExpression variant is straightforward: an expression that yields a hardcoded literal value.
  • The IdentifierExpression refers to a reference to a variable.
  • The FunctionCallExpression refers to a call to a specific function by name and by providing arguments.
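
To get a feel for how these trees are traversed, here is a short, hedged sketch of a helper that walks an Expression and renders it back to text. It relies only on the types and derives defined above; the function itself (and its output, which uses the operators' Debug names) is illustrative and not part of the crate:

/// Illustrative only: renders an `Expression` tree back into text, e.g.
/// `(a Add b)` for a binary addition.
fn render_expression(expression: &Expression) -> String {
    match expression {
        Expression::BinaryExpression { left, operator, right } => format!(
            "({} {:?} {})",
            render_expression(&left.value),
            operator.value,
            render_expression(&right.value)
        ),
        Expression::UnaryExpression { operator, operand } => {
            format!("({:?} {})", operator.value, render_expression(&operand.value))
        }
        Expression::LiteralExpression(literal) => format!("{:?}", literal),
        Expression::IdentifierExpression { identifier } => identifier.value.clone(),
        Expression::FunctionCallExpression { function_name, arguments } => format!(
            "{}({})",
            function_name.value,
            arguments
                .iter()
                .map(|argument| render_expression(&argument.value))
                .collect::<Vec<_>>()
                .join(", ")
        ),
    }
}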

Next, let's write all the variants of the statement.

/// Enumeration containing all kinds of statements
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Statement<'a> {
    /// Represents a statement of the form `return expression;`
    ReturnStatement {
        expression: PositiionWrapper<Expression<'a>>,
    },

    /// Represents a statement of the form `expression;`
    ExpressionStatement {
        expression: PositiionWrapper<Expression<'a>>,
    },

    /// Represents a statement of the form `let identifier = expression;`
    VariableDeclarationStatement {
        identifier: PositiionWrapper<String>,
        expression: PositiionWrapper<Expression<'a>>,
    },

    /// Represents an if statement of the form `if (condition) then_statement
    /// else else_statement`
    IfStatement {
        condition: PositiionWrapper<Expression<'a>>,
        then_statement: Vec<PositiionWrapper<Statement<'a>>>,
        else_statement: Option<Vec<PositiionWrapper<Statement<'a>>>>,
    },
}

A statement is a fragment of code that carries out a specific action in the program.

  • The ReturnStatement variant represents a statement that returns a value from a function.
  • The ExpressionStatement variant represents a statement that evaluates an expression.
  • The VariableDeclarationStatement variant represents a statement that declares a new variable.
  • The IfStatement variant represents a statement that executes a block of code if a condition is true. It can also execute a different block of code if the condition is false.

Next, let's create a new submodule inside the abstract_syntax_tree module called declaration. This module holds all the AST code related to all the declarations that you would define in a program, such as namespace declarations, functions, and classes.

/// A declaration is a statement that declares a new name in the current scope.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Declaration<'a> {
    /// Represents a namespace declaration of the form `namespace namespace_name { declarations* }`
    NamespaceDeclaration {
        namespace_name: PositiionWrapper<String>,
        declarations: Vec<PositiionWrapper<Declaration<'a>>>,
    },

    /// Represents a function declaration of the form `type function_name(parameters) { statements* }`
    FunctionDeclaration {
        function_name: PositiionWrapper<String>,
        parameters: Vec<PositiionWrapper<String>>,
        return_type: PositiionWrapper<Type>,
        body: Vec<PositiionWrapper<Statement<'a>>>,
    },

    /// Represents a namespace using statement of the form `using namespace_name;`
    UsingStatement{
        namespace_name: PositiionWrapper<String>,
    },
}
  • The NamespaceDeclaration variant represents a declaration that declares a new namespace and contains other declarations inside it.
  • The FunctionDeclaration variant represents a declaration that declares a new function and contains a list of statements inside it.
  • The UsingStatement variant represents a declaration that imports a namespace into the current scope.

The error Module

Now, it's time to implement an error module. This module will contain all the error types that we will use in the parser.

/// Enumeration containing all possible contexts that the error can occur in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Context {
    Program,
    Statement,
    Expression,
    Namespace,
}

/// Enumeration containing all possible errors that can occur during parsing.
#[derive(Debug, Clone)]
pub enum Error<'a> {
    LexicalError(pernix_lexer::error::Error<'a>),
    KeywordExpected {
        expected_keyword: Keyword,
        found_token: Token<'a>,
        source_reference: &'a SourceCode,
    },
    IdentifierExpected {
        found_token: Token<'a>,
        source_reference: &'a SourceCode,
    },
    PunctuatorExpected {
        expected_punctuator: char,
        found_token: Token<'a>,
        source_reference: &'a SourceCode,
    },
    UnexpectedToken {
        context: Context,
        found_token: Token<'a>,
        source_reference: &'a SourceCode,
    },
}
  • The Context enumeration contains all the possible contexts in which the error can occur. For example, if the error occurs in the Program context, it means that the error occurred at the top level of the program.

  • The Error enumeration contains all the possible errors that can occur during parsing. The LexicalError variant is used to wrap the error that occurred during the lexical analysis. The other variants are used to represent the errors that occurred during the parsing.
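
The post doesn't show how these errors are reported to the user, but a rough, hedged sketch of turning them into human-readable messages might look like the following. The message wording and the message method are illustrative; the only assumption is the Token::lexeme() accessor from the lexer crate:

impl<'a> Error<'a> {
    /// Illustrative only: renders an error into a human-readable message.
    pub fn message(&self) -> String {
        match self {
            Error::LexicalError(_) => "a lexical error occurred".to_string(),
            Error::KeywordExpected { expected_keyword, found_token, .. } => format!(
                "expected the `{:?}` keyword, found `{}`",
                expected_keyword,
                found_token.lexeme()
            ),
            Error::IdentifierExpected { found_token, .. } => {
                format!("expected an identifier, found `{}`", found_token.lexeme())
            }
            Error::PunctuatorExpected { expected_punctuator, found_token, .. } => format!(
                "expected `{}`, found `{}`",
                expected_punctuator,
                found_token.lexeme()
            ),
            Error::UnexpectedToken { context, found_token, .. } => format!(
                "unexpected token `{}` in {:?} context",
                found_token.lexeme(),
                context
            ),
        }
    }
}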


Writing The Parser

The parser is a state machine that behaves similarly to the lexer: it turns the token stream into a meaningful AST data structure. The parser we will write is a combination of Recursive Descent Parsing and Operator-Precedence Parsing.

/// Represents a struct that parses the given token stream into abstract syntax
/// trees
pub struct Parser<'a> {
    // The lexer that is used to generate the token stream
    lexer: Lexer<'a>,
    // the accumulated tokens that have been tokenized by the lexer so far
    accumulated_tokens: Vec<Token<'a>>,
    // the accumulated errors that have been found during parsing so far
    accumulated_errors: Vec<Error<'a>>,
    // The current index into the accumulated token stream
    current_position: usize,
    // Flag that indicates whether the parser should produce errors into the
    // list or not
    produce_errors: bool,
}

Let me explain the fields of the Parser struct:

  • The lexer field is used to generate the token stream. It's a struct that we've already implemented in the previous post.
  • The accumulated_tokens field is used to store the tokens that have been tokenized by the lexer so far. The parser will use this field to get the token already tokenized by the lexer.
  • The accumulated_errors field stores the errors found during parsing.
  • The current_position field stores the current token stream index.
  • The produce_errors field indicates whether the parser should produce errors into the list or not. This field prevents the parser from producing errors when recovering from an error.

Let's create a new constructor for the Parser struct.

impl<'a> Parser<'a> {
    /// Creates a new parser instance from the given source code
    pub fn new(source_code: &'a SourceCode) -> Self {
        let mut lexer = Lexer::new(source_code);
        let mut accumulated_errors = Vec::new();

        // get the first token
        let accumulated_tokens = vec![{
            let token_return;
            loop {
                match lexer.lex() {
                    Ok(token) => {
                        token_return = token;
                        break;
                    }
                    Err(err) => {
                        accumulated_errors.push(Error::LexicalError(err));
                    }
                }
            }
            token_return
        }];

        Self {
            lexer,
            accumulated_tokens,
            accumulated_errors,
            current_position: 0,
            produce_errors: true,
        }
    }
}

The constructor creates a new Lexer instance from the source code, lexes the first token (collecting any lexical errors along the way), stores it in the accumulated_tokens field, and finally builds and returns the Parser instance.

Now, let's implement the next function. This function will return the current token and move the current_position field to the next token. If the current_position field is equal to the length of the accumulated_tokens field, it means that the parser has reached the end of the token stream. In this case, the function will get the next token from the lexer and store it in the accumulated_tokens field. Moreover, if the lexer encounters an error, the function will store the error in the accumulated_errors field.

impl<'a> Parser<'a> {
    // Gets the current token and moves the current position forward
    pub(crate) fn next(&mut self) -> &Token<'a> {
        // need to generate more tokens
        if self.current_position == self.accumulated_tokens.len() - 1 {
            let new_token;
            loop {
                match self.lexer.lex() {
                    Ok(token) => {
                        new_token = token;
                        break;
                    }
                    Err(err) => {
                        self.accumulated_errors.push(Error::LexicalError(err));
                    }
                }
            }

            // append the new token to the accumulated tokens
            self.accumulated_tokens.push(new_token);
        }
        // panic if the current position is greater than or equal to the length
        else if self.current_position >= self.accumulated_tokens.len() {
            panic!("current position is greater than or equal the length of the accumulated tokens");
        }

        // increment the current position
        self.current_position += 1;

        &self.accumulated_tokens[self.current_position - 1]
    }
}

Now, let's implement the peek function. This function will simply index the
accumulated_tokens field with the current_position field. It will not
move the position forward.

impl<'a> Parser<'a> {
    // Gets the current token without moving the current position forward
    fn peek(&self) -> &Token<'a> {
        &self.accumulated_tokens[self.current_position]
    }

    // Gets the token back by one position without moving the current position
    fn peek_back(&self) -> &Token<'a> {
        &self.accumulated_tokens[self.current_position - 1]
    }
}

Before going into writing the parsing functions, let's discuss insignificant tokens. In the previous post, we implemented a lexer that turns comments and whitespace into tokens. However, in most cases, the syntax doesn't care about this whitespace or these comments. For example, the following code is valid in the language.

let x =   5;          // extra spaces? it's okay!
let x=5;              // no spaces? it's okay too!
let x = /*hello?*/ 5; // comment in the middle? that's fine!

However, in some cases, whitespace can be crucial. For example, the only difference between the comparison operator == and two consecutive assignment operators = = is the whitespace between them. In the following code, the first line assigns the value 5 to the variable x; the second line compares x with 5; the rest are invalid.

x = 5;            // yes
x == 5;           // yeah
x = = 5;          // ???
x =/*hello?*/= 5; // what?

Therefore, we need to implement a function that can skip insignificant tokens. The function will be called move_to_significant and implemented as follows.

impl<'a> Parser<'a> {
    /// Moves the current position to the next significant token
    pub(crate) fn move_to_significant(&mut self) {
        while !self.peek().is_significant_token() {
            self.next();
        }
    }
}

On top of that, let's implement a function that moves to the next significant token, returns it, and advances the position.

impl<'a> Parser<'a> {
    // Moves the current position to the next significant token and emits it
    pub(crate) fn next_significant(&mut self) -> &Token<'a> {
        self.move_to_significant();
        self.next()
    }
}

If you are wondering how is_significant_token is implemented, here it is.

impl<'a> Token<'a> {
    /// Returns a boolean indicating whether this [`Token`] is a significant.
    /// Insignificant tokens are spaces and comments.
    pub fn is_significant_token(&self) -> bool {
        match self.token_kind() {
            TokenKind::Space | TokenKind::Comment => false,
            _ => true,
        }
    }
}

Next, let's implement the function that appends an error to the accumulated_errors field if the produce_errors field is true.

impl<'a> Parser<'a> {
    // Appends the given error to the list of accumulated errors if the
    // `produce_errors` flag is set to true. It always returns `None`, so
    // callers can bail out with `return self.append_error(...)`.
    #[must_use]
    pub fn append_error<T>(&mut self, error: Error<'a>) -> Option<T> {
        if self.produce_errors {
            self.accumulated_errors.push(error);
        }

        None
    }
}

Now that we've got the basic functions, let's implement the parsing function.

Parsing The Qualified Name

Let's start with something simple: the function that parses a qualified name. A qualified name is a scope's name, such as Simmypeet.Compiler.Parser. It follows the pattern of an identifier followed by a dot, followed by another identifier, and so on.

pub fn parse_qualified_name(&mut self) -> Option<PositiionWrapper<String>> {
    self.move_to_significant();
    let starting_position = self.peek().position_range().start;

    // the string to return
    let mut string = String::new();

    // expect the first identifier
    if let TokenKind::Identifier = self.next().token_kind() {
        string.push_str(self.peek_back().lexeme());
    } else {
        return self.append_error(Error::IdentifierExpected {
            found_token: self.peek_back().clone(),
            source_reference: self.source_code(),
        });
    }

    // found additional scopes
    while matches!(self.peek().token_kind(), TokenKind::Punctuator('.')) {
        // eat the dot
        self.next();

        string.push('.');

        // expect the next identifier
        if let TokenKind::Identifier = self.next().token_kind() {
            string.push_str(self.peek_back().lexeme());
        } else {
            return self.append_error(Error::IdentifierExpected {
                found_token: self.peek_back().clone(),
                source_reference: self.source_code(),
            });
        }
    }

    Some(PositiionWrapper {
        position: starting_position..self.peek_back().position_range().end,
        value: string,
    })
}

The function first moves the position to the next significant token, since the parser's position may not currently be at a significant token. Then, it records the starting position of the qualified name. This pattern will be used in the rest of the parsing functions.

It then expects the first identifier and stores it in the string variable. If the first token is not an identifier, it records an error and returns None.

Finally, it checks whether a dot follows the identifier. If it does, it consumes the dot, expects another identifier, and appends both to the string variable; it then checks again whether the next token is a dot and repeats the process. When no dot follows, it returns the string variable wrapped together with its position range.
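
As a quick, hedged usage sketch, this is how the function could be driven end to end. The first_qualified_name helper is hypothetical; it relies only on Parser::new and parse_qualified_name as defined above:

/// Illustrative only: parses the first qualified name found in a source file,
/// e.g. `Some("Simmypeet.Math".to_string())`.
fn first_qualified_name(source_code: &SourceCode) -> Option<String> {
    let mut parser = Parser::new(source_code);
    parser
        .parse_qualified_name()
        .map(|qualified_name| qualified_name.value)
}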

Parsing The Using Statement

Next, let's implement the function that parses the using statement. This statement builds on the qualified-name parser we implemented in the previous section. The parser will expect the using keyword, then a qualified name, and finally a semicolon.

/// Parses the current token stream as a using_statement
///
/// Using_Statement:
///   `using` Qualified_Name `;`
pub fn parse_using_statement(
    &mut self,
) -> Option<PositiionWrapper<Declaration<'a>>> {
    // move to the first significant token
    self.move_to_significant();
    let starting_position = self.peek().position_range().start;

    // expect the `using` keyword
    if !matches!(
        self.next().token_kind(),
        TokenKind::Keyword(Keyword::Using)
    ) {
        return self.append_error(Error::KeywordExpected {
            expected_keyword: Keyword::Using,
            found_token: self.peek_back().clone(),
            source_reference: self.source_code(),
        });
    }

    let qualified_name = self.parse_qualified_name()?;

    // expect the semicolon
    if !matches!(
        self.next_significant().token_kind(),
        TokenKind::Punctuator(';')
    ) {
        return self.append_error(Error::PunctuatorExpected {
            expected_punctuator: ';',
            found_token: self.peek_back().clone(),
            source_reference: self.source_code(),
        });
    }

    Some(PositiionWrapper {
        position: starting_position..self.peek_back().position_range().end,
        value: Declaration::UsingStatement {
            namespace_name: qualified_name,
        },
    })
}

Parsing The Namespace Declaration

This one is a bit more complicated. Before we start, let's settle on an error recovery strategy. A naive approach would be to stop parsing as soon as the parser encounters an error. However, this is not a good idea, as it discards the rest of the possibly valid tokens. Instead, we can use an error recovery strategy to continue parsing the rest of the tokens after the error.

We'll use the Panic Mode Error Recovery strategy in our case. In this strategy, after the parser encounters an error, it looks for the next valid token and continues working from there, discarding the invalid tokens. This is one of the most widely used error recovery strategies, as it is simple to implement and effective.

Let's put ourselves in the parser's shoes. Consider the following code:

namespace simmypeet {
    using something;
    using something_else;

    1234 420 // Oops! this should not be here

    namespace name {}
}

As you can see, there are number literals in the middle of the namespace declaration. When the parser encounters them, it records an error saying something like "unexpected token: a number literal is not expected here". Then, it looks for the next valid token, which is the namespace keyword. Our parser will not simply resume from the 1234 token, only to encounter the 420 token and record yet another error. Instead, it skips the 1234 and 420 tokens and continues parsing from the next namespace name declaration.

Now that we've settled on the error recovery strategy, let's implement the namespace declaration parsing function. The parser will expect the namespace keyword, followed by a qualified name, followed by the opening curly brace.

/// Parses the current token stream as a namespace declaration
///
/// Namespace_Declaration:
///    `namespace` Qualified_Name `{` Declaration* `}`
pub fn parse_namespace_declaration(
    &mut self,
) -> Option<PositiionWrapper<Declaration<'a>>> {
    // move to the first significant token
    self.move_to_significant();
    let starting_position = self.peek().position_range().start;

    // expect the `namespace` keyword
    if !matches!(
        self.next().token_kind(),
        TokenKind::Keyword(Keyword::Namespace)
    ) {
        return self.append_error(Error::KeywordExpected {
            expected_keyword: Keyword::Namespace,
            found_token: self.peek_back().clone(),
            source_reference: self.source_code(),
        });
    }

    let namespace_name = self.parse_qualified_name()?;

    // expect the opening curly bracket
    if !matches!(
        self.next_significant().token_kind(),
        TokenKind::Punctuator('{')
    ) {
        return self.append_error(Error::PunctuatorExpected {
            expected_punctuator: '{',
            found_token: self.peek_back().clone(),
            source_reference: self.source_code(),
        });
    }

    /* More to come ... */
}

Next, we'll create a new Vec to store the declarations inside the namespace. Then, we'll start a loop that parses the declarations inside the namespace. The loop terminates when the parser encounters the closing curly brace or reaches the end of the file.

/// Parses the current token stream as a namespace declaration
///
/// Namespace_Declaration:
///    `namespace` Qualified_Name `{` Declaration* `}`
pub fn parse_namespace_declaration(
    &mut self,
) -> Option<PositiionWrapper<Declaration<'a>>> {
    /* From the previous code ... */

    // loop through all the declarations
    let mut declarations = Vec::new();

    // move to the next significant token
    self.move_to_significant();

    // loop through all the declarations until closing curly bracket or
    // EOF is found
    while !matches!(
        self.peek().token_kind(),
        TokenKind::Punctuator('}') | TokenKind::EndOfFile
    ) {
        let declaration = match self.peek().token_kind() {
            TokenKind::Keyword(Keyword::Namespace) => {
                self.parse_namespace_declaration()
            }
            TokenKind::Keyword(Keyword::Using) => {
                self.parse_using_statement()
            }
            _ => self.append_error(Error::UnexpectedToken {
                context: Context::Namespace,
                found_token: self.peek().clone(),
                source_reference: self.source_code(),
            }),
        };

        if let Some(declaration) = declaration {
            declarations.push(declaration);
        } else {
            // skip to the next available declaration or delimiter
            self.skip_to(|token| {
                matches!(
                    token.token_kind(),
                    TokenKind::Keyword(Keyword::Namespace)
                        | TokenKind::Keyword(Keyword::Using)
                        | TokenKind::Punctuator('}')
                )
            });
        }

        // move to the next significant token
        self.move_to_significant();
    }

    /* More to come ... */
}

You might have noticed that we're using the skip_to method to skip to the next available declaration or the closing curly brace. This helper method bypasses tokens until the given predicate returns true. In our case, we skip tokens until we encounter a namespace keyword, a using keyword, or a } token. A sketch of this helper is shown below.
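
The skip_to helper itself isn't listed in the post, so here is a minimal sketch of how it might look, built only on the peek, next, and token-kind APIs introduced earlier. The real implementation in the repository may differ:

impl<'a> Parser<'a> {
    // Skips tokens until the given predicate matches the current token or
    // the end of the file is reached. (A sketch, not the actual
    // implementation.)
    pub(crate) fn skip_to(&mut self, predicate: impl Fn(&Token<'a>) -> bool) {
        while !predicate(self.peek())
            && !matches!(self.peek().token_kind(), TokenKind::EndOfFile)
        {
            self.next();
        }
    }
}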

Finally, we'll expect the closing curly brace and return the namespace declaration.

/// Parses the current token stream as a namespace declaration
///
/// Namespace_Declaration:
///    `namespace` Qualified_Name `{` Declaration* `}`
pub fn parse_namespace_declaration(
    &mut self,
) -> Option<PositiionWrapper<Declaration<'a>>> {
    /* From the previous code ... */

    // expect the closing curly bracket
    if !matches!(
        self.next_significant().token_kind(),
        TokenKind::Punctuator('}')
    ) {
        return self.append_error(Error::PunctuatorExpected {
            expected_punctuator: '}',
            found_token: self.peek_back().clone(),
            source_reference: self.source_code(),
        });
    }

    Some(PositiionWrapper {
        position: starting_position..self.peek_back().position_range().end,
        value: Declaration::NamespaceDeclaration {
            namespace_name,
            declarations,
        },
    })
}

Parsing The Program

Finally, we'll write the parse_program method that parses the entire program. Luckily, a program is just a list of declarations, so we'll loop through all the declarations and parse them similarly to how we parsed the namespace body.

/// Parses all token stream as a program
///
/// Program:
///  Declaration*
pub fn parse_program(&mut self) -> Option<Program<'a>> {
    let mut program = Program {
        declarations: Vec::new(),
    };

    // move to the next significant token
    self.move_to_significant();

    while !matches!(self.peek().token_kind(), TokenKind::EndOfFile) {
        let declaration = match self.peek().token_kind() {
            TokenKind::Keyword(Keyword::Namespace) => {
                self.parse_namespace_declaration()
            }
            TokenKind::Keyword(Keyword::Using) => {
                self.parse_using_statement()
            }
            _ => self.append_error(Error::UnexpectedToken {
                context: Context::Program,
                found_token: self.peek().clone(),
                source_reference: self.source_code(),
            }),
        };

        if let Some(declaration) = declaration {
            program.declarations.push(declaration);
        } else {
            // skip to the next available declaration or delimiter
            self.skip_to(|token| {
                matches!(
                    token.token_kind(),
                    TokenKind::Keyword(Keyword::Namespace)
                        | TokenKind::Keyword(Keyword::Using)
                )
            });
        }

        // move to the next significant token
        self.move_to_significant();
    }

    Some(program)
}
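
As a final, hedged end-to-end sketch, driving the parser over a whole source file might look roughly like this. The parse_source helper is hypothetical and assumes the Program type's declarations field is accessible from here, as it is inside parse_program above:

/// Illustrative only: parses a whole source file and reports how many
/// top-level declarations were found.
fn parse_source<'a>(source_code: &'a SourceCode) -> Option<Program<'a>> {
    let mut parser = Parser::new(source_code);
    let program = parser.parse_program()?;
    println!(
        "parsed {} top-level declaration(s)",
        program.declarations.len()
    );
    Some(program)
}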

Summary

Phew! 😮‍💨😮‍💨 We've created a simple language parser that can understand basic program structure. But we're not done yet; in the next post, we'll continue writing the parser and handle function declarations.

As always, you can find the source code for this post on GitHub.

Feel free to comment if you have any thoughts, questions, feedback, or suggestions. I will be happy to read them 😊.

Top comments (2)

Isaac Vinícius

people love writing lexers/parsers, but they give up before the code generation...

Somnams

That is very helpful for me, I'm a compiler beginner, can I study the source code of this series ?