Lossless Syntax Trees

#rust #parsing #languagedesign #compilers

So you want to parse a programming language. You want to turn some text into a semantic data structure representing the structure of the program as written by the user.

But how exactly do you structure such a semantic model? A Parse Tree (sometimes called a Concrete Syntax Tree, or CST) is what a grammar-based parsing tool like ANTLR typically produces for this purpose. This faithfully represents the structure of the formal parse tree, including meaningless syntax that’s there for disambiguation but often excluding trivia that's allowed to be anywhere, such as whitespace and comments. The traditional improvement to this approach is something called an Abstract Syntax Tree, or AST for short, which models only the meaningful parts of the grammar.

However, for our IDE-ready compiler, there are two key properties that we have to fulfill: the parser needs to be error-tolerant, and it should be roundtripable. We need error-tolerance because even in the face of an incomplete or incorrect input, we need to produce a syntax tree so we can provide things like syntax highlighting and completion on in-progress code. Code in the IDE spends most of its time in an incomplete state as it is being written. As for roundtrippable, we want to be able to transform the syntax tree in the IDE and maintain things like formatting and comments. To get both of these properties, we use rowan, a library implementing a more recent syntax tree design called a Lossless Syntax Tree^†.

The key trick of the rowan LST is structuring a dynamically typed CST such that we can provide a structurally typed AST view into the CST. The untyped nature of the CST also provides for representation of our error tolerance: any node can have arbitrary children, so long as our parser is smart enough to figure out what node is where.

This uses a kind of red-green tree like is used in Roslyn. The green tree is the CST, and is structured like a normal singly-linked tree: each node has a list of its children. The tree is dynamically typed, as it only stores a #[repr(u16)] enum SyntaxKind^‡ alongside the textual width of the node. The red tree is a view into the green tree which is dynamically constructed on-demand, and contains parent pointers along with absolute offsets. The AST layer adds extra type information onto the red tree for what you can do with any given node, though all accessors necessarily return an Option to represent the dynamicism of the concrete tree that allows for it to represent incorrect source code.

To build our compiler to this point, we need to first do a bit of refactoring. The design so far is solid, but a few tweaks will prepare it for the parser. First, we should update any out-of-date dependencies. cargo upgrade--all --allow-prerelease does the trick, giving at the time of writing, just an upgrade from tera@1.0.0-beta.16 to tera@1.0.0-beta.18.

Now, let's remind ourselves of the grammar we're working with:

Program = Statement*;
Statement =
  | If:{ "if" cond:(Expression::Parenthesized) then:Statement { "else" else:Statement }? }
  | While:{ "while" cond:(Expression::Parenthesized) then:Statement }
  | Block:{ "{" then:Statement* "}" }
  | Expression:{ then:Expression? ";" }
  ;
Expression =
  | Parenthesized:{ "(" Expression ")" }
  | Assignment:{ id:Identifier "=" val:Expression }
  | Comparison: #[associativity(left)] { lhs:Expression "<" rhs:Expression }
  | Addition: #[associativity(left)] { lhs:Expression "+" rhs:Expression }
  | Subtraction: #[associativity(left)] { lhs:Expression "-" rhs:Expression }
  | Term:Term
  ;
Term =
  | Identifier:Identifier // token
  | Integer:Integer // token
  | Expression:(Expression::Parenthesized)
  ;

(If you're following along as these are posted, you may note that this is different than the grammar I've used so far; this comes from starting implementing the parser and finding improvements.)

Here we see that a Program is made up of zero-or-more Statement; a Statement is either Statement::If, Statement::While, Statement::Block, or Statement::Expression; and Statement::If is a literal if followed by an Expression::Parenthesized and a Statement, optionally followed by a literal else and another Statement. This describes the structure of a correct tree. We add the nonterminal symbols to our syntax.toml:

nonterminals = [
    "program",
    "statement if",
    "statement while",
    "statement block",
    "statement expression",
    "expression parenthesized",
    "expression assignment",
    "expression comparison",
    "expression addition",
    "expression subtraction",
    "expression term",
    "term identifier",
    "term integer",
    "term expression",
]

In addition, I upgraded the error kind from being listed in the metadata file to being explicitly listed last in the template file. While it is a node kind, it's different than the real nodes of a proper tree, and should be treated specially. Now, we should rename our existing SyntaxKind to TokenKind to clarify that it's just for tokens, then create a new template for SyntaxKind, which includes the nonterminals:

{%- set empty = [] -%}
// Separate variables for terminals and nonterminals
{%- set terminals = empty
    | concat(with=keywords)
    | concat(with=literals)
    | concat(with=punctuation | map(attribute="name"))
    | concat(with=tokens)
-%}
{%- set all_kinds = empty
    | concat(with=terminals)
    | concat(with=nonterminals)
-%}

#[repr(u16)]
#[allow(missing_docs)]
// Add serde to runtime dependencies.
// TokenKind should impl `Serialize` as well.
#[derive(serde::Serialize)]
#[derive(Copy, Clone, Debug, Eq, PartialEq, Hash)]
pub enum SyntaxKind {
    {%- for kind in all_kinds %}
    {{ kind | camel_case }},
    {%- endfor %}
    ERROR,
}

impl SyntaxKind {
    /// The name of this syntax kind.
    pub fn name(self) -> &'static str {
        match self {
            {%- for kind in all_kinds %}
            SyntaxKind::{{ kind | camel_case }} => "{{ kind | camel_case }}",
            {%- endfor %}
            SyntaxKind::ERROR => "ERROR",
        }
    }
}

// We need to convert from TokenKind to SyntaxKind.
// This could be explicitly zero-cost by using a transmute,
// but we stick here to the safe version and let it optimize.
impl From<TokenKind> for SyntaxKind {
    fn from(token: TokenKind) -> SyntaxKind {
        match token {
            {%- for kind in terminals %}
            TokenKind::{{ kind | camel_case }} => SyntaxKind::{{ kind | camel_case }},
            {%- endfor %}
            TokenKind::ERROR => SyntaxKind::ERROR,
        }
    }
}

// For rowan, we also want conversions to/from u16.
impl From<SyntaxKind> for u16 {
    fn from(kind: SyntaxKind) -> u16 {
        kind as u16
    }
}

// This could be explicitly zero-cost by using a transmute,
// but we stick here to the safe version and let it optimize.
impl From<u16> for SyntaxKind {
    fn from(raw: u16) -> SyntaxKind {
        match raw {
            {%- for kind in all_kinds %}
            {{ loop.index }} => SyntaxKind::{{ kind | camel_case }},
            {%- endfor %}
            _ => SyntaxKind::ERROR,
        }
    }
}

We also want to pull the Token struct from tinyc_lexer into tinyc_grammar, and re-export it from tinyc_lexer instead:

use serde::ser::{Serialize, Serializer};

include!(concat!(env!("OUT_DIR"), "/token_kinds.rs"));
include!(concat!(env!("OUT_DIR"), "/syntax_kinds.rs"));

#[derive(Copy, Clone, Debug, Eq, PartialEq, Hash)]
pub struct Token {
    pub kind: TokenKind,
    pub len: u32,
}

impl Serialize for Token {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        serializer.serialize_newtype_variant(
            "Token",
            u16::from(self.kind).into(),
            self.kind.name(),
            &self.len,
        )
    }
}

It fits better in tinyc_grammar, which is our crate for raw data boxes than tinyc_lexer, which is our crate implementing the lexer. Other crates could easily want to consume Token without depending on the lexer.

With these improvements, we're ready to start implementing the parser next time.

Discuss this below! | See the result on GitHub!

How did I do on code density this time? Last time felt a bit too much here's the code, so this one tries to focus a bit more on concepts and less on concrete implementation.

† You may also see a Lossless Syntax Tree referred to as a Concrete Syntax Tree. I prefer LST for this concept as a Parse Tree is somewhat commonly called a CST (as mentioned above) and is explicitly not what the LST is. Often times, an LST will simply be called Syntax Tree as well. whether Syntax Tree means CST, AST, or LST is left up to context. There are, of course, also alternate approaches to syntax trees that fall somewhere in-between these three categories.

‡ Actually, rowan uses a struct SyntaxKind(u16) internally and translates between that and the user's syntax kind enum at a higher level. This reduces generics proliferation and allows the library user to use a more complicated sum type than just a C-style enum if desired.