In the realm of programming, syntax analysis, commonly referred to as parsing, plays a pivotal role during the compilation process. The parser's primary responsibility is to transform a stream of tokens into an abstract syntax tree (AST).
In this article, we will introduce Pratt parsing, a renowned algorithm for parsing, often recognized as top-down operator precedence parsing, and we will demonstrate how to implement it using MoonBit.
Why Using Pratt Parser
Almost all programmers are familiar with infix expressions, including staunch Lisp/Forth programmers. At least, they are aware that the majority of the world writes arithmetic expressions in this way:
24 * (x + 4)
For compiler (or interpreter) writers, parsing such infix expressions is slightly more challenging than parsing the prefix expressions used in Lisp and the postfix expressions used in Forth. For instance, using a basic handwritten recursive descent parser requires multiple mutually recursive functions, and handling left recursion in the grammar for expression syntax can make the code less user-friendly, especially as the number of operators increases. Parser generator tools also do not offer a very satisfactory solution in this regard. Take a BNF example of a simple addition and multiplication expression:
Expr ::=
Factor
| Expr '+' Factor
Factor ::=
Atom
| Factor '*' Atom
Atom ::=
'number'
| '(' Expr ')'
This is definitely not very intuitive, and one might even need to spend time revisiting formal language courses taken in university. In some languages like Haskell, which supports custom infix operators, solving this with a parser generator tool is almost impractical.
In such situations, the Pratt parser emerges as an effective solution for the complexities associated with parsing infix expressions. Its design is not only adept at handling these challenges but also offers remarkable versatility. It also allows for the straightforward integration of new operators, all without the need to modify existing source code. Esteemed in the field of compiler construction, the Pratt parser is recommended for use in conjunction with recursive descent parsers in the renowned book "Crafting Interpreters". Additionally, it is used in the Rust-analyzer project.
Binding Power in Pratt Parser
In the Pratt parser, the concept used to describe associativity and precedence is called "binding power." For each infix operator, its binding power consists of a pair of integers - one for the left and one for the right. This is illustrated below:
expr: A + B * C
power: 3 3 5 5
And its function aligns perfectly with its name – the larger the number, the higher the precedence it has in grabbing a particular operand (in this example, A, B, C are operands).
The above example illustrates operators with different precedence levels, and the associativity of the same operator is represented by a pair of binding powers, one larger and one smaller.
expr: A + B + C
power: 1 2 1 2
In this particular instance, as the parser encounters B, the expression undergoes a transformation due to its stronger left binding power, resulting in the following configuration:
expr: (A + B) + C
power: 1 2
Next, let's delve into the practical application of Pratt parser and examine how it effectively employs this concept during execution.
Overview and Preliminary Preparation
The general framework of the Pratt parser is roughly as follows:
fn parseExpr(self : Tokens, min_bp : Int) -> Result[SExpr, String] {
......
while true {
.....
parseExpr()
}
}
From the above text, it can be seen that it is implemented using a combination of recursions and loops. This actually corresponds to two modes:
- Always have the leftmost expression at the innermost level, i.e.,
"1 + 2 + 3" = "(1 + 2) + 3"
, and this can be parsed using loops. - Always have the rightmost expression at the innermost level, i.e.,
"1 + 2 + 3" = "1 + (2 + 3)"
, and this can be parsed using recursion.
min_bp
is a parameter representing the binding power of an operator on the left that has not been parsed yet.
Since various errors may occur during this process, the return type of parseExpr
is Result[SExpr, String]
.
However, before starting to write the parser, we also need to tokenize the string to obtain a simple token stream.
enum Token {
Operand(String)
Operator(String)
Eof
} derive(Debug, Show, Eq)
struct Tokens {
mut tokens : List[Token]
}
This token stream requires three methods: peek()
, next()
, and eat()
.
The peek()
method retrieves the first token in the token stream without altering the state. In other words, it is side-effect-free; it merely takes a peek at the content to be processed. For an empty token stream, it returns Eof
.
fn peek(self : Tokens) -> Token {
match self.tokens {
Cons(t, _) => t
Nil => Eof
}
}
next()
consumes a token based on peek()
.
fn next(self : Tokens) -> Token {
match self.tokens {
Cons(t, ts) => {
self.tokens = ts
return t
}
Nil => Eof
}
}
Our goal is to take in a token stream and output a prefix expression without considering precedence.
enum SExpr {
Atom(String)
Cons(String, Array[SExpr])
}
Finally, we need a function to calculate the binding power of operators, which can be implemented using a simple match
statement. In practical implementation, to facilitate the addition of new operators, some key-value pair container should be used.
fn infix_binding_power(op : String) -> Result[(Int, Int), String] {
match op {
"+" => Ok((1, 2))
"-" => Ok((1, 2))
"/" => Ok((3, 4))
"*" => Ok((3, 4))
_ => Err("infix_binding_power(): bad op \(op)")
}
}
Parser Implementation
The initial steps are straightforward. First, take out the first token and assign it to the variable lhs
(short for left hand side, representing the left-hand side parameter).
- If it is an operand, store it.
- If it is a parenthesis (parentheses are also treated as an operator, but they are handled specially), recursively parse the first expression, then consume a pair of parentheses. The
?
here is a new error-handling syntax introduced by MoonBit. If the preceding expression isErr()
, the current function will terminate immediately, considering thisErr()
as the return value. - Any other results indicate a parsing issue, throw an error. Here, wrap an error string using
Err()
.
let mut lhs : Result[SExpr, String] = match self.next() {
Operand(s) => Ok(Atom(s))
Operator("(") => {
let expr = parseExpr(self, 0)?
self.eat(Operator(")"))
Ok(expr)
}
t => Err("bad token \(t)")
}
Next, let's take a look at the first operator:
- If the result is Eof at this point, it doesn't necessarily mean failure. An operand can also be considered a complete expression, so we can exit the loop.
- If the result is an operator, return it normally.
- Any other result indicates an error.
while true {
let op : Result[String, String] = match self.peek() {
Eof => { break }
Operator(op) => Ok(op)
t => Err("bad token \(t)")
}
let op = op?
......
Next, we need to determine which operator lhs
belongs to. Here, we use the min_bp
parameter, which represents the binding power of the nearest unfinished operator on the left. Its initial value is 0 (indicating no operator is competing for the first operand on the left). However, before that, we need to make a decision: is the operator a parenthesis? If it is, it means we are parsing an expression inside parentheses, and we should exit the loop directly. This is one of the reasons for using the peek
method because we cannot determine whether to consume the operator here.
After calculating the binding power l_bp
of the current operator op
, we first compare it with min_bp
:
- If
l_bp
is less thanmin_bp
, immediately break. This will returnlhs
to the upper level, where an operator is still waiting for the right-hand parameter. - Otherwise, use the
next
method to consume the current operator, and recursively callparseExpr
to obtain the right-hand parameter. The second parameter is the right binding power of the current operator. After a successful parsing, assign the result tolhs
and continue the loop.
if op != ")" {
let (l_bp, r_bp) = infix_binding_power(op)?
if l_bp < min_bp {
break
}
self.next()
let l = lhs?
let r = parseExpr(self, r_bp)?
lhs = Ok(Cons(op, [l, r]))
continue
}
break
At the end of the entire function, simply return lhs
.
Usage Example: Stack-based Calculator
Reverse Polish Notation (RPN) is a mathematical notation introduced by Polish mathematician Jan Łukasiewicz in 1920. The distinctive feature of this notation is that operators are written after operands. Behind this somewhat peculiar appearance lies a close connection to the stack data structure. For example, the infix expression
17 + 13 * 3
corresponds to the RPN expression 17 13 3 * +
. This can be implemented with a series of instructions.
Push(17)
Push(13)
Push(3)
Mul() // Take the top two elements from the stack, multiply them, and then put the result back on top of the stack.
Add() // As above, but replace multiplication with addition.
The conversion process from an arithmetic expression to a sequence of instructions is simple and direct.
fn compile(this : SExpr) -> List[Instruction] {
// Can only handle arithmetic expressions.
match this {
Atom(s) => Cons(Push(string_to_int(s)), Nil)
Cons("*", arr) => {
append(append(compile(arr[0]), compile(arr[1])), Cons(Mul, Nil))
}
Cons("+", arr) => {
append(append(compile(arr[0]), compile(arr[1])), Cons(Add, Nil))
}
}
}
For a detailed implementation of instruction interpretation, please see the complete code provided at the end of this article.
Conclusion
The Pratt parser is not only capable of handling infix operators but also versatile enough to cover postfix and prefix operators, as well as access operators like arr[i]
. The complete code example provided here includes handling for prefix operators:try.moonbitlang.cn/#99b19baf
Reference
Matklad. (Apr 13, 2020) Simple but Powerful Pratt Parsing. https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
Top comments (0)