The role of a parser is turning text into an Abstract Syntax Tree, with minimum of hassle. In the previous episode, ANTLR 4 failed us utterly, as it merely generates a Concrete Syntax Tree.
So we need to do something about it, and here are a few different design patterns you can use to get either Abstract Syntax Tree, or at least Better Concrete Syntax Trees.
As far as I know, nothing like that is documented anywhere, as ANTLR 4 documentation is painfully Java only, and many techniques I'll be using are Python specific and wouldn't even work in Java.
Abstract and Concrete Syntax Trees
So what are different types of Syntax Trees, and why does it matter?
Let's take a very simple language for parsing just mathematical expressions, and our program is 2 + 3 * 4
.
Here's possible Abstract Syntax Tree for that (ignoring namespaces for these classes, debug info like line numbers etc.):
Add(
Number(2),
Multiply(
Number(3),
Number(4)
)
)
If we arrive at a tree like that, the parsing part is done, and we can move on with the rest of the program. That's the Abstract Syntax Tree - all the irrelevant details are removed, just meaningful structure remains, and the structure of the tree matches the logical structure of the program.
On the other hand, Concrete Syntax Tree instead returns a tree that very closely follows structure of the grammar. Concrete Syntax Tree for the same program might look like this:
Expr(
Expr(
Term(
Factor(
Number("2")
),
),
),
Operator("+"),
Term(
Term(
Factor(
Number("3")
)
),
Operator("*"),
Factor(
Number("4")
)
)
)
This is definitely not what we want! Nearly every parser generator, including older versions of ANTLR, generates AST-style results, because that is always the end goal, with CST at most being an intermediate representation created along the way, and most parsers don't even bother constructing that.
ANTLR 4 somehow made an insane decision to not do that, and only provide the CST, leaving it up to us to convert it to AST. This means any program using ANTLR 4 needs to include a lot of tedious boilerplate for CST to AST conversion. Or operate directly on CST, which is completely impractical except for maybe some toy cases.
Even for this simple program CST was already quite bad, but the more complex the grammar, the more complex CST gets, even for the same AST. If your table of operators has 15 precedence levels (like C), that means 15 extra levels of pointless trivial nodes in the CST! Totally crazy!
Annotated Grammar
Back in ANTLR 1-3 the way to get an AST was to put some actions in the grammar itself. Typically an action would do some part of AST construction. As AST construction is very closely driven by the syntax tree, this usually works beautifully.
ANTLR 4 makes it quite hard to put actions in the grammar, but they provided a few partial replacement features which we can use to get better CST at least.
So we take our Math.g4
:
grammar Math;
expr : expr ('+' | '-') term
| term;
term : term ('*' | '/') factor
| factor;
factor : '(' expr ')'
| number
| identifier;
number: NUM;
identifier: ID;
ID : [a-zA-Z_] [a-zA-Z0-9_]* ;
NUM : '-'? [0-9]+ ( '.' [0-9]*)?;
WS : [ \t\r\n]+ -> skip; // ignore all whitespace
And add some #
-annotations to alternative branches, as well as a=
/b=
annotations to parts of the match, getting this:.
grammar Math;
expr : a=expr '+' b=term # Add
| a=expr '-' b=term # Sub
| a=term # TrivialExpr
;
term : a=term '*' b=factor # Mul
| a=term '/' b=factor # Div
| a=factor # TrivialTerm
;
factor : '(' a=expr ')' # TrivialParensExpr
| a=NUM # Number
| a=ID # Identifier
;
ID : [a-zA-Z_] [a-zA-Z0-9_]* ;
NUM : '-'? [0-9]+ ( '.' [0-9]*)?;
WS : [ \t\r\n]+ -> skip; // ignore all whitespace
Let's see the resulting CST, and how much it's improved:
Sum(
TrivialExpr(
TrivialTerm(
Number("2")
),
),
Mul(
TrivialTerm(
Number("3")
),
Number("4")
)
)
This isn't quite AST, as Number
s are not aware they should be converting their string contents to floats, and we have a few Trivial*
nodes, but we took a good few steps towards our goal.
Let's update both Listener version, and Visitor version to use the new CST.
Updated Listener with Better CSTs
#!/usr/bin/env python3
from antlr4 import *
from MathLexer import MathLexer
from MathParser import MathParser
from MathListener import MathListener
import sys
class MathProgram(MathListener):
def exitNumber(self, node):
value = float(node.getText())
self.stack.append(value)
def exitIdentifier(self, node):
value = self.getVar(node.getText())
self.stack.append(value)
def exitAdd(self, node):
b = self.stack.pop()
a = self.stack.pop()
self.stack.append(a + b)
def exitSub(self, node):
b = self.stack.pop()
a = self.stack.pop()
self.stack.append(a - b)
def exitMul(self, node):
b = self.stack.pop()
a = self.stack.pop()
self.stack.append(a * b)
def exitDiv(self, node):
b = self.stack.pop()
a = self.stack.pop()
self.stack.append(a / b)
def getVar(self, name):
if name not in self.vars:
self.vars[name] = float(input(f"Enter value for {name}: "))
return self.vars[name]
def run(self, node):
self.stack = []
self.vars = {}
ParseTreeWalker().walk(self, node)
result = self.stack[0]
print(result)
def parseFile(path):
lexer = MathLexer(FileStream(path))
stream = CommonTokenStream(lexer)
parser = MathParser(stream)
tree = parser.expr()
MathProgram().run(tree)
if __name__ == "__main__":
path = sys.argv[1]
parseFile(path)
This is more code, but it's so much cleaner than what we had before. There are zero grammar checks like len(node.children) == 3
or node.children[1].getText() == "*"
. We know which branch matched, and some branches we had to check for before (all the Trivial
ones) we can now ignore, as their default action is to do nothing anyway.
This didn't require any extra work on Python code part, and we didn't even use the a=
and b=
annotations here (these are for the visitor pattern only).
One issue still remains - listener pattern is highly specific to the kind of processing we want to do. For our math program we could get away with simply using self.stack
and processing things in order, but this won't be so easy in general.
Updated Visitor with Better CSTs
Visitor will need some Python meta-programming. We'll need to implement slightly magical MathProgram.eval
to do our routing depending on type of each node, and for automatically skipping trivial nodes.
But the rest of MathProgram
is actually really nice, and almost follows what an AST-processing MathProgram
would do.
#!/usr/bin/env python3
from antlr4 import *
from MathLexer import MathLexer
from MathParser import MathParser
import sys
class MathProgram:
def __init__(self, program):
self.program = program
def evalAdd(self, node):
return self.eval(node.a) + self.eval(node.b)
def evalSub(self, node):
return self.eval(node.a) - self.eval(node.b)
def evalMul(self, node):
return self.eval(node.a) * self.eval(node.b)
def evalDiv(self, node):
return self.eval(node.a) / self.eval(node.b)
def evalNumber(self, node):
return float(node.getText())
def evalIdentifier(self, node):
return self.getVar(node.getText())
def getVar(self, name):
if name not in self.vars:
self.vars[name] = float(input(f"Enter value for {name}: "))
return self.vars[name]
def eval(self, node):
if not isinstance(node, ParserRuleContext):
raise Exception(f"{node} must be a node, not a {type(node)}")
name = "eval" + type(node).__name__[:-7]
if name[:11] == "evalTrivial":
return self.eval(node.a)
return self.__getattribute__(name)(node)
def run(self):
self.vars = {}
result = self.eval(self.program)
print(result)
def parseFile(path):
lexer = MathLexer(FileStream(path))
stream = CommonTokenStream(lexer)
parser = MathParser(stream)
tree = parser.expr()
MathProgram(tree).run()
if __name__ == "__main__":
path = sys.argv[1]
parseFile(path)
OK, what's going on here:
- before, we knew types of each subexpression (like
Term
), so we could callevalTerm
fromevalExpr
. But these were all stupid types, and we don't want them. Now thatTerm
was split betweenMul
,Div
, andTrivialTerm
,evalExpr
can't possibly know whatnode.b
is going to be. - and so to get the right method to be called, we use
eval
method, which checks type of the node and calls what it needs to -
MathProgram.eval
method is completely unrelated to Python'seval
global function, so it's perhaps a slightly confusing naming, even if it's not ambiguous to Python (self.eval
vseval
) -
evalAdd
etc. methods - they need to recursively callself.eval(node.a)
- but we really don't care about specific types, so it's much more readable thanself.evalExpr(node.a) + self.evalTerm(node.b)
would be - if we had an AST,
evalAdd
,evalSub
,evalMul
, andevalDiv
would likely just like they do now, huge win! -
evalNumber
is not perfect, as it need to dofloat(node.getText())
instead of having that number converted to a float during parsing as AST would, but it's good enough -
evalIdentifier
also needs to call.getText()
which does some calculations to build that string, AST would just have.name
or such - we could implement a bunch of methods like
evalTrivialExpr(self, node): return self.eval(node.a)
, but I madeeval
method handle all of these automatically - this automatically deals with the biggest difference between our "improved CST" which has those extra nodes, and a true "AST" which wouldn't.
Interestingly, in some languages you can implement multiple methods with the same name, and they'd dispatch on dynamic type of the argument, so we could implement def eval(self, node : DivContext)
etc. and it would pick the right one. It's not actually that common - most languages don't support that at all, or only support static version of it and the point is that we don't know argument type statically.
And in some other languages we could easily add extra methods to those generated classes, which would be another way to do it more cleanly. This is even sort of doable in Python, but it's a bit messy.
There are ways to do this with less aggressive meta-programming, like a bunch of ininstance
statements, or a dictionary keyed by class, but I think this is the most concise.
Abstract Syntax Tree
Well, so how about we go beyond the Better Concrete Syntax Tree idea, and just build a whole new AST. This is the best way if you need to do a lot of processing. You can leave the CST behind, and work with just very nice AST.
Here's one way to do this:
#!/usr/bin/env python3
from antlr4 import *
from MathLexer import MathLexer
from MathParser import MathParser
from collections import namedtuple
import sys
AddNode = namedtuple("AddNode", ["a", "b"])
SubNode = namedtuple("SubNode", ["a", "b"])
MulNode = namedtuple("MulNode", ["a", "b"])
DivNode = namedtuple("DivNode", ["a", "b"])
NumberNode = namedtuple("NumberNode", ["value"])
IdentifierNode = namedtuple("IdentifierNode", ["name"])
class MathAstBuilder:
def buildAdd(self, node):
return AddNode(self.build(node.a), self.build(node.b))
def buildSub(self, node):
return SubNode(self.build(node.a), self.build(node.b))
def buildMul(self, node):
return MulNode(self.build(node.a), self.build(node.b))
def buildDiv(self, node):
return DivNode(self.build(node.a), self.build(node.b))
def buildNumber(self, node):
return NumberNode(float(node.getText()))
def buildIdentifier(self, node):
return IdentifierNode(node.getText())
def build(self, node):
if not isinstance(node, ParserRuleContext):
raise Exception(f"{node} must be a node, not a {type(node)}")
name = "build" + type(node).__name__[:-7]
if name[:12] == "buildTrivial":
return self.build(node.a)
return self.__getattribute__(name)(node)
class MathProgram:
def __init__(self, program):
self.program = program
def evalAdd(self, node):
return self.eval(node.a) + self.eval(node.b)
def evalSub(self, node):
return self.eval(node.a) - self.eval(node.b)
def evalMul(self, node):
return self.eval(node.a) * self.eval(node.b)
def evalDiv(self, node):
return self.eval(node.a) / self.eval(node.b)
def evalNumber(self, node):
return node.value
def evalIdentifier(self, node):
return self.getVar(node.name)
def getVar(self, name):
if name not in self.vars:
self.vars[name] = float(input(f"Enter value for {name}: "))
return self.vars[name]
def eval(self, node):
name = "eval" + type(node).__name__[:-4]
return self.__getattribute__(name)(node)
def run(self):
self.vars = {}
result = self.eval(self.program)
print(result)
def parseFile(path):
lexer = MathLexer(FileStream(path))
stream = CommonTokenStream(lexer)
parser = MathParser(stream)
cst = parser.expr()
ast = MathAstBuilder().build(cst)
MathProgram(ast).run()
if __name__ == "__main__":
path = sys.argv[1]
parseFile(path)
Step by step:
- we need to define every AST node type - there's usually going to be a lot less than CST node types (usually around half)
- as these are super simple we can use
collections.namedtuple
to define a lot of them at once - we could also use dictionaries to represent nodes etc. - AST can be anything you want
-
MathAstBuilder.build
does AST construction, using the same metaprogramming as we had before - it's mostly independent of the kind of language we have, we just need to dig down the CST, skip trivial nodes, and call the appropriatebuild
method recursively - then we have
MathProgram
- in this case it's the same tree ofeval
as we had before, but that's just because it's such a simple kind of program. Normally this part would be the most complex part of your program, and contain all the actual logic. -
MathProgram
still uses some metaprogramming ineval
but we can get rid of that...
Abstract Syntax Tree with proper classes
If you don't like MathProgram.eval
and want your program to use proper classes and no magic beyond parsing, here's another way to do the AST. All the code in the logic in just normal OOP methods:
#!/usr/bin/env python3
from antlr4 import *
from MathLexer import MathLexer
from MathParser import MathParser
from collections import namedtuple
import sys
class AddNode:
def __init__(self, a, b):
self.a = a
self.b = b
def eval(self, context):
return self.a.eval(context) + self.b.eval(context)
class SubNode:
def __init__(self, a, b):
self.a = a
self.b = b
def eval(self, context):
return self.a.eval(context) - self.b.eval(context)
class MulNode:
def __init__(self, a, b):
self.a = a
self.b = b
def eval(self, context):
return self.a.eval(context) * self.b.eval(context)
class DivNode:
def __init__(self, a, b):
self.a = a
self.b = b
def eval(self, context):
return self.a.eval(context) / self.b.eval(context)
class NumberNode:
def __init__(self, value):
self.value = value
def eval(self, context):
return self.value
class IdentifierNode:
def __init__(self, name):
self.name = name
def eval(self, context):
return context.getVar(self.name)
class MathAstBuilder:
def buildAdd(self, node):
return AddNode(self.build(node.a), self.build(node.b))
def buildSub(self, node):
return SubNode(self.build(node.a), self.build(node.b))
def buildMul(self, node):
return MulNode(self.build(node.a), self.build(node.b))
def buildDiv(self, node):
return DivNode(self.build(node.a), self.build(node.b))
def buildNumber(self, node):
return NumberNode(float(node.getText()))
def buildIdentifier(self, node):
return IdentifierNode(node.getText())
def build(self, node):
if not isinstance(node, ParserRuleContext):
raise Exception(f"{node} must be a node, not a {type(node)}")
name = "build" + type(node).__name__[:-7]
if name[:12] == "buildTrivial":
return self.build(node.a)
return self.__getattribute__(name)(node)
class MathProgram:
def __init__(self, program):
self.program = program
def getVar(self, name):
if name not in self.vars:
self.vars[name] = float(input(f"Enter value for {name}: "))
return self.vars[name]
def run(self):
self.vars = {}
result = self.program.eval(self)
print(result)
def parseFile(path):
lexer = MathLexer(FileStream(path))
stream = CommonTokenStream(lexer)
parser = MathParser(stream)
cst = parser.expr()
ast = MathAstBuilder().build(cst)
MathProgram(ast).run()
if __name__ == "__main__":
path = sys.argv[1]
parseFile(path)
Abstractify Concrete Syntax Tree
And for the last design pattern, how about we don't build a new AST, but take the existing CST, but "abstractify" it by removing any trivial nodes, and doing any node-specific fixes, like converting strings to numbers? This is very easy in Python, as it's very dynamic. It would of course be completely impossible in Java with its rigid static type system.
#!/usr/bin/env python3
from antlr4 import *
from MathLexer import MathLexer
from MathParser import MathParser
import sys
class MathAbstractify:
def abstractifyAdd(self, node):
node.a = self.abstractify(node.a)
node.b = self.abstractify(node.b)
def abstractifySub(self, node):
node.a = self.abstractify(node.a)
node.b = self.abstractify(node.b)
def abstractifyMul(self, node):
node.a = self.abstractify(node.a)
node.b = self.abstractify(node.b)
def abstractifyDiv(self, node):
node.a = self.abstractify(node.a)
node.b = self.abstractify(node.b)
def abstractifyNumber(self, node):
node.value = float(node.getText())
def abstractifyIdentifier(self, node):
node.name = node.getText()
def abstractify(self, node):
if not isinstance(node, ParserRuleContext):
raise Exception(f"{node} must be a node, not a {type(node)}")
name = "abstractify" + type(node).__name__[:-7]
if name[:18] == "abstractifyTrivial":
return self.abstractify(node.a)
else:
return self.__getattribute__(name)(node) or node
class MathProgram:
def __init__(self, program):
self.program = program
def evalAdd(self, node):
return self.eval(node.a) + self.eval(node.b)
def evalSub(self, node):
return self.eval(node.a) - self.eval(node.b)
def evalMul(self, node):
return self.eval(node.a) * self.eval(node.b)
def evalDiv(self, node):
return self.eval(node.a) / self.eval(node.b)
def evalNumber(self, node):
return node.value
def evalIdentifier(self, node):
return self.getVar(node.name)
def getVar(self, name):
if name not in self.vars:
self.vars[name] = float(input(f"Enter value for {name}: "))
return self.vars[name]
def eval(self, node):
if not isinstance(node, ParserRuleContext):
raise Exception(f"{node} must be a node, not a {type(node)}")
name = "eval" + type(node).__name__[:-7]
return self.__getattribute__(name)(node)
def run(self):
self.vars = {}
result = self.eval(self.program)
print(result)
def parseFile(path):
lexer = MathLexer(FileStream(path))
stream = CommonTokenStream(lexer)
parser = MathParser(stream)
tree = parser.expr()
tree = MathAbstractify().abstractify(tree)
MathProgram(tree).run()
if __name__ == "__main__":
path = sys.argv[1]
parseFile(path)
Step by step:
- The program has two parts now -
MathAbstractify
to convert CST into basically AST, andMathProgram
to execute it -
MathAbstractify.abstractifyX
callsMathAbstractify.abstractify
on the node's children, and returns the updated node - for trivial nodes,
MathAbstractify.abstractify
simply removes the node and returns whatever isMathAbstractify.abstractify
of the child - so resulting tree will have no trivial node - for node that require special processing like
Number
andIdentifier
we do that, and assign the results to appropriate fields - the code in
MathAbstractify.abstractify
is very repetitive, and there are many ways to make it a lot more concise, like having a dictionary with each node's fields to update, but I didn't want to complicate this even further - after that our
MathProgram
is pretty much the same as our first Abstract Syntax Tree version - no need to deal with trivial nodes, and everything on the nodes is precomputed
This pattern is arguably the hackiest of them all, but we get most of the advantage of ASTs without having to define all the node types. MathAbstractify
could also be a lot more concise than what I did here.
Should you use Python ANTLR 4 now?
I think with these design patterns, it's a much better experience than with the official Java-style code documentation recommends.
ANTLR 4 is still likely the most powerful parser generator out there, and it deserves a lot better API than the one it got.
None of these design patterns I presented here are as clean as what we could have if ANTLR 4 supported direct AST generation, but they're useful to consider depending on the type of program you're writing. I definitely do not recommend following either of the official Java-style patterns I showed in the previous episode.
Code
All code examples for the series will be in this repository.
Code for the Abstract Syntax Trees with Python ANTLR 4 episode is available here.
Top comments (5)
Ever tried parser combinators? I guess in the end they're still recursive descend parsers but they don't feel like it
See an example here
Parser combinators are mostly a DSL for writing recursive descent parsers.
But yeah, it's a subject I could have covered, especially as there's some parser combinator libraries for mainstream languages like Python.
yep. I think they work better with languages with custom infix operators though
Haskell desperately needs more operators because you can't reuse normal infix operators due to lack of polymorphism in Haskell.
Python parsec just reuses
+
,|
,<<
,>>
etc. Example here.The only time I really needed more operators was with Ruby Z3.
Not really. All those operators work on extremely polimorphic typeclasses like Monoid,Applicative or Alternative