Parser generators existed since the late 1960s, but even today they still see very little use for parsing programming languages. Almost every parser out there is either hand-coded, or uses a parser generator with so many hacks that it might just as well be (Ruby).
The primary reason is that parser generators just aren't good enough. ANTLR is doing its best to be that good enough parser generator.
In terms of its ability to parse things, it's likely the best one out there. ANTLR is built upon different principles (LL) than other parser generators (historically LR, nowadays also occasionally PEG), but we won't be getting into any theory here today. None of the parser generators strictly adheres to these categories anyway.
The worst thing about ANTLR 4 is Java. The generator itself is Java-based, and is supports a short list of languages - with many previously supported languages for ANTLR 1-3 like Ruby no longer there. But worst of all, its API for the usually nice languages like Python is still painfully Java-like, full of boilerplate code, and very unpleasant to use.
Math Language
For this episode I'll implement a language for doing simple math calculations. It can ask for values it doesn't know, and it prints the final results. Here's a few programs for it.
a.math
- just testing that all the rules are correct:
300 + 50 * 4 + 80 / 4 - (80 - 30) * 2
miles_to_km.math
- very simple unit conversion program:
miles * 1.60934
circle_area.math
- a program to check that we'll be asked about r
only once, not twice:
3.14159265359 * r * r
A language like that is nothing special, but it's at least tiny bit useful.
Math Language In Action
I'll start a bit backwards, and show our goal first:
$ ./listener/Math.py examples/a.math
420.0
$ ./listener/Math.py examples/miles_to_km.math
Enter value for miles: 69
111.04446
$ ./listener/Math.py examples/circle_area.math
Enter value for r: 20
1256.6370614359998
That's the interactions we want. For every variable in the program, it asks us for its value. Everything is a float, all the usual math rules are followed, and result is automatically printed out.
Dependencies
There are two things we need - antlr
itself, which we'll only need to compile .g4
grammar file into a bunch of .py
Python modules. And antlr-python3-runtime
, which our program will need to run. The antlr-python3-runtime
should be added to your program's dependencies.
$ brew install antlr
$ pip3 install antlr4-python3-runtime
Math.g4
So here's the grammar file. It will be identical for both versions that follow:
grammar Math;
expr : expr ('+' | '-') term
| term;
term : term ('*' | '/') factor
| factor;
factor : '(' expr ')'
| number
| identifier;
number: NUM;
identifier: ID;
ID : [a-zA-Z_] [a-zA-Z0-9_]* ;
NUM : '-'? [0-9]+ ( '.' [0-9]*)?;
WS : [ \t\r\n]+ -> skip; // ignore all whitespace
To compile it, run antlr -Dlanguage=Python3 Math.g4
. It will then generate three files MathLexer.py
, MathParser.py
, and MathListener.py
, which you can import like any Python modules. Also a few extra debug files that you won't need for anything, I'm not really sure why it bothers generating them by default.
So let's explain the grammar:
-
grammar Name;
- should generally match the file name -
rule : definition;
- how various parts of the grammar are defined, parser rules are lower case, lexer rules are upper case -
expr : expr ('+' | '-') term | term;
- this meansexpr
is one of:expr + term
,expr - term
, orterm
-
term : term ('*' | '/') factor | factor;
- this meansterm
is one of:term * factor
,term / factor
, orfactor
-
factor : '(' expr ')' | number | identifier
- this means is one of:( expr )
,number
, oridentifier
-
number
is a float, by regular expression -
identifier
is any name, by regular expression -
WS
is whitespace, we do-> skip
as special rule to tell ANTLR to just ignore it between tokens
Now if you somehow remember anything about parsing theory, you might have noticed that expr
and term
definitions are "left recursive", that is they start by referring to themselves. Traditional LL parsers can't deal with it, because they'd work as this:
- is what follows
expr
? let's find out! First alternative isexpr ('+' | '-') term
, so lets check this - is what follows
expr ('+' | '-') term
? Let's do this one at a time, starting withexpr
- is what follows
expr
? Oh wait, we're in an infinite loop
So traditionally, if you wanted to use LL grammar, you'd need to rewrite such rules, for example to expr: term (('+'|'-') term)*
. And internally it's sort of true, but ANTLR 4 does all of that behind the scenes, supporting every directly left recursive grammar.
It still doesn't automatically support indirect left recursion (a
starts with b ...
and b
starts with a ...
) - so for such grammars you'd need to give it a hand, but directly left recursive grammars are extremely common (like, everything that includes math expressions), while indirectly left recursive grammars are rare. And all the usual tricks for dealing with them still work.
How ANTLR works
We are ready to parse things, but to write a program to use the parser we need to do some extra work. ANTLR 4 supports three main ways:
First, we can put code directly inside .g4
files - this is how most parser generators work, but ANTLR 4 hates this idea. Python target for ANTLR 4 hates it even more, as it just refuses to handle Python indentation in any sensible way. It's weird just how much they try to make you not do this, especially since this is how previous versions of ANTLR worked, it's how most other parser generators work, and it's generally the most concise way. They give stupid arguments that then you can't share your grammar between different languages, but that's almost never relevant. Anyway, I'm not going to force ANTLR to do it this way, at least not for this episode.
Second, we can construct a tree, and then use a "visitor" pattern. A visitor is passed top level node corresponding to the whole program, and we manually go down the parse tree whichever way we want. The best thing about it is that the visitor naturally lends itself to a series of methods like parseSomeType
, each of which parses node of someType
and returns whatever is appropriate. The downside is that it's a lot of boilerplate code for the recursion.
Third, we can use a "listener", and that's the third file ANTLR generated. With a listener, we still construct a parse tree, but instead of doing recursing ourselves, we have a walker walk the tree and call enterNodeType
and exitNodeType
for us - with all of these defaulting to empty functions, so we only need to override the ones we want. The downside is that we cannot return anything - it turns the whole program into a series of enter
and exit
events, we need to pass the data by some side channel.
There are some other ways, but let's stick to that.
Visitor
Here's implementation of our language following the "visitor" pattern:
#!/usr/bin/env python3
from antlr4 import *
from MathLexer import MathLexer
from MathParser import MathParser
import sys
class MathProgram:
def __init__(self, program):
self.program = program
def evalExpr(self, node):
children = node.children
if len(children) == 3:
a = self.evalExpr(children[0])
b = children[1].getText()
c = self.evalTerm(children[2])
if b == "+":
return a + c
else:
return a - c
else:
return self.evalTerm(children[0])
def evalTerm(self, node):
children = node.children
if len(children) == 3:
a = self.evalTerm(children[0])
b = children[1].getText()
c = self.evalFactor(children[2])
if b == "*":
return a * c
else:
return a / c
else:
return self.evalFactor(children[0])
def evalFactor(self, node):
children = node.children
if len(children) == 3:
return self.evalExpr(children[1])
elif node.number():
return float(node.number().getText())
else:
return self.getVar((node.identifier().getText()))
def getVar(self, name):
if name not in self.vars:
self.vars[name] = float(input(f"Enter value for {name}: "))
return self.vars[name]
def run(self):
self.vars = {}
result = self.evalExpr(self.program)
print(result)
def parseFile(path):
lexer = MathLexer(FileStream(path))
stream = CommonTokenStream(lexer)
parser = MathParser(stream)
tree = parser.expr()
MathProgram(tree).run()
if __name__ == "__main__":
path = sys.argv[1]
parseFile(path)
Step by step:
- there's considerable boilerplate for what should just be
MathParser.parseFile(path).expr()
- in ANTLR, there's no concept of "top level" grammar rule, you can start with anything, that's why we explicitly call
expr
- we can just as well ask fornumber
or anything else. In some other kinds of parser generators, there's a single special top level rule (for example Raku's builtin PEG parsers call itTOP
). - the API is really ugly - for most obvious example whoever wrote it decided to go with Java-level cringe
.getText()
instead of declaring.text
as@property
..children
only works because it's directly an attribute, officially they'd rather you use the same Java cringe.getChildren()
,.getChildCount()
etc. - if we could put code directly in the grammar, we could do
expr: expr '+' term { code } | expr '-' term { code } | term { code }
, so we wouldn't need to figure out which branch we followed the second time, even though the parser already did that - all the
self.evalSomething(children[i])
feels like annoying boilerplate code
Listener
The alternative to this is "listener" pattern. Which sounds sort of like we're heading for a SAX-style XML parser, but it's not really true. We have access to the whole tree available at all times. Here's the implementation:
#!/usr/bin/env python3
from antlr4 import *
from MathLexer import MathLexer
from MathParser import MathParser
from MathListener import MathListener
import sys
class MathProgram(MathListener):
def exitNumber(self, node):
value = float(node.getText())
self.stack.append(value)
def exitIdentifier(self, node):
value = self.getVar(node.getText())
self.stack.append(value)
def exitTerm(self, node):
if len(node.children) == 3:
b = self.stack.pop()
a = self.stack.pop()
if node.children[1].getText() == "*":
self.stack.append(a * b)
else:
self.stack.append(a / b)
def exitExpr(self, node):
if len(node.children) == 3:
b = self.stack.pop()
a = self.stack.pop()
if node.children[1].getText() == "+":
self.stack.append(a + b)
else:
self.stack.append(a - b)
def getVar(self, name):
if name not in self.vars:
self.vars[name] = float(input(f"Enter value for {name}: "))
return self.vars[name]
def run(self, node):
self.stack = []
self.vars = {}
ParseTreeWalker().walk(self, node)
result = self.stack[0]
print(result)
def parseFile(path):
lexer = MathLexer(FileStream(path))
stream = CommonTokenStream(lexer)
parser = MathParser(stream)
tree = parser.expr()
MathProgram().run(tree)
if __name__ == "__main__":
path = sys.argv[1]
parseFile(path)
You need to figure out how you'll handle the data. Because how simple our language is, we could get away with just self.stack
, and pushing and popping it.
Step by step:
- it's even more boilerplate to get started, additional
ParseTreeWalker().walk(self, node)
is needed, even though we inherit from a base class that really could have that logic itself - the API reeks of Java just much as the previous one
- we managed to skip all the
enter
nodes, and most of theexit
nodes - we can also skip the rules likeexpr: term
,term: factor
, andfactor: ( expr )
as numerical value isn't changed there - so at least for this simple program it's less verbose than the visitor version - you need to figure out how you'll be passing data around on your own - there's no universal solution
These are just two basic design patterns recommended by ANTLR. You can absolutely use another pattern, like for example add extra methods directly to each node type and run that, put code directly into the grammar, or some even different way.
Should you use Python ANTLR 4?
ANTLR 4 is probably the most powerful parser generator out there, so in this sense it's a good choice, even if the API is just nasty.
Usually parsing will be only a small part of your program, so you can pinch your nose and get over the foul stench of Java. It might very well be a better choice than writing a recursive descent parser yourself.
As for improving ANTLR 4 itself, there are two ways. From what I've seen, porting ANTLR 4 to a new language would require some amount of effort, but very little difficulty. It's basically around 10k lines of code to rewrite class-by-class, to implement ANTLR runtime in your language, and far less to make code generator work. You won't need to understand any of the underlying theory. It seems like a reasonable side project for a few weeks.
The other way, and something I really hope someone does, is to figure out some design patterns that fit non-Java languages better, and pull request those.
For Python, it's just crazy that these aren't supported:
node.text
SomeParser().parseFile(path).aNodeType()
But it would take a lot more than that to make it feel like it fits the language.
Code
All code examples for the series will be in this repository.
Top comments (2)
Raku's parsers aren't PEGs. See Comparing/contrasting Raku Grammars and PEGs.
The
TOP
rule is only special in the sense it's the rule called to enter a grammar if you don't specify which rule to start at.You can specify any rule as the parsing entry point. Just pass a
rule => 'rule-name'
argument to a parsing routine.No parser generator really follows the theoretical classes (LR, LL, GLR, PEG etc) exactly, but these are still best ways to describe parsing strategy, and you can't ignore which class you have, as each of them forces grammars to be built certain ways, and won't work otherwise. Raku grammars are definitely PEG style, and very far from other parsing styles.
Even the article you linked states "Raku Grammars: Some folk describe Raku Grammars as PEGs". Their counter-argument that Raku supports putting semantic predicates in the grammar is really weird, as most parser generators do, like ANTLR for sure, and even Bison.
The thing about having dedicated TOP rule or not is more a note about API design. LL and PEG style parsers can generally start from any rule. ANTLR 4 doesn't seem to have any way to designate any rule as default TOP rule for more concise API, I think what Raku is doing here with default rule makes sense.