So far we've tried two parser generators for Python - PLY, and ANTLR 4 (which took two episodes - one, two).
Time for another one - SLY, a successor to PLY. It doesn't do anything special parsing-wise; it's just another run-of-the-mill LR-style parser generator. Its main selling point is a much nicer Python interface.
Math Language Parser
Let's start with something very simple - a program to parse and run our "math" language, the same one I created 7 versions of for ANTLR 4 episodes. In SLY it's so much more concise:
#!/usr/bin/env python3

from sly import Lexer, Parser
import sys

class MathLexer(Lexer):
    tokens = { PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN, NUM, ID }
    ignore = " \t\r\n"
    PLUS = r"\+"
    MINUS = r"-"
    TIMES = r"\*"
    DIVIDE = r"/"
    LPAREN = r"\("
    RPAREN = r"\)"
    NUM = r"-?[0-9]+(\.[0-9]*)?"
    ID = r"[a-zA-Z_][a-zA-Z0-9_]*"

class MathParser(Parser):
    tokens = MathLexer.tokens

    def __init__(self):
        self.vars = {}

    @_("expr PLUS term")
    def expr(self, p):
        return p.expr + p.term

    @_("expr MINUS term")
    def expr(self, p):
        return p.expr - p.term

    @_("term")
    def expr(self, p):
        return p.term

    @_("term TIMES factor")
    def term(self, p):
        return p.term * p.factor

    @_("term DIVIDE factor")
    def term(self, p):
        return p.term / p.factor

    @_("factor")
    def term(self, p):
        return p.factor

    @_("LPAREN expr RPAREN")
    def factor(self, p):
        return p.expr

    @_("NUM")
    def factor(self, p):
        return float(p.NUM)

    @_("ID")
    def factor(self, p):
        return self.getVar(p.ID)

    def getVar(self, name):
        if name not in self.vars:
            self.vars[name] = float(input(f"Enter value for {name}: "))
        return self.vars[name]

if __name__ == "__main__":
    path = sys.argv[1]
    with open(path) as f:
        text = f.read()
    lexer = MathLexer()
    parser = MathParser()
    result = parser.parse(lexer.tokenize(text))
    print(result)
We can run it on the same three examples:
a.math - operator precedence test:
300 + 50 * 4 + 80 / 4 - (80 - 30) * 2
miles_to_km.math - unit converter:
miles * 1.60934
circle_area.math - a test program to verify it asks for the same variable only once:
3.14159265359 * r * r
And we can try to run it:
$ ./math.py math/a.math
420.0
$ ./math.py math/miles_to_km.math
Enter value for miles: 420
675.9228
$ ./math.py math/circle_area.math
Enter value for r: 69
14957.12262374199
Let's follow how it works step by step:
Lexer
The lexer is the part responsible for chopping up the input text into tokens. So a text like 2 + 3 * 4 becomes [NUM(2), PLUS, NUM(3), TIMES, NUM(4)].
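To make that concrete, here's a minimal stdlib-only sketch of the tokenizing step, using the same regular expressions as the lexer in this episode. This is not SLY's code: TOKEN_SPEC, MASTER_RE, and tokenize are names made up for this illustration, and the sketch simply combines all the patterns into one big regex with named groups.

```python
import re

# Same token patterns as MathLexer, as (name, regex) pairs in definition order.
TOKEN_SPEC = [
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("TIMES", r"\*"),
    ("DIVIDE", r"/"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("NUM", r"-?[0-9]+(\.[0-9]*)?"),
    ("ID", r"[a-zA-Z_][a-zA-Z0-9_]*"),
]

# One alternation of named groups; the first alternative that matches wins.
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

# Characters skipped between tokens, like the lexer's ignore string.
IGNORE = " \t\r\n"


def tokenize(text):
    """Yield (token_type, token_text) pairs for the math language."""
    index = 0
    while index < len(text):
        if text[index] in IGNORE:
            index += 1
            continue
        match = MASTER_RE.match(text, index)
        if match is None:
            raise SyntaxError(f"Bad character {text[index]!r} at index {index}")
        # lastgroup is the name of the outermost group that matched.
        yield match.lastgroup, match.group()
        index = match.end()


if __name__ == "__main__":
    print(list(tokenize("2 + 3 * 4")))
```

Patterns are tried in the order listed, so in this sketch a - is always matched as MINUS before NUM gets a chance to use its optional leading minus.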
The lexer is very tiny: we just define a set of 8 tokens in our language, a regular expression for each of them, and an ignore rule to skip any extra whitespace between the tokens:
class MathLexer(Lexer):
    tokens = { PLUS, MINUS, TIMES, DIVIDE, LPAREN, RPAREN, NUM, ID }
    ignore = " \t\r\n"
    PLUS = r"\+"
    MINUS = r"-"
    TIMES = r"\*"
    DIVIDE = r"/"
    LPAREN = r"\("
    RPAREN = r"\)"
    NUM = r"-?[0-9]+(\.[0-9]*)?"
    ID = r"[a-zA-Z_][a-zA-Z0-9_]*"
We could add some error handling to the lexer with an error method. Also, apparently SLY wants us to maintain self.lineno manually for error messages, which is weirdly common for parser generators and feels really inappropriate for a language like Python, but I skipped that part.
We could do some pre-processing here, like converting the value carried by NUM from a string to a float, but it's not really necessary, as we can do it in the parsing stage as well.
Parser
The parser has very ordinary rules, but it encodes them in a very unusual way:
class MathParser(Parser):
    tokens = MathLexer.tokens

    def __init__(self):
        self.vars = {}

    @_("expr PLUS term")
    def expr(self, p):
        return p.expr + p.term

    @_("expr MINUS term")
    def expr(self, p):
        return p.expr - p.term

    @_("term")
    def expr(self, p):
        return p.term

    @_("term TIMES factor")
    def term(self, p):
        return p.term * p.factor

    @_("term DIVIDE factor")
    def term(self, p):
        return p.term / p.factor

    @_("factor")
    def term(self, p):
        return p.factor

    @_("LPAREN expr RPAREN")
    def factor(self, p):
        return p.expr

    @_("NUM")
    def factor(self, p):
        return float(p.NUM)

    @_("ID")
    def factor(self, p):
        return self.getVar(p.ID)

    def getVar(self, name):
        if name not in self.vars:
            self.vars[name] = float(input(f"Enter value for {name}: "))
        return self.vars[name]
You might have noticed a lot of methods with the same name. That's how SLY encodes alternatives. To say that expr can be one of three things (expr PLUS term, expr MINUS term, or term), you define three expr methods, each with a different @_ decorator.
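Normally a second def with the same name just overwrites the first one, so some trick is needed for this to work. Here's a stdlib-only sketch of one way Python allows it: a metaclass whose class-body namespace records every decorated function as it is defined. This is not SLY's actual code, and rule, RuleNamespace, GrammarMeta, Grammar, and Demo are all names made up for this illustration.

```python
def rule(text):
    """Hypothetical stand-in for a rule decorator: tag a function with its grammar rule."""
    def decorate(func):
        func.rule = text
        return func
    return decorate


class RuleNamespace(dict):
    """Class-body namespace that remembers every tagged function, even if names repeat."""

    def __setitem__(self, key, value):
        if hasattr(value, "rule"):
            # Record the function before a later def with the same name replaces it.
            self.setdefault("_rules", []).append((key, value.rule, value))
        super().__setitem__(key, value)


class GrammarMeta(type):
    @classmethod
    def __prepare__(mcls, name, bases, **kwargs):
        # The class body is executed with this object as its namespace.
        return RuleNamespace()

    def __new__(mcls, name, bases, namespace, **kwargs):
        return super().__new__(mcls, name, bases, dict(namespace), **kwargs)


class Grammar(metaclass=GrammarMeta):
    pass


class Demo(Grammar):
    @rule("expr PLUS term")
    def expr(self, p):
        return p[0] + p[2]

    @rule("term")
    def expr(self, p):
        return p[0]


if __name__ == "__main__":
    # Both expr alternatives survive, even though Demo.expr is only the last one.
    for name, text, func in Demo._rules:
        print(f"{name} : {text}")
```

The class attribute expr ends up pointing at the last definition only, but the recorded list still has both alternatives, which is all a parser generator needs to build its tables.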
The match object is passed to each of those rule methods as p. If a certain sub-match occurs once, you can refer to it with p.expr or such. If it occurs multiple times, you'd need to use p.expr0, p.expr1, etc.
The __init__ and getVar methods are specific to our math program, and not related to SLY.
Running it
And finally we run the program like this:
if __name__ == "__main__":
    path = sys.argv[1]
    with open(path) as f:
        text = f.read()
    lexer = MathLexer()
    parser = MathParser()
    result = parser.parse(lexer.tokenize(text))
    print(result)
There's no tokenizeFile, so we need to read the file manually. Once we do that, we can create Lexer and Parser objects, and call result = parser.parse(lexer.tokenize(text)). The API is very simple.
Should you use SLY?
SLY is not going to win any Excellence in Parsing awards. It's just another LR parser generator, so it has all the usual LR issues: limited power, cryptic shift/reduce error messages if you mess up the grammar, and poor error messages and error recovery for user syntax errors.
On the other hand, it has a fantastic Python API, so if you want to build something fast that's not too complicated, it's much less work than figuring out ANTLR 4, or writing your own parser by hand.
I wouldn't recommend it for more complex languages, but there are a lot of simple cases SLY is a perfect fit for.
Code
All code examples for the series will be in this repository.