## DEV Community

Tomasz Wegrzanowski

Posted on

# 100 Languages Speedrun: Episode 14: Recursive Descent Parser

Back in episode 8, we took our first steps towards creating a new programming language, and implemented a tokenizer. If you didn't read that episode, I recommend doing it now, otherwise this episode might be a bit confusing.

### Our language

We'll create a very simple language for doing math.

It will have one `statement` per line, and there will only be these statements:

• reading number from user and saving it to variable `read name`
• printing variable `print name`
• setting variable `set name = expression`

Any empty line will be ignored. `#` will start a comment that will go until end of current line.

So a program could look like this:

``````# Program for converting miles to km
set kms = miles * 1.60934
print kms
``````

### What is an `expression`?

It's very clear what a `statement` is, but what is an `expression`?

Let's write down the possibilities. These would be enough for our simple language:

• a number like `1.60934`
• `( expression )`
• `expression + expression`
• `expression - expression`
• `expression * expression`
• `expression / expression`

And we have the whole language!

Well, almost. There's just tiny problem of ambiguity. Right now the language has no idea if `3 + 4 * 5` means `3 + ( 4 * 5 )` or `(3 + 4) * 5`. It also doesn't know if `2 - 3 - 4` is `2 - (3 - 4)` or `(2 - 3) - 4`.

There are two approaches to solving this in use. One is to tell our parser "operator precedence" (if `+` or `*` goes first), and for each operator its associativity (left or right or not allowed - it determines which way `2 - 3 - 4` is parsed).

The other is to rewrite grammar to get rid of such ambiguity, and that's generally the simpler approach.

### Unambiguous grammar

Instead of having `expression` with 6 possibilities, it will now only have these:

• `expression + product`
• `expression - product`
• `product`

Then `product` is:

• `product * factor`
• `product / factor`
• `factor`

And `factor` is:

• `( expression )`
• `number`

Expressed this way, it's impossible to have ambiguities. You can verify it yourself. A product cannot possibly have anything with `+` on either side, unless we use parentheses, because none of the rules it refers to has `+`. And `2 - 3 - 4` cannot possibly mean `2 - (3 - 4)` because the right side of `-` is a `product`, so it cannot have any `-` in it, without explicit parentheses.

By the way don't think too much about names like `product` or `factor`. I don't find them terribly intuitive myself. You could call it `expression2`, `expression3` or such.

A lot of "parser generator" tools accept definitions like what we just came up with. Unfortunately for this episode we won't be using a parser generator, we'll be writing Recursive Descent Parser ourselves, and it requires grammars to be tweaked a bit.

### Grammar for Recursive Descent Parser

I didn't discuss what Recursive Descent Parser are yet. We'll get to it soon.

But a big limitation of Recursive Descent Parser is that none of the parsing rules can refer to itself on the left, because then it would goes into infinite recursive loop.

So let's rewrite our definitions in a way that won't have this issue:

`expression` is:

• `product ( ("+"|"-") product )*`

`product` is:

• `factor ( ("*"|"/") factor ]*)`

And `factor` is:

• `"(" expression ")"`
• `number`

I had to extend the notation a bit. `a|b` means `a` or `b`. `( ... )*` means zero or more of whatever's in the parentheses. And as there are special characters now, I'm quoting all literal symbols like `"+"`, or `"("`.

### Tokenizer

OK, let's start with tokenizer. It will use `StringScanner` from Ruby standard library, and some regexp. Here's tokenizer program for our math language:

``````#!/usr/bin/env ruby

require "strscan"
require "pathname"

class MathLanguage
def initialize(path)
end

def tokenize(string)
scanner = StringScanner.new(string)
tokens = []
until scanner.eos?
if scanner.scan(/\s+/)
# do nothing
elsif scanner.scan(/#.*/)
# comments - ignore rest of line
elsif scanner.scan(/-?\d+(?:\.\d*)?/)
tokens << { type: "number", value: scanner.matched.to_f }
tokens << { type: scanner.matched }
elsif scanner.scan(/[a-zA-Z][a-zA-Z0-9]*/)
tokens << { type: "id", value: scanner.matched }
else
raise "Invalid character #{scanner.rest}"
end
end
tokens
end

def call
@lines.each do |line|
tokens = tokenize(line)
puts "Line: \"#{line.chomp}\" has tokens:"
tokens.each do |token|
p token
end
end
end
end

unless ARGV.size == 1
STDERR.puts "Usage: #{\$0} file.math"
exit 1
end

MathLanguage.new(ARGV[0]).call
``````

If we run it, we can see it turns all lines into tokens, and skips comments:

``````\$ ./math_tokenizer.rb miles_to_km.math
Line: "# Program for converting miles to km" has tokens:
{:type=>"id", :value=>"miles"}
Line: "set kms = miles * 1.60934" has tokens:
{:type=>"set"}
{:type=>"id", :value=>"kms"}
{:type=>"="}
{:type=>"id", :value=>"miles"}
{:type=>"*"}
{:type=>"number", :value=>1.60934}
Line: "print kms" has tokens:
{:type=>"print"}
{:type=>"id", :value=>"kms"}
``````

### What is a Recursive Descent Parser?

We have a list of definitions. Now what is this Recursive Descent Parser I've been alluding to?

It's simply a bunch of method (or functions if you don't like OOP terminology) - one per grammar rule. Each method consumes some of the tokens from the left, and returns parsed content.

For Recursive Descent Parser, methods for rule `X` should expect that `X` will be on the left of the list, and raise exception if it isn't - but they should be OK with there being some leftover tokens.

As we're only parsing and not executing anything, for now our loop will look like this:

``````  def call
@lines.each do |line|
@tokens = tokenize(line)
next if @tokens.empty?
statement = parse_statement
raise "Extra tokens left over" unless @tokens.empty?
pp statement
end
end
``````

I'm saving `@tokens` to instance variable, so we don't need to pass it to every method.

Also to simplify the code, here's some helper methods. `next_token_is?` checks if next token is of any of the expected types and return `true` or `false`. `expect_token` shifts token if it's one of expected types, or raises exception if it's of a wrong type:

``````  def next_token_is?(*types)
@tokens[0] && types.include?(@tokens[0][:type])
end

def expect_token(*types)
raise "Parse error" unless next_token_is?(*types)
@tokens.shift
end
``````

And with that, here's our parser:

``````  def parse_factor
token = expect_token("number", "id", "(")
case token[:type]
when "number", "id"
token
when "("
result = parse_expression
expect_token(")")
result
end
end

def parse_product
result = parse_factor
while next_token_is?("*", "/")
op_token = @tokens.shift
result = {type: op_token[:type], left: result, right: parse_factor}
end
result
end

def parse_expression
result = parse_product
while next_token_is?("+", "-")
op_token = @tokens.shift
result = {type: op_token[:type], left: result, right: parse_product}
end
result
end

def parse_statement
case token[:type]
token = expect_token("id")
when "set"
var_token = expect_token("id")
expect_token("=")
expr = parse_expression
{type: "set", id: var_token[:value], expr: expr}
when "print"
token = expect_token("id")
{type: "print", id: token[:value]}
end
end
``````

I recommend taking a good look at this code. Recursive Descent Parsers are very readable like that, but since it's probably not the type of code you've seen before, it might be a bit confusing initially.

We can give it a try:

``````\$  ./math_parser.rb miles_to_km.math
:id=>"miles"}
{:type=>"set",
:id=>"kms",
:expr=>
{:type=>"*",
:left=>{:type=>"id", :value=>"miles"},
:right=>{:type=>"number", :value=>1.60934}}}
{:type=>"print",
:id=>"kms"}
``````

### Error handling

One huge advantage of Recursive Descent Parsers over parser generators is potential for great error reporting. Recursive Descent is much closer to the way humans think, so if there's an error, we can say something like `At line 20 char 1: expected one of "read", "set", or "print", but got token: "number"`.

For complicated technical reasons, parsers created by parser generators often have atrociously bad error reporting.

I'm mentioning this, as for sake of simplicity, for this episode we won't be doing anything beyond `raise "Parse error"`, but we could, and we will in the future.

### Run it!

To run the code we change the loop slightly:

``````  def call
@variables = {}
@lines.each do |line|
@tokens = tokenize(line)
next if @tokens.empty?
statement = parse_statement
raise "Extra tokens left over" unless @tokens.empty?
eval_statement(statement)
end
end
``````

And the code to evaluate parsed statements and expressions:

``````  def eval_expression(expr)
case expr[:type]
when "number"
expr[:value]
when "id"
@variables[expr[:value]]
when "+", "-", "*", "/"
left = eval_expression(expr[:left])
right = eval_expression(expr[:right])
left.send(expr[:type], right)
else
raise
end
end

def eval_statement(statement)
case statement[:type]
when "print"
puts @variables[statement[:id]]
when "set"
@variables[statement[:id]] = eval_expression(statement[:expr])
else
raise
end
end
``````

I left the `else raise` there to catch programming errors. If we coded things correctly, it should never be reached.

If you're not familiar with Ruby, `left.send(expr[:type], right)` is equivalent to `left + right`, `left - right`, `left * right`, or `left / right` depending on `expr[:type]`.

And that's all we needed to create our proper programming language:

``````\$ ./math.rb miles_to_km.math
500
804.67
``````

### Where do we go from here?

We didn't need any special tooling, just some regular expressions for tokenizer, and some plain recursive methods for parser and for execution.

This approach is perfectly fine for simple programming languages. You can just add more token types, more grammar rules, some error handling, and some interesting interpreter, and you can have a decent language for your needs. Nothing about it is specific to Ruby - you can use any language. Regular expression support is definitely a big plus, but people have been writing tokenizers without regular expressions too, looking one character at a time.

In the future we'll try to do something similar with tools like PLY (Python Lex-Yacc), or ANTLR.