Konstantin Grechishchev

Posted on Aug 6, 2022 • Edited on Aug 14, 2022

Writing a new programming language. Part I: a bit of boring theory

#rust #tutorial #grammar #programming

Introduction

Every serious software engineer should make an attempt to write their own programming language at least once. Apart from the set of knowledge obtained, it could be a fun process! I feel like it is now time for me to try this challenge.

I would like to challenge myself to write a series of tutorials to help the reader to learn about formal grammars, and LR parsers and get some practical skills to be applied in their project.

I don't have a goal to write a new programming language to be used anywhere for any specific purpose. In fact, I highly discourage anyone from using it :). I am not planning to make this language compile, instead, we would write the interpreter to run the programs written in it.

I am hoping to implement the support of expressions, if statements, basic loops, and functions, however exact feature set and the number of tutorials would depend on audience interest (number of reads, comments, and amount of reactions to the posts). I'll keep all my code in the kgrech/lr-lang with a branch pointing to the commit representing the code state by end of every tutorial.

I'll name the language I am writing as LR-language, simply because we are going to use LR parser to parse its grammar and because I was not able to come up with any better name.

Useful knowledge

I would be writing the language in Rust, so knowledge of the syntax and Rust skills would help. Rust book is a good resource to learn.

I'll be using the lalrpop parser generation library, so keep to documentation for it handy.

Some understanding of Chomsky grammar hierarchy, LR Parsers and Backus normal form (BNF) would be definitely useful, however, I intentionally do not plan to dive deep into the theory as I don't think I have enough theoretical background myself. There are good books to learn it. Instead, I'll focus on the practical aspects of it, while I have to spend some time explaining a theory in the first tutorial so that unprepared readers have an idea about what is going on. If you are not interested in this, please jump right away to Part II!

Formal grammars (crash course)

Terminal symbols and tokenization

The program written in our language consists of so-called terminal and non-terminal symbols. The terminal symbols are known and defined in advance and usually are just one character or a sequence of characters. For example, the terminal symbol is one of the following:

Letters, digits, underscores.
Expression operations (+, -, *, /, <, >, <=, >=).
Language keywords (for, if, else, let).

The first step of the program "compilation" is tokenization. The tokenization is usually performed by the component called lexer. Lexer's duty is to read the text of the program and convert it to the sequence of the terminal symbols with their values attached to them. The sequence of the terminal symbols are then processed by the component call parser to derive non-terminal symbols and produce the abstract syntax tree (AST).

You don't really need the formal grammar to implement the lexer. In fact, the example of the simplest lexer is just the split by whitespace function. In practice the lexers are more complex, but typically could be configured using a set of string literal and regulars expressions (this is how it is done in lalrpop library).

The examples above might look a bit contradictory as language keywords contain letters on their own. E.g. "for" consists of 3 letters. Are they 3 terminal symbols ('f', 'o', and 'r') or 1 terminal symbol (for)? This is called lexer ambiguity and there are different ways to resolve it. A typical strategy is to consider everything which matches language keywords as a single terminal symbol and split everything else. This is why you can not use keywords as the variable name in most of the programming languages. And yes, the variable name is an example of the non-terminal symbol.

Formal grammars

Formal grammar is the way to give meaning to the sequence of the terminal symbols. The grammar consists of the finite set of so-called "production rules". You can think about them as the set of rules defining how can we "replace" a sequence of the terminal and non-terminal symbols with a new non-terminal symbol (while technically you can replace it with a sequence of them, it makes everything a lot more complex).

The most common way to write such rules is called Backus–Naur form. The exact syntax of the rule is different based on the parser generation library used, but the idea could be illustrate by the following example:

<digit> = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 9
<number> = <digit> | <digit> <number>

The words in triangular brackets are called non-terminal symbols, "|" and "=" are the meta symbols and everything else is the terminal symbol.

So, non-terminal symbols are received by replacement of some terminals and non-terminal according to the production rule.

The hierarchy of grammars was proposed by Noam Chomsky in 1950-th. Chomsky divided all grammars into four types (from simplest to more complex):

Regular.
Context-free.
Context-sensitive.
Recursively enumerable.

The difference between them is what kind of rules are allowed in each. We are mainly interested in the first two as there are well-know ways to implement parser for them. We will skip the last two as they are the way too complex to parse and hence they are not so useful.

Regular grammars

A well-known example of regular grammar is regular expressions (yes, that is why they are called this way) available in the standard library of most languages. The key property of the regular grammar described language is that it can be decided (parsed) by a finite state automaton. That means that no matter how long the program is, it would take a constant amount of memory to parse it.

In case of regular expressions:

Each character is the terminal symbol.
The regular expression is the grammar (written not in Backus–Naur form by the way).
The terminal symbols are captured groups.
The "program" is the string.

Unfortunately, regular grammar rules are too simple to express most of the programming languages including our baby LR-Language. So we will have to use context-free grammar!

Context-free grammars

Formally speaking, context-free grammar is the one allowing the single non-terminal symbol to appear in the left part of the rule and the sequence of one or more terminals and/or non-terminals in the right part of it in any order. In other words, the right side of the rule is not restricted. This is not true for the regular grammar, which has restrictions on the order of the terminals/non-terminals on the right side. Attentive readers could spot that every regular grammar is context-free, but not vice-versa.

Context-free grammars could be parsed by the deterministic pushdown automaton.

Abstract syntax tree

The main reason to define and apply the production rules is to build the so-called abstract syntax tree (AST). The AST is a set of data structures that represent the parsed language. It is a tree because one data structure contains (points) to another. The AST structures are written in one of the existing programming languages (we will call it target language), i.e. in the language the compiler and parser are written. In our case target language is Rust, because the parser of our LR-language is written in rust.

When the parser parses the program it applies production rules one by one. The first rule would only take a set of terminal symbols as there were non-terminals produced yet.

Most of the parser generation libraries allow you to write a piece code in the target language next to the rule. This could converts the rule input into the data structure representing a non-terminal symbol. For the first rule, the input would be just a set of terminal symbol strings. If there is some rule containing the non-terminal symbol produced above on the right side, the associated target language code for this rule must accept the above data structure as an input.

This way, when we apply a rule by rule, we nest one non-terminal data structure into other forming the abstract syntax tree.

The interpreter can then traverse this tree and execute the program.

Next steps

You've reached here? Good job on reading all of the above!

We will apply the above theory in the next part of the tutorial and write the first part of our language. We are going to define the first version of the formal grammar using the lalrpop parser generation library, generate deterministic pushdown automaton in rust and write the interpreter to run our first program written in LR-language.

Stay in touch!

Bonus: Russian-speaking readers are welcome to watch my youtube video about formal grammars.

DEV Community

Writing a new programming language. Part I: a bit of boring theory

Introduction

Useful knowledge

Formal grammars (crash course)

Terminal symbols and tokenization

Formal grammars

Regular grammars

Context-free grammars

Abstract syntax tree

Next steps

Top comments (0)

Read next

2097. Valid Arrangement of Pairs

7 practical ways to build Backends much faster as a developer

My 2024 Journey: Learning from My Mistakes as a Junior Dev

Is WebForms Core the Most Important Web Technology in 2024?