Jason Barr

Posted on May 23, 2023 • Edited on Jul 2, 2023

How To Create Your Own Programming Language

#programminglanguages #compilers #interpreters #javascript

Programming languages fascinate me.

They are the means by which we communicate human desires into machine-usable instructions that can create amazing art, produce scientific breakthroughs, and carry out just about every type of business on the planet.

We break down these human-readable words and phrases into electrical impulses that turn tiny switches on and off, and the entire modern world is based on the movement of those little switches.

To program is an act of creation, and programming languages are the means by which we programmers create our universes.

In this series of posts, we will build our own programming language together.

Why Create Your Own Language

There are hundreds of programming languages already. Why create your own?

Well, maybe there's something you want to do that existing languages just don't do well enough for your liking.

Maybe your passion is data visualization, and you want a language geared specifically for that task.

Languages are also fantastic for experiments. Some of the most beloved languages today have their roots in language experiments. For example, the ML family of languages (which includes OCaml and F#) started life as a theoretical type system. Lisp originally came from a paper on theories related to recursive functions. Ideas from these 2 language families have made their way into several modern languages that are in widespread use, including Rust and JavaScript.

Maybe you just enjoy a good programming challenge. Some of the hardest programming problems I've ever solved have come about while building interpreters and compilers. Creating a programming language brings together skills from a wide range of programming and computer science topics, including data structures and algorithms, computability theory, and more.

As for me? I just think creating languages is fun. I enjoy it, and I hope that maybe some of the people who read this series will come to enjoy it as well. I'm a better programmer because of the projects I've built while learning and creating new languages, and I'm sure you'll learn a lot from it too.

I'm far from an expert on language design and implementation. I just have a lot of fun doing it, and I'm here to document what I'm learning in case it's helpful for anyone else.

What Is A Programming Language?

Is a language the symbols (syntax) you use to tell the computer what to do? Is it the semantics (the meaning of those symbols)? Is it something else? A compiler or interpreter? All of the above?

I'm going to say it's a little bit of all of them. Obviously the semantics are the most important part. You can use the same syntax to implement vastly different semantic models. See the "Metacircular Evaluator" in chapter 4 of Structure and Interpretation of Computer Programs for an example of this. Using the same s-expression based syntax, the authors implement 4 radically different evaluator semantics.

But syntax matters as well. A good language probably feels good to write code in, and syntax obviously plays a part in that. Of course, what "feels good" might be very different for different programmers. Some people absolutely despise the minimal syntax of Lisp family languages with all its parentheses, but others find the symmetry between code and data structures that syntax allows to be the most beautiful thing in the world.

In the absence of a formal specification, the compiler and/or interpreter may serve as the source of truth for a language's behavior and semantics. Such will be the case for the language we build in this series. We're going to build a compiler that translates our source language into JavaScript. You might prefer to call that a transpiler instead of a compiler, since it doesn't emit assembly code or binary files. That's ok, you can call it whatever you want – it doesn't change the functionality.

How A Compiler Works

A compiler takes a string of code in a source language and converts it into another language that can be executed by a machine. Some compilers output binaries directly that are interpreted directly by the computer's hardware. We're going to compile our language to JavaScript, which can be executed either in a browser or by a server-side runtime like Node.js.

A compiler has several stages through which it transforms source code into its output, which can broadly be thought of as its front end and back end.

In the front end of the compiler, code gets read and transformed into a data structure used internally by the compiler. Usually a tree of some kind. This is usually done by lexical and syntactic analysis of the source code, or lexing and parsing.

During lexing, the compiler transforms the input text into a series of tokens that can be further analyzed by the parser.

The parser then takes the stream of tokens and converts them into a tree. This may be a parse tree, sometimes also known as a Concrete Syntax Tree, which contains all the data from the parsing stage, or it may produce an Abstract Syntax Tree (AST) which contains only the information necessary for further processing.

The compiler front end then moves into the semantic analysis phase which performs various checks on the AST to make sure the input program is well-formed.

Semantic analysis includes things like variable resolution, to make sure no uninitialized variables are used in the program, as well as type checking and inference (if it's a statically-typed language). Some languages perform even more semantic checks.

During the front end phases of the compiler, linguistic complexity tends to increase. We go from flat text to a data structure, and add annotations to the data structure during the process of semantic checking.

Then during the back end phases of the compiler the data structure created by the front end tends to be lowered and linguistic complexity decreases.

Higher level forms that are syntactic sugar get desugared into their core language forms.

Trees get converted into intermediate representations that allow for optimizations.

Finally, the compiler produces the end product and either outputs it into a file or feeds it into an evaluator for execution.

Our New Programming Language: Wanda

The language we're going to build throughout this series is called Wanda. Wanda is a fairly simple expression-based language with a few core forms, but it should be fully-featured enough to write interesting programs with.

Wanda's syntax is s-expression based, so it looks like a Lisp. Wanda code compiles to JavaScript values, and so like JavaScript it has first-class functions (i.e. functions that can be used like any other value, including as arguments to and return values from other functions).

Unlike JavaScript, though, we're going to restrict object creation to a class-based system. Under the hood the classes are built on top of JavaScript's prototypal delegation, but in Wanda itself you'll need to define a class to construct objects.

Wanda has a few primitive types:

; numbers are simply JavaScript numbers, i.e. 64 bit floats
15
; strings are JavaScript strings
"hello"
; booleans are JavaScript booleans
true
; nil combines JavaScript null and undefined
nil
; keywords are symbols with a colon at the beginning followed by a valid Wanda identifier
:keyword
; identifiers can have several characters not allowed in JavaScript
this-is+a*valid&identifier

Use the def keyword to define a constant or function:

(def name "Jason")
(def inc (x)
  (+ x 1))

Or var to define a variable and set! to change its value:

(var x 5)
(set! x 10)

You can define vector and map literals:

[1, 2, 3] ; note that commas are treated as whitespace
{:a "hi" :b 14}

Wanda has a for form that is an expression:

; for expression through the loop
; produces a list [0, 1, 2, 3, 4]
(for map ((i (range 5)))
  (println i)
  (inc i))

; produces the number 55
(for fold ((sum 0) (n (range 1 11)))
  (+ sum n))

Conditionals are expressions, so you can use them in variable assignment:

(def status (if (equal? state "Loading")
                "Loading"
                "Finished"))

Unlike with other Lisp-style languages, in Wanda you can access object properties with dot notation:

(def me (new Person "Jason" 42))
(println me.name) ; prints "Jason"

Wanda is gradually typed, including generics, so if you add type annotations to variables or functions it turns the type checker on:

(type Person { name: string, age: number })
(var (me: Person) { name: "Jason", age: 42 })
(def id (x: @a) -> @a
  x)

We'll cover additional language features, including classes and objects, pattern matching, and macros as we implement them. I'm especially excited to get to macros with you – implementing them is a real brain bender but they're pretty amazing once you get them working.

Like other Lisp-style languages, the s-expression syntax maps directly onto an in-language data structure in the reader. Wanda will then have an additional parsing step where the fully expanded tree produced by the reader (including expansion of any macros) is transformed into an object-based AST we can use for semantic analysis and additional compiler passes in the back end. We could just as easily output code based on the reader tree, but I'm using a full AST for learning purposes and because it's easier to do certain operations on an AST.

The Implementation Language

JavaScript is the implementation language for Wanda. I've chosen to write the compiler in JavaScript largely because it's an easy language to learn and most people who come across this series probably already know at least a little JavaScript. It's fairly easy to hack out simple projects in JavaScript, and I wanted to make this series as accessible as possible to as many people as I could.

An Incremental Approach

I've loosely based this series on Abdulaziz Ghuloum's seminal paper "An Incremental Approach to Compiler Construction" in that each article will end with a fully-functional compiler. We'll start out by compiling and evaluating numbers, and end up with a fully functional language that should hopefully be capable of supporting a wide range of interesting programs. It won't exactly be production grade in all aspects, but my goal is to make it interesting and useful for writing real programs and applications so you can see what it's like to implement a complete programming language that is more than just a toy.

I'll also cover some interesting and useful algorithms and data structures throughout the series that you may find helpful even if you never write another compiler.

Upcoming Articles

Right now the plan is to write articles implementing several features into the language, including:

Numbers
Additional primitive types (e.g. strings, booleans)
Cons pairs and lists
Variables
Vectors and Maps
Conditional expressions
Functions
Iteration
Objects and classes
Async
Exceptions and exception handling
Variants and pattern matching
Macros
Modules
Data flow analysis
AST transformations and optimization
A fully-featured CLI
Gradual typing (optional static typing)

What You Won't Find in this Series

You won't find a lot of programming language theory in this series, so if you're looking to learn about things like the role of deterministic finite automata in regular languages, context-free grammars, parsing algorithms, formal semantics, and so on, you'll need to consult a good book to learn about those.

There's nothing wrong with learning about that stuff.

Indeed, I'd say it's important to do so if you have an abiding interest in programming languages, but it's not the focus of this particular series.

I'm going to focus on implementation of a compiler with occasional notes about semantics and language design.

The goal is to produce something more in the spirit of Jack Crenshaw's classic Let's Build A Compiler series, though I wouldn't dream of putting my work on the same level as his.

I'll also cover some computer science topics along the way as they're relevant to what we're building, but as a means to help you better understand what we're building.

Inspiration

I've been blessed with the opportunity to use a number of excellent resources along my programming language journey, including a number of excellent books and courses, but Dmitry Soshnikov's series of programming language courses deserves special mention. He has courses on both theory and implementation, including building a parser, building interpreters and virtual machines, and even compiling a language to LLVM intermediate code. I wouldn't be where I am today without his materials and I highly recommend you check them out if you have further interest in the topic.

Conclusion

I hope you're as excited about this foray into language implementation as I am! The next entry in the series will present our first compiler, supporting numbers. It will also build up a lot of the boilerplate we'll need to compile more interesting expressions. Stay tuned!

DEV Community