One fine evening in the first lockdown of 2020, there I was, sitting idly, not knowing what to do.
Calmly freaking out.
You see, I really needed something to do. I had been doing a few web-related projects on the side, and that was something I didn't want to do any more, at least for a while. So I looked into doing something "closer to the metal", something much lower level than sending requests back and forth to a web server. I quickly fired up Learn X by doing Y and searched for something interesting, eventually ending up on Build Your Own Lisp (we all have a Lisp phase, it was just my turn).
For the uninitiated, Lisp is (simply put) a family of programming languages whose syntax looks more or less like this (You can quickly see where the parentheses jokes stem from).
(defun area-circle (rad)
  "Calculates area of a circle with given radius"
  (terpri)
  (format t "Radius: ~5f" rad)
  (format t "~%Area: ~10f" (* 3.141592 rad rad)))
(area-circle 10)
But I didn't just want to create any Lisp, I wanted to make some modifications as per my preference. So there was only one way to go about doing this.
Coming up with a syntax
What got me into Lisps and writing Lisp was (surprisingly) the Emacs text editor (or Emacs OS for the hardcore users). What was I doing going anywhere even remotely near Emacs? I don't know, I'll tell you some other day. You see, writing Emacs Lisp is the recommended way of configuring Emacs. Even though I didn't write a lot of Emacs Lisp, a few of the keywords I saw there didn't feel very natural to me (especially coming from Ruby). So one of the major goals would be to have more "natural"-sounding keywords.
You can't talk about Lisp without talking about the parentheses. Other languages have a big readability advantage in that not everything sits inside parentheses: square brackets usually denote an array, braces usually denote code blocks, and parentheses are used for function calls. That makes code easier to grep visually; not so much in Lisp, especially as a beginner. So I needed to reduce the usage of parentheses, but I couldn't just change all of the syntax, because then it might not stay a Lisp any more (Ship of Theseus, anyone?). I think Clojure strikes a good balance between staying a Lisp and improving visual grepping. So another goal would be to reduce the reliance on parentheses so that the code can be more readable.
'(1 2 3) ; list
[1 2 3] ; vector
#{1 2 3} ; set
{:a 1, :b 2} ; map
(Example from Clojure syntax guide)
So it began
After setting all these requirements for the language, I set out to finally start building it. I chose Rust because it seemed like the right tool for the job, thanks to its speed and memory safety. I was building my own programming language, I felt like a "real programmer" (no such thing btw), so naturally I started tweeting about it.
Starting simple
I wanted to start off with something as simple as possible, like adding a few numbers. It felt like the perfect starting point. I'd be evaluating a simple arithmetic expression, for example 2 * 3 + 1 - 4, which would be written as (- (+ 1 (* 2 3)) 4) in Lisp. To the unaware, this is Polish (prefix) notation; just think of each parenthesised group as a function call with multiple arguments. (+ 1 2 3) is simply the sum of 1, 2 and 3. Its result can then be used as an argument to a parent expression, which then gets evaluated. Similarly, the whole expression can be solved using the method below.
(Source: Crafting Interpreters)
After building the initial prototype, I tested it with (+ 2 (- 4 6) 8). Working it out manually, that's (4 - 6) + 2 + 8 = 8, and if you look at the bottom red rectangle, the result is indeed 8.
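The evaluation strategy described above can be sketched in a few lines of Rust. This is illustrative only, not the prototype's actual code; the `Expr` and `eval` names are my own:

```rust
// A prefix expression is either a number or an operator applied to
// sub-expressions, mirroring the nesting of the parentheses.
enum Expr {
    Num(f64),
    Op(char, Vec<Expr>),
}

// Evaluate recursively: first reduce every argument to a number,
// then apply the operator to the results.
fn eval(expr: &Expr) -> f64 {
    match expr {
        Expr::Num(n) => *n,
        Expr::Op(op, args) => {
            let vals: Vec<f64> = args.iter().map(eval).collect();
            match op {
                '+' => vals.iter().sum(),
                '*' => vals.iter().product(),
                '-' => vals[1..].iter().fold(vals[0], |acc, v| acc - v),
                _ => panic!("unknown operator {op}"),
            }
        }
    }
}
```

Building `(+ 2 (- 4 6) 8)` as `Expr::Op('+', vec![Expr::Num(2.0), Expr::Op('-', vec![Expr::Num(4.0), Expr::Num(6.0)]), Expr::Num(8.0)])` and passing it to `eval` gives back `8.0`, matching the manual calculation.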
Laying the building blocks
Now that we had something to play around with, it was time to grow up and lay the foundations for a proper compiler. The first step for any compiler is to read the source file and turn it into a stream of machine-readable objects called tokens. This step is called lexical analysis, or lexing. It maps the lexemes of the language (parentheses, keywords, numbers, and so on) to Token objects. Like so:
| Lexeme | Token |
|---|---|
| `(` | `Token { kind: OpenParen, value: None }` |
| `let` | `Token { kind: Identifier, value: Some("let") }` |
| `-?([0-9])+` | `Token { kind: Number, value: Some(getVal()) }` |
In this way you can build a mapping from lexemes to tokens of various kinds, so that it's easier to build your syntax tree in the next step.
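A minimal lexer along these lines can be written as a single pass over the characters. This is a sketch, not the compiler's actual code (it skips negative numbers and other details from the table above):

```rust
#[derive(Debug, PartialEq)]
enum TokenKind { OpenParen, CloseParen, Identifier, Number }

#[derive(Debug, PartialEq)]
struct Token { kind: TokenKind, value: Option<String> }

// Walk the source once, peeking at each character to decide which
// kind of lexeme starts here, then consuming the whole lexeme.
fn lex(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            '(' => { chars.next(); tokens.push(Token { kind: TokenKind::OpenParen, value: None }); }
            ')' => { chars.next(); tokens.push(Token { kind: TokenKind::CloseParen, value: None }); }
            c if c.is_whitespace() => { chars.next(); }
            c if c.is_ascii_digit() => {
                // Consume a run of digits as one Number token.
                let mut num = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_ascii_digit() { num.push(d); chars.next(); } else { break; }
                }
                tokens.push(Token { kind: TokenKind::Number, value: Some(num) });
            }
            _ => {
                // Anything else up to whitespace or a paren is an identifier.
                let mut ident = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_whitespace() || d == '(' || d == ')' { break; }
                    ident.push(d); chars.next();
                }
                tokens.push(Token { kind: TokenKind::Identifier, value: Some(ident) });
            }
        }
    }
    tokens
}
```

Lexing `(let x 42)` with this sketch produces five tokens: an open paren, the identifiers `let` and `x`, the number `42`, and a close paren.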
Cool, now that I could generate tokens, it was time to build on top of this abstraction and give the stream of tokens a proper structure that's easier to work with. We're going to examine every token and make sense of it all by generating a syntax tree. This step is called parsing, and at its end we're going to have something like this:
(Source: Abstract Syntax tree:Wikipedia)
So I went about building a syntax tree, starting by defining a basic grammar in the form of an enum, which looked something like this:
pub enum Expr {
    Constant(Value),
    Builtin(Ident),
    Call(Box<Expr>, Vec<Expr>),
    While {
        condition: Box<Expr>,
        body: Vec<Expr>,
    },
}
Reading the source code of the Rust programming language was very helpful and I might have stolen some concepts from it. Let me just quickly define what these types are:
- Constant: any literal of type `Value` (number, string, boolean)
- Builtin: any builtin token, like an operator (`+`, `-`, `*`, `/`) or a keyword (`let`, `print`, `if`)
- Call: this is simple, any function call (truth be told, everything should just be a function call, since Lisp is a functional language)
- While: a while loop (duh)
(If you're wondering, Box is Rust's built-in type for pointers, broadly speaking.)
This is the gist of it; the rest of the process was just finding ways to build this syntax tree from the token stream I had. Slowly but surely, I was making progress.
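The nice thing about Lisp is that the parentheses make parsing almost mechanical. Here's a sketch of the idea, with my own simplified `Expr` and `Token` types rather than the project's actual ones: each `(` starts a `Call` whose head is the first inner expression and whose arguments run until the matching `)`:

```rust
#[derive(Debug, PartialEq)]
enum Expr {
    Number(f64),
    Builtin(String),
    Call(Box<Expr>, Vec<Expr>),
}

// Parse one expression starting at `pos` in the token list
// (tokens simplified to strings here); return the expression
// and the position just past it.
fn parse(tokens: &[String], pos: usize) -> (Expr, usize) {
    if tokens[pos] == "(" {
        // A call: the head is the operator, the tail is its arguments.
        let (head, mut pos) = parse(tokens, pos + 1);
        let mut args = Vec::new();
        while tokens[pos] != ")" {
            let (arg, next) = parse(tokens, pos);
            args.push(arg);
            pos = next;
        }
        (Expr::Call(Box::new(head), args), pos + 1) // skip the ")"
    } else if let Ok(n) = tokens[pos].parse::<f64>() {
        (Expr::Number(n), pos + 1)
    } else {
        (Expr::Builtin(tokens[pos].clone()), pos + 1)
    }
}
```

Feeding it the tokens for `(+ 1 2)` yields `Call(Builtin("+"), [Number(1.0), Number(2.0)])`, and the recursion handles nesting like `(- (+ 1 (* 2 3)) 4)` for free.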
At this point, the syntax tree felt robust enough for what I wanted the language to do for now. It was time to move on to bigger and better things.
LLVM
Low..Level..Virtual Machine? Good guess, but not anymore (the project outgrew the name, and LLVM is no longer an acronym). What is LLVM? Let's first take a step back and see why it's needed in the first place. You see, a compiler has two sides, a frontend and a backend: the former contains lexing and parsing, while the latter handles intermediate code generation, optimisation and target code generation. What I'd built until now was just the frontend, and I had the internal representation that can be used by the backend.
To build the whole backend from scratch, here's what I would have to do. First, limit myself to a single architecture, since supporting multiple processor architectures was not an option, simply because of the intricacies of how processors work. After selecting the most popular processor architecture, I'd need to get a good grasp of its assembly language. Because of the different ways microprocessors are designed and made, they all have different instruction sets. To support an instruction set, you need to take the code the user has written in your language and translate it into the instructions of that specific set. Processors can work in weird ways, sometimes for compatibility reasons and sometimes just because they can, so when you optimise your generated code, you have to take all that into account. Say you do manage to support one architecture successfully, but then there's (somehow) popular demand for your language on some other architecture: you'd have to repeat the whole process, and then again for the next architecture, and so on. Don't even get me started on how you would debug errors that only occur on some architectures. It would be a huge mess.
This is where LLVM comes in: it provides a convenient abstraction layer over all the instruction sets you can target. You only need to worry about translating to its internal representation (LLVM IR). LLVM then handles optimising that IR and generating code for the instruction set you're targeting. It's still a good idea to optimise your own representation first and then hand it to LLVM, instead of expecting LLVM to do it all.
So to use LLVM, all I needed to do was take my AST and convert it into LLVM IR. I did this using Inkwell, an LLVM API for Rust. It wraps the C API that LLVM provides and, on top of that, brings the memory safety guaranteed by Rust. Oh, and this is what LLVM IR looks like in text format (don't try to understand it, feel it):
; ModuleID = 'example'
source_filename = "/home/faraaz/hello.tp"
@format_string = private unnamed_addr constant [4 x i8] c"%f \00", align 1
define i32 @main(i32 %0) {
entry:
%num = alloca double
store double 4.200000e+02, double* %num
%num1 = load double, double* %num
%printf = call double (i8*, ...) @printf(i8* getelementptr inbounds ([4 x i8], [4 x i8]* @format_string, i32 0, i32 0), double %num1)
ret i32 0
}
declare double @printf(i8*, ...)
Code in LLVM IR that sets a variable to 420 and prints it
Making it all legit
Until now, executing the compiler's binary did only one thing, and it wasn't very configurable or helpful for someone using it for the first time. I wanted it to have a proper Command Line Interface (CLI), something like this:
So I used the Clap crate (aka package) to build the CLI and I think it turned out all right.
Bringing it all together
Now that I had the compiler set up, and the language had features like storing variables, performing arithmetic, looping (nested loops didn't work yet) and printing to the screen, it was time to write a small program with it. Here's me writing a program to compute the first 5 Fibonacci numbers in Tisp, then compiling and running it:
Here's the project if you want to check it out yourself.
faraazahmad / tisp
A toy programming language based on Lisp and built in Rust & LLVM
Tisp (Toy Lisp)
A Lisp-like programming language that is typed and compiled. It aims to support multiple processor architectures by being built upon LLVM. It takes inspiration from programming languages like Rust, Lisp and Elixir.
Current working example
A program to compute the first 5 Fibonacci numbers:
(let first 0)
(let second 1)
(let fib)
(let n 0)
(while (< n 5)
(let fib (+ first second))
(let second first)
(let first fib)
(let n (+ n 1))
(print fib)
)
Features to build
- Convert raw code into token stream
- Convert token stream into Expression tree
- Handle multiple levels of nested expressions
- Have multiple (independent) expressions per file
- Generate LLVM IR for the currently supported features
- add CLI flag to emit llvm
- add while loop
- Declare variables
- add…
Here's what I learned
Programming languages are hard. It's hard to read code in the first place; to understand and learn a language is way harder. One can only imagine how hard it is to build one, let alone build one that is widely used. This project was mostly me trying to answer the question "How hard can it really be?", and I guess I have my answer now. But it's also kinda easy, low-key. Once you dip your toes in the hard parts, the next time you do it, it feels a little easier. Every day you do something hard, it gets a little easier; but you gotta do it every day, and that's the hard part.
Top comments (22)
I remember learning compiler and language theory back in the 80's using the Aho and Ullman "Dragon Book". Languages were still very much a "black art" back then with institutional memory on the development of FORTRAN still around (which was an amazing feat in and of itself as there was no formal understanding of the design of computer languages when FORTRAN was created).
I could have written that. :) The Dragon book and Fortran - yes.
And of course I had to write a LISP interpreter.
Yeah, that is still the gold standard IMO. I used that in college as well. I think LLVM has been a massive game changer in enabling normies like me to create small toy languages for themselves.
Kudos 👏! I'm working on something similar myself and have seen and overcome many of the same challenges. Making a language is very rewarding IMO.
All this said, I have to give you extra 👍 for the movie quote (feeling, not understanding); it somehow fits the narrative perfectly 😉
The final BoJack scene is just perfect 💪
I'm glad you like it! I let this post marinate in the drafts for a while making small changes until it felt cohesive and entertaining enough to post. Looking forward to your post!
I'm not looking to write a post, am far more ambitious for "just" a post 😉
The plan is to create several video courses on how to build a language... we'll see how that goes 😁
Awesome! If you need any ideas, @bellmar has an ongoing podcast series on making a programming language of her own at dev.to/mwapl
I may not be looking for ideas, but a fellow language engineer’s work is practically always interesting. Thanks for the link!
My background includes code conversion utilities. Thus you see source code strings and literals embedded in source code. The parser in particular.
This can be both hard to read and potentially cause bugs were it examining itself in a self-referential manner.
I would use llvm now, this was done in the mid-80's.
Re that whole I should have mentioned how glad I am you are in software and not nuclear physics. ;-) Boredom often drives creativity.
Would be interesting to read source code of parsers from back then
I'm back. lol. What was really interesting was the release distribution software. It worked with source code converted for various platform variants. PICK Databasic.
Early use of modem required that remote updates were tiny in size. Like GitHub only the changes were distributed, applied and compiled locally. 1980's was when it started.
I created early (line) fault tolerant communications software to transmit the updates.
Man, how bored were you? :D JK, great article 👏
Really really bored! :D
thanks though!
Nice writeup !
Very cool seeing the whole process laid out like this!
Thanks! I was trying to make a series of posts on it, but I realised that would feel more like a tutorial and there are already a lot of those around. So I put it all in one post and tried to make it interesting; hopefully it's not boring.
Wonderful! I liked how you came up with an idea and executed it.
I will be featuring your article in my newsletter. :')
Awesome! I'm glad I can provide good content
Thank you!
Nice job!