Brett Martin

Posted on Jan 4, 2020

The First Two Weeks: A Compiler Writing Journey

#compilers #interpreters #standardml #devjournal

Photo by Nick Fewings on Unsplash

Welcome back to the first update in my compiler & interpreter journey! I want to spend some time this post to share some things I have learned about so far, as well as what I plan to tackle in the next two weeks! The first few sections will focus on Compiler / Interpreter information that I have been gathering and starting to digest. The later section will focus on my experience so far with Standard ML, the language that I will be utilizing with the book "Modern Compiler Implementation in ML"

Here is a look at the sections that I want to cover:

What Are Compilers And Interpreters?
What Are The Stages of Compilation?
Lets Check Out The "Super Tiny Compiler"!
What I Have Learned About Standard ML
Useful References And Links

What Are Compilers And Interpreters?

A compiler is a program that translates code written in one language (the source language) into another language (the target language). The target language is usually something like assembly or machine code when the compiler is creating an executable, but it can also be another high level programming language like JavaScript. The Rust compiler is an example of the former, while Babel is an example of the latter.

A key characteristic is that a compiler does its work ahead of time. You can use the compiler to generate its output and then wait to execute it at a later time. A side effect of this, is that an executable generated by the compiler can exist even if the source code is lost.

An interpreter, on the other hand, directly executes the instructions written in the source language without converting it to some target language before. It requires the source code every time it is executed!

Some examples of compiled & interpreted languages:

Compiled	Interpreted
C, C++, Rust, Standard ML, and Java	Lua, Python, and JavaScript

Now, an interesting thing is that any language can be compiled or interpreted. In fact, a lot of programming languages provide both capabilities to developers to improve the development experience. For example, if I want to write some Elixir code it will be compiled to bytecode to be run on the Erlang VM, but if I write an Elixir script or use the interactive mode REPL (Read-Evaluate-Print-Loop) it will behave more like an interpreted language.

As I will show in the next section, compilation (and interpretation) have multiple stages to it. The "front end" stages of parsing and lexing, and creating an abstract representation of the source will be very similar for compilers and interpreters. However, the "back end" stages will start to diverge as an interpreter is more concerned with immediate execution while a compiler is concerned with creating an executable.

What Are The Stages of Compilation?

Compilation is broken down into two major stages, and in those stages are smaller stages that will be of focus as I work through implementing a compiler. I hope to speak to each part in my own words as I go through them!

The two major stages are the front end, and the back end. I know, super descriptive. Similar to any other piece of software, a compiler should be designed with modularity in mind. The front end is the stage of the compiler that takes the source code written in the programming language and turns it into an intermediate representation (IR). This IR is some data representation of the program, that is independent of the source programming language itself. The back end takes the IR and does optimizations and also generates the output of the compiler into some other target language.

This is a very broad description, but it is the first division that can be seen in compiler architecture. A slightly more in depth look can be seen in a lovely picture of a mountain in Crafting Interpreters, in the "A Map of the Territory" chapter. This shows some sub-steps of the front end and back end, like scanning, parsing, analysis, and code generation. The book that I am following has twelve stages listed out! Those stages are:

Lex
Parse
Semantic Actions
Semantic Analysis
Frame Layout
Translate
Canonicalize
Instruction Selection
Control Flow Analysis
Dataflow Analysis
Register Allocation
Code Emission

If I am being honest I know what maybe two of those stages entail (lex and parse). The rest I feel like the profit gnomes in south park.

I am excited to be able to someday intelligently speak to all of these stages in the future, and I think that will be the focus of most future posts. This post is a little all over the place as I am getting my bearings and looking at various resources as I learn Standard ML. Speaking of all over the place...

Lets Check Out The "Super Tiny Compiler"!

Now that we have a bit of understanding of what a compiler is, I think it would be fun to look at a very simple compiler, The Super Tiny Compiler! This compiler is written in JavaScript and very well annotated, so I won't dive too deep into it because I encourage you to take a look at the authors great work!

Essentially this compiler takes some Lisp like function calls, and compiles them into C like function calls! Essentially taking something like (add 2 2) and converting it to add(2, 2).

What I want to take a look at is the entry point for this compiler, this function that loosely follows the lex, parse, translate, and code emission steps above:

function compiler(input) {
  let tokens = tokenizer(input);
  let ast    = parser(tokens);
  let newAst = transformer(ast);
  let output = codeGenerator(newAst);

  // and simply return the output!
  return output;
}

This takes us through the major steps: tokenization, parsing, transforming, and code generation. The first step is what finds the individual parts of the input, taking something like (add 2 2) and building a list like

const tokens = [{type: 'paren', value: '('}, {type: 'string', value: 'add'},
                {type: 'number', value: '2'}, {type: 'number', value: '2'},
                {type: 'paren', value: ')'}];

to represent the input program, with added metadata about what each token is. The next step takes care of making sure that those tokens make sense with the semantics of the language. Luckily in the case of this compiler, there are no keywords, but there are expectations of matching parens!

After creating a correct program as defined by its semantics, the transformation of the abstract syntax can take place to change its representation to a C like function call.

Finally, after the transformation is applied, the code is generated in the C like manner! Again, this is a highly simplified overview of what is in the compiler, but I encourage you to take a look at the source - pull it down and tinker with it. See if you can add something new to it. What about transforming from C -> Lisp instead?

As an aside, I think that is an interesting property of compilers and languages - they can be as simple or complicated as needed. That means that I might detour here and there as I am learning to write small, simple languages that focus on implementing the new techniques! 🤓

What I Have Learned About Standard ML

This is where I spent the majority of my last two week. Standard ML (Standard Meta Language aka SML) is a modular functional programming language that is supposedly very well suited to compiler implementation. I have been posting to the GitHub repository standard-ml-learning all of the code that I have been writing as a result of learning - feel free to check it out!

I have a couple of directories in this project, one for following the text "Programming in Standard ML", and the other is a workspace for me to do practice problems from an online course I found CS 312.

One thing I have noticed is that while the text has been useful, its easy to think that I am learning while following along with examples. Its another thing to know that I am learning by tackling small coding challenges and other more free form problems and getting the correct solution!

I am lucky to have some experience with functional programming - I try and use a functional style when it makes sense at work with JavaScript and I have dabbled in Haskell and Elixir as well. The idea of coding in a declarative is a little less harsh for me coming in with that background. Even though that is the case, I still love tripping up over little things as captured in my notes:

Ok, I have typed var and let way too many times when trying to write sml - I have to remember its val!

And even that note to myself is misleading, because there IS a keyword let as well!

One thing that I liked was the concept of sharp notation, which is the following:

   val person = ("Jim", "Bob", "Software Developer", #"A", 45)
   val fullName = (#1 person) ^ " " ^ (#2 person)
   val jobClass = #4 person

The sharp notation is a way of accessing values in an n-tuple. As you can see, it isn't the most clear what it is trying to do, but I can imagine that it could be useful in an anonymous function for mapping or something like that.

Another thing that I always love with languages like this is the ease of pattern matching.

   val person = ("Jim", "Bob", "Software Developer", #"A", 45)
   val (firstName, lastName, _, jobClass, age) = person

This will bind the individual values of the tuple to the variables firstName, lastName, jobClass, and age. Note the use of _ in the pattern matching, this ignores the field "Software Developer" and doesn't bind it to anything!

Another powerful concept is the cons operator which is written as :: and is used during list processing. Can you guess what this function does?

fun mystery [] = 1
  | mystery (x::xs) = x * mystery xs

Yeah, I cheated a bit by throwing a bunch of new syntax at you, but it takes a list of integers, like mystery [1,2,3,4,5] and returns 120 - thats all of the elements multiplied together! So, the new keyword introduced is fun, which is a function declaration. Then pattern matching is employed in a new way! The first pattern checks for the value [] which is an empty list. If there is an empty list we return 1. So calling mystery [] would result in 1. The next part of the pattern matching uses the cons operator to destructure the list provided. By doing this, we are grabbing the head and tail of the list - x is the head, xs is the tail. If the list is [1,2,3,4,5], then x = 1 and xs = [2,3,4,5]! Now, given that information hopefully the function body of x * mystery xs makes sense. We are recursively calling the mystery function with a reduced version of the initial input: 1 * mystery [2,3,4,5]. This will continue to happen until we reach the base case of [] and return 1, then the full evaluation will occur of 1 * 2 * 3 * 4 * 5 * 1, which returns 120. Awesome!

There have been a lot more topics that I have learned as well from recursion, to higher order functions, to exception handling. I don't think I am the right person to teach these things, as I have linked to the primary source of my learning, so instead I will conclude by sharing some of the code that I wrote that wasn't guided by the book - some challenges that I tackled on this site as well as some problem set code I wrote for the CS 312 course:

Daily Challenge #148 - Disemvowel Trolls

val vowels = [#"a", #"e", #"i", #"o", #"u"]
fun member_of (item, list) = List.exists (fn x => x = item) list
fun disemvowel s = implode (List.filter (fn x => not(member_of(Char.toLower x, vowels))) (explode s))

Daily Challenge #149 - Fun with Lamps

fun gen_alt (starting, next, len) = List.tabulate(len, fn x => if x mod 2 = 0 then starting else next)

fun diff ([], []) = 0
  | diff (x::xs, y::ys) = (if x = y then 0 else 1) + diff(xs, ys)
  | diff (_, _) = ~1 (* List lengths don't match for some reason *)

fun lamps [] = 0
  | lamps i = Int.min(diff(i, gen_alt(0, 1, length i)), diff(i, gen_alt(1, 0, length i)))

Answers to some parts of Problem Set 1

exception NumberFormatException

fun parseInt (s: string) : int = 
    let
        val SOME x = Int.fromString s
    in
        x
    end
    handle Bind => raise NumberFormatException


datatype tree = Node of tree list

val tt = Node([Node([Node([Node([])])]), Node([Node([]), Node([Node([])])]), Node([Node([])])])

fun treeSize (Node([])) = 1
  | treeSize (Node(x::[])) = 1 + treeSize x
  | treeSize (Node(x::xs)) = (treeSize x) + (treeSize (Node(xs)))

val correctSize = treeSize tt = 10

fun rev [] = []
  | rev (hd::tl) = rev tl @ [hd]

fun isWhitespace c = c = #" "
fun reverseWords words = 
    String.concatWith " " (rev (String.tokens isWhitespace words))

val reversed = (reverseWords "A MAN A PLAN A CANAL PANAMA") = "PANAMA CANAL A PLAN A MAN A"

Some answers to Problem Set 2 - Part 2

(* Part 2 *)
(* a *)
val product = List.foldl Int.* 1

(* b *)
fun even_odd_idx (a: 'a, (b1: 'a list, b2: 'a list)) : ('a list * 'a list) =
    if (length b1 = length b2) then
        (b1 @ [a], b2)
    else
        (b1, b2 @ [a])

(* Any way I can make this point free? *)
fun partition (l: 'a list) :  ('a list * 'a list) = List.foldl even_odd_idx ([], []) l

(* c *)
fun apply_twice_positive (i: int) = fn (f: int -> int, count: int) => if f(f i) > 0 then count + 1 else count
val count_positive_funcs = foldl (apply_twice_positive(~1)) 0

(* This returns 2! *)
val positive_count = count_positive_funcs [fn x => x + 1, fn x => x - 1, fn x => x * ~1, fn x => x*x]

One final thought on Standard ML learning - there are a lot less resources than a popular modern language! I know this is to be expected, but I was very surprised to see that only around 1800 questions had been asked on stack overflow, as opposed to the 1.9 million you see for JavaScript. I am very happy with my progress given that fact, and I am getting close to the goal of being able to fully understand a large "real world" program implemented in SML.

Useful References And Links

One resource I really want to give a shout out to is Crafting Interpreters - this is a really well written and free resource on writing an interpreter for a programming language called Lox. The first part is implemented in Java, the second part is implemented in C.

The only reason I am not using this as my primary learning text is that I wanted a more rigorous text to get started - something that dives a little more into the theory. I wouldn't be surprised to see myself reference this resource throughout my journey though!

Another place that I have been looking at is the programming languages subreddit, /r/programminglanguages I am using it as a gauge to see how much I am learning - at the moment a lot of the topics being discussed are way over my head, but I hope to start understanding the common problems discusses in programming language creation as I learn.

The subreddit also has an associated discord server for a little bit more live discussions.

Until Next Time

If you made it this far, thanks for reading! I think that moving forward I am going to move my "publish" date to a Monday so that way I can spend a bit more time organizing my thoughts. This post felt like it was a bit more of a brain dump than I wanted it to be this time around.

By next time I hope to have finished my initial learning of Standard ML so that way I can talk a bit more about implementing a specific stage of a compiler next time. I am super excited to start!

I will post the next update on January 20th, and all future updates will be on the Monday 2 week after.

If you have any corrections or clarifications to statements I have made, please drop a comment. The last thing I want is to be misleading anyone, even though this is about my journey to learn and not a tutorial.