Jason Barr

Posted on May 24, 2023 • Updated on Jun 11, 2023

Create Your Own Programming Language 1: Numbers

#programminglanguages #compilers #interpreters #javascript

In today's post we're going to scaffold out the compiler for Wanda and add numbers to the language.

Numbers will be represented by the JavaScript number type, which is a 64-bit floating point number.

Floating Point Numbers vs. Decimals

If you're not sure what that means, here's the tl;dr: the convention in regular use is to use decimal numbers, i.e. base 10. Computers can only deal with binary numbers, i.e. base 2. That means the numeral 10 in decimal means 10, whereas the numeral 10 in binary means 2.

Floating point numbers are a standard form used for representing decimal numbers in a computer's memory that deal with the issues that come with trying to convert base 10 numbers to base 2 numbers.

That means floating point numbers don't always work the exact same way as the decimals you're used to. For example, in decimal numbers 0.1 * 3 always equals 0.3. With floating point numbers, 0.1 * 3 is almost but not quite equal to 0.3 due to differences between rounding decimal numbers vs. rounding binary numbers.

To be exact, in floating point numbers 0.1 * 3 is 0.30000000000000004.

Some languages abstract these differences away by using BigDecimal numbers, but that would take away from our basic goal which is just to implement a language so we're not going to do it with Wanda. We'll stick to regular, old JavaScript numbers.

I am working on another language that does use BigDecimals, and I'll update you when there's something to show with that.

The Stages of Our Compiler

To recap the intro article, the compiler has several parts. Our compiler will have the following stages as it transforms a string of Wanda source code into its JavaScript output:

Lexing
Reading (producing an s-expression tree)
Expanding
Parsing (producing an Abstract Syntax Tree (AST))
Desugaring
Emitting

Note that later in the series we'll also add a stage for semantic analysis after the parsing stage.

In addition to compiling numbers, we're also going to start building out a CLI interface to the compiler with a REPL (Read-Eval-Print Loop) that will allow us to compile and evaluate code in a single step.

Most of the heavy lifting today will be done in the lexing phase, reading numbers from the text input into tokens. In fact, the expanding and desugaring phases won't do anything at all; they'll just take their input data structure and return it straight away. I'm just including them now so we don't have to add them to the pipeline later.

Project Setup

I use Prettier for code formatting, with the most basic .prettierrc config possible:

{}

You can run npm init if you want to initialize the project as a package. I have mine scoped to my name, @jasonsbarr/wanda. Note that I'm using Node v20.0.0, but you shouldn't need a brand new version of Node for the code to work.

Make sure that in your package.json file you have the directive "type": "module" so ES2015 modules work as specified.

All code is in the src directory unless specified otherwise.

Lexing

The lexer takes an input string of source code and transforms it into tokens that represent lexemes, which are valid bits of text that can be transformed into syntax objects as part of a tree. It takes in a string and outputs an array of token objects.

A token holds 3 bits of information: the token type, its text value, and an object that contains source location (srcloc) info: file, line, column, and position.

Source Location Info

Let's start by creating the SrcLoc class for the srcloc info. Create a new directory lexer in the src directory and put this in SrcLoc.js:

export class SrcLoc {
  constructor(pos, line, col, file) {
    this.pos = pos;
    this.line = line;
    this.col = col;
    this.file = file;
  }

  static new(pos, line, col, file) {
    return new SrcLoc(pos, line, col, file);
  }
}

Note that I add a static new method to all my classes as a personal preference so creating new objects is just a method call instead of a new expression. That's purely my preference; you can do the same with your classes if you like or leave it off. It's up to you.

The Token Type

In the lexer folder let's add TokenTypes.js for an enum-like object to represent token types. "Number" is the only token type for now:

export const TokenTypes = {
  Number: "Number",
};

Now let's add the Token class in src/lexer/Token.js:

export class Token {
  constructor(type, value, srcloc) {
    this.type = type;
    this.value = value;
    this.srcloc = srcloc;
  }

  static new(type, value, srcloc) {
    return new Token(type, value, srcloc);
  }
}

Helper Functions

Before we get into the lexer proper, we need 2 more things for support:

A set of helper functions to detect certain characters
An object to manage the input string as we traverse it and chunk it into tokens

First, the helper functions (in src/lexer/utils.js). We need to detect dots, digits, whitespace, semicolons (comments begin with a semicolon and continue to the end of the line), newlines, and dashes (minus signs), plus a function for checking if a numeric string is a well-formed number:

// Character matchers
export const isDot = (ch) => /\./.test(ch);
export const isDigit = (ch) => /\d/.test(ch);
export const isWhitespace = (ch) => /\s/.test(ch);
export const isSemicolon = (ch) => /;/.test(ch);
export const isNewline = (ch) => /\n/.test(ch);
export const isDash = (ch) => /\-/.test(ch);

// String matchers
export const isNumber = (str) => /^[+-]?\d+(\.\d+)?$/.test(str);

The InputStream Class

Now a class for the InputStream. This class manages the internal state of the input as the lexer traverses it and breaks it up into tokens.

The InputStream class has 5 methods: eof, lookahead, next, peek, and readWhile.

eof checks to see if the current position is at or past the end of the input string
lookahead gets the character at the position n spots ahead of the current position
next gets the character at the current position and advances the input by 1
peek gets the character at the current position, but does not advance the input
readWhile consumes every character in the input while a condition is true

src/lexer/InputStream.js:

import { isNewline } from "./utils.js";

export class InputStream {
  constructor(input, file) {
    this.input = input;
    this.file = file;
    this.pos = 0;
    this.line = 1;
    this.col = 1;
  }

  static new(input, file) {
    return new InputStream(input, file);
  }

  get length() {
    return this.input.length;
  }

  eof() {
    return this.pos >= this.length;
  }

  lookahead(n = 1) {
    return this.input[this.pos + n];
  }

  next() {
    const ch = this.input[this.pos++];

    if (isNewline(ch)) {
      this.line++;
      this.col = 1;
    } else {
      this.col++;
    }

    return ch;
  }

  peek() {
    return this.input[this.pos];
  }

  readWhile(pred) {
    let str = "";
    while (pred(this.peek())) {
      str += this.next();
    }

    return str;
  }
}

In-Language Exceptions

Let's add a SyntaxException class that inherits from a base Exception class in the language, which just inherits from the built-in JavaScript Error class. In src/shared/exceptions.js:

export class Exception extends Error {
  constructor(msg) {
    super(msg);
  }
}

export class SyntaxException extends Exception {
  constructor(value, srcloc) {
    super(
      `Syntax Exception: invalid syntax ${value} found at ${srcloc.file} (${srcloc.line}:${srcloc.col})`
    );
  }
}

Now we have the machinery set up so we can create the lexer itself.

The Lexer

We'll create a Lexer class that holds an InputStream object and has methods for tokenizing the input.

The main tokenize method creates an array for tokens, then has a loop that continues as long as we're not at the end of the input. Inside the loop we peek at the current character then perform an action based on it.

We'll skip whitespace and comments, and the only token we're reading right now is a number token. We'll do this with the readNumber method.

First, we import these functions and types into src/lexer/Lexer.js:

import { SyntaxException } from "../shared/exceptions.js";
import { SrcLoc } from "./SrcLoc.js";
import { Token } from "./Token.js";
import { TokenTypes } from "./TokenTypes.js";
import {
  isDash,
  isDigit,
  isDot,
  isNewline,
  isNumber,
  isSemicolon,
  isWhitespace,
} from "./utils.js";

The Lexer class constructor is simple:

export class Lexer {
    constructor(input) {
    this.input = input;
  }

  static new(input) {
    return new Lexer(input);
  }
}

For the readNumber method we first get the srcpos info off the input, then check to see if there's a minus sign. If there's a minus sign, add it to the number token. Then read any ensuing numbers and decimal points into the token value. We need to be able to tell if it's a valid or invalid number token, so we're just going to take all the numbers and dots in until there are no more left.

Then, if it's not a valid number, throw a SyntaxException. Otherwise, we create the number token.

In the Lexer class, beneath the new method:

  readNumber() {
    let { pos, line, col, file } = this.input;
    const srcloc = SrcLoc.new(pos, line, col, file);
    let num = "";

    if (isDash(this.input.peek())) {
      num += this.input.next();
    }

    num += this.input.readWhile((ch) => isDigit(ch) || isDot(ch));

    if (!isNumber(num)) {
      throw new SyntaxException(num, srcloc);
    }

    return Token.new(TokenTypes.Number, num, srcloc);
  }

The tokenize method runs a loop and dispatches on the current character. Right now all we're doing is skipping whitespace and comments, and reading number tokens. Anything else will cause a SyntaxException.

In the Lexer class below the readNumber method:

  tokenize() {
    let tokens = [];

    while (!this.input.eof()) {
      let ch = this.input.peek();
      if (isWhitespace(ch)) {
        this.input.readWhile(isWhitespace);
      } else if (isSemicolon(ch)) {
        this.input.readWhile((ch) => !isNewline(ch) && !this.input.eof());
      } else if (isDash(ch) && isDigit(this.input.lookahead(1))) {
        tokens.push(this.readNumber());
      } else if (isDigit(ch)) {
        tokens.push(this.readNumber());
      } else {
        const { pos, line, col, file } = this.input;
        throw new SyntaxException(ch, SrcLoc.new(pos, line, col, file));
      }
    }

    return tokens;
  }

That's all for the Lexer class for now! Now we'll abstract it away into a tokenize function in src/lexer/tokenize.js:

import { InputStream } from "./InputStream.js";
import { Lexer } from "./Lexer.js";

export const tokenize = (input, file) =>
  Lexer.new(InputStream.new(input, file)).tokenize();

Next we'll move onto the reader.

The Reader

"Reader" is the Lisp equivalent of a parser. A conventional reader dispatches based on the current character, similar to how our lexer works, and reads the input stream into a tree of s-expression data structures that are homoiconic with the source code. That means the code and data can essentially be substituted for one another in the programmer's imagination. The underlying data is the same as the outward representation in text.

Our compiler has both a reader and a parser for pedagogical reasons. We'll read the input into s-expressions so we can have all the power and elegance of Lisp macros, but we'll then parse that tree into a fuller AST so I can show you some additional compiler techniques that will help us make the language more interesting.

Technically we could have done all this with the s-expression tree produced by the reader and expander, but it's a little easier to do with a fuller AST and we're prioritizing understandability over efficiency here.

First we'll create a Reader class that manages the token stream in similar fashion to how the InputStream class manages the input string.

The Reader class has 4 methods:

eof, which checks to see if we're at the end of the token stream
next, which returns the current token and advances the stream by 1
peek, which returns the token at the current position
skip, which skips the current position without returning a token

In src/reader/Reader.js:

export class Reader {
  constructor(tokens) {
    this.tokens = tokens;
    this.pos = 0;
  }

  static new(tokens) {
    return new Reader(tokens);
  }

  get length() {
    return this.tokens.length;
  }

  eof() {
    return this.pos >= this.length;
  }

  next() {
    return this.tokens[this.pos++];
  }

  peek() {
    return this.tokens[this.pos];
  }

  skip() {
    this.pos++;
  }
}

The read functions take the Reader as a parameter and read syntactic forms from the token stream. Since we're only reading numbers right now, the read functions are very similar.

The main read function currently returns an array of forms. The only form we're reading right now is a number token. In the future, we'll change the data structure created by read from an array to an s-expression made up of linked lists using cons cells. If you don't know what that means, don't worry about it—I'll show you exactly what that means in a future post. We'll also add additional syntactic forms besides plain tokens and lists.

The read function dispatches to the readForm function, which reads forms from the current position in the token stream. Right now our only form is the number token, which is an atomic form, so readForm just passes the Reader through to readAtom. This will change soon.

In src/reader/read.js, import these types:

import { TokenTypes } from "../lexer/TokenTypes.js";
import { SyntaxException } from "../shared/exceptions.js";
import { Reader } from "./Reader.js";

Here's the readAtom function in its current state:

const readAtom = (reader) => {
  const tok = reader.peek();

  switch (tok.type) {
    case TokenTypes.Number:
      reader.skip();
      return tok;
    default:
      throw new SyntaxException(tok.value, tok.srcloc);
  }
};

As mentioned above, readForm currently just delegates to readAtom:

const readForm = (reader) => {
  return readAtom(reader);
};

And read processes the whole token stream:

export const read = (tokens) => {
  const reader = Reader.new(tokens);
  let parseTree = [];

  while (!reader.eof()) {
    parseTree.push(readForm(reader));
  }

  return parseTree;
};

That's the current iteration of the reader! Soon we'll add additional forms for it to read, including some reader macros that will transform the token stream as it reads it.

The Expander

The expander is simple, since there's nothing yet to expand. It merely takes the output of the reader and returns it. In src/expander/expand.js:

export const expand = (parseTree) => parseTree;

The Parser

Unlike most Lisp languages, this one has a separate parsing stage that converts the s-expression tree into a JSON-compatible AST made up of object nodes.

Since we're only compiling numbers for now, there are only 2 AST nodes: Program and NumberLiteral. I'm writing the AST module as an object of node constructor methods, rather than as a set of classes.

Every node has the properties type and srcloc that contain the node type and srcloc info, respectively. Different node types have different properties based on the data needed to represent the language form they signify.

The Program node has a body property that is an array of nodes that make up the program body. NumberLiteral has a value property that is the text value from the number token.

In src/parser/ast.js:

export const ASTTypes = {
  Program: "Program",
  NumberLiteral: "NumberLiteral",
};

export const AST = {
  Program(exprs) {
    return {
      type: ASTTypes.Program,
      body: exprs,
      srcloc: exprs[0]?.srcloc ?? SrcLoc.new(0, 0, 0, "none"),
    };
  },
  NumberLiteral(token) {
    return {
      type: ASTTypes.NumberLiteral,
      value: token.value,
      srcloc: token.srcloc,
    };
  },
};

The parser also uses the Reader class from the reader to manage its input state. This will change when we edit the reader to output a cons data structure instead of a JavaScript array, but it's good enough for now.

In src/parser/parse.js, import the following:

import { TokenTypes } from "../lexer/TokenTypes.js";
import { SyntaxException } from "../shared/exceptions.js";
import { AST } from "./ast.js";
import { Reader } from "../reader/Reader.js";

Like with the reader, we have a main parse function that dispatches to the parseExpr function. Since our only expression type right now is numbers, all parseExpr currently does is dispatch to parsePrimitive.

The parser will change quite a bit when we start returning a linked list from the reader instead of an array, but this will do for now:

const parsePrimitive = (reader) => {
  const token = reader.peek();

  switch (token.type) {
    case TokenTypes.Number:
      reader.skip();
      return AST.NumberLiteral(token);
    default:
      throw new SyntaxException(token.value, token.srcloc);
  }
};

const parseExpr = (reader) => {
  return parsePrimitive(reader);
};

export const parse = (readTree) => {
  let body = [];
  const reader = Reader.new(readTree);

  while (!reader.eof()) {
    body.push(parseExpr(reader));
  }

  return AST.Program(body);
};

That's it for the parser. In languages with more complicated syntax, parsing is a huge and complicated undertaking. Here, it's pretty simple. Even as we add more forms to the language that require parsing, it will still be much simpler for us than it would be in a language with more complicated syntax.

If there's interest, I may write a tutorial for a more complex parser in the future. Let me know in the comments if that's something you'd like to see!

The Desugarer

The desugarer is the first stage in the back end of the compiler. It takes higher, more complex forms in the language and converts them into simpler core forms. Sometimes you'll see this process called "lowering."

Like the expander above, the desugarer has nothing to do right now since we only have a single form in our language. That will change in the near future. For now, just put this in src/desugarer/desugar.js:

export const desugar = (ast) => ast;

And with that, we're ready to build the code emitter!

The Emitter

The emitter for our language is going to be fairly simple as well. It simply walks the desugared AST and emits code based on the nodes.

It uses a design pattern called the Visitor Pattern to "visit" each node of the tree and perform an action based on the node type. In this case, the action is emitting the code.

The emitter uses postorder traversal to visit the nodes, meaning it acts on the children at the bottom of the hierarchy before it acts on their parent nodes.

The Dispatcher Method

The emit method dispatches node-specific emitter methods depending on the node type. It uses a switch statement on the node.type property.

Since the only node types we have right now are Program and NumberLiteral, those are the only nodes we have emitters for.

The default case in emit throws a SyntaxException, though this is only for completeness to make sure we remember to implement emitter methods for every core node type.

In src/emitter/Emitter.js:

import { ASTTypes } from "../parser/ast.js";
import { SyntaxException } from "../shared/exceptions.js";

export class Emitter {
  constructor(program) {
    this.program = program;
  }

  static new(program) {
    return new Emitter(program);
  }

  emit(node = this.program) {
    switch (node.type) {
      case ASTTypes.Program:
        return this.emitProgram(node);
      case ASTTypes.NumberLiteral:
        return this.emitNumber(node);
      default:
        throw new SyntaxException(node.type, node.srcloc);
    }
  }

  emitNumber(node) {
    return node.value;
  }

  emitProgram(node) {
    let code = "";

    for (let n of node.body) {
      code += this.emit(n);
    }

    return code;
  }
}

Then in src/emitter/emit.js:

import { Emitter } from "./Emitter.js";

export const emit = (ast) => Emitter.new(ast).emit();

With that, we can create the whole compilation pipeline from Wanda input to JavaScript output.

The CLI

We're not going to build out a whole big CLI application just yet. I'll dedicate a whole article to that in the future. It would be nice to have some streamlined functions to let us compile and run the code though, as well as a REPL so we can run test programs.

Let's create a simple compilation pipeline in src/cli/compile.js:

import { tokenize } from "../lexer/tokenize.js";
import { read } from "../reader/read.js";
import { expand } from "../expander/expand.js";
import { parse } from "../parser/parse.js";
import { desugar } from "../desugarer/desugar.js";
import { emit } from "../emitter/emit.js";

export const compile = (input, file = "stdin") =>
  emit(desugar(parse(expand(read(tokenize(input, file))))));

Then an evaluator function using the Node.js vm module in src/cli/eval.js:

import vm from "vm";
import { compile } from "./compile.js";

export const EVAL = (input) => vm.runInThisContext(compile(input));

Note that EVAL is in all caps because eval is a reserved word in JavaScript.

For our simple REPL we'll need a way to get input from the console. The simplest way to do that is just to install the readline-sync package from NPM:

npm install readline-sync

And here's the simplest of REPLs in src/cli/repl.js:

import readlineSync from "readline-sync";
import { EVAL } from "./eval.js";
import { print } from "../printer/print.js";

const read = (prompt) => readlineSync.question(prompt);

export const repl = () => {
  let prompt = "wanda> ";

  while (true) {
    try {
      let result = EVAL(read(prompt)) ?? "";

      if (result === "") {
        process.exit(0);
      }

      print(result);
    } catch (e) {
      console.error(e.message);
    }
  }
};

Finally, we run the CLI by simply calling the repl function in src/cli/run.js:

import { repl } from "./repl.js";

repl();

Then run the REPL with node src/cli/run.js and enter numbers to your heart's content!

Note that any input that is not a valid number (integer or floating point, exponential notation not supported) will cause a SyntaxException. We'll start fixing that in the next article in the series.

Conclusion

Whew! That was a lot. But you've come a long way – you now have a fully functional compiler and evaluator with REPL! Granted, you can only run the simplest of programs, numeric literals, but we're going to make things much more interesting as we progress through the series.

I hope you understand better now how programming languages and compilers work, and I'm looking forward to adding additional features to Wanda with you in the days ahead.

You can see the code from this article in the Wanda GitHub repo. I'll conveniently tag the code from each article so you can see how things evolve as we move along. The repo code also has JSDoc comments for most of the functions and types, though I can't promise I've remembered to document every single one. There are also tests. You can run the test suite with the npm test command.

See you next time!

DEV Community