DEV Community

Jason Barr

Posted on • Updated on

Create Your Own Programming Language 1: Numbers

In today's post we're going to scaffold out the compiler for Wanda and add numbers to the language.

Numbers will be represented by the JavaScript number type, which is a 64-bit floating point number.

Floating Point Numbers vs. Decimals

If you're not sure what that means, here's the tl;dr: the convention in regular use is to use decimal numbers, i.e. base 10. Computers can only deal with binary numbers, i.e. base 2. That means the numeral 10 in decimal means 10, whereas the numeral 10 in binary means 2.

Floating point numbers are a standard form used for representing decimal numbers in a computer's memory that deal with the issues that come with trying to convert base 10 numbers to base 2 numbers.

That means floating point numbers don't always work the exact same way as the decimals you're used to. For example, in decimal numbers 0.1 * 3 always equals 0.3. With floating point numbers, 0.1 * 3 is almost but not quite equal to 0.3 due to differences between rounding decimal numbers vs. rounding binary numbers.

To be exact, in floating point numbers 0.1 * 3 is 0.30000000000000004.

Some languages abstract these differences away by using BigDecimal numbers, but that would take away from our basic goal which is just to implement a language so we're not going to do it with Wanda. We'll stick to regular, old JavaScript numbers.

I am working on another language that does use BigDecimals, and I'll update you when there's something to show with that.

The Stages of Our Compiler

To recap the intro article, the compiler has several parts. Our compiler will have the following stages as it transforms a string of Wanda source code into its JavaScript output:

1. Lexing
2. Reading (producing an s-expression tree)
3. Expanding
4. Parsing (producing an Abstract Syntax Tree (AST))
5. Desugaring
6. Emitting

Note that later in the series we'll also add a stage for semantic analysis after the parsing stage.

In addition to compiling numbers, we're also going to start building out a CLI interface to the compiler with a REPL (Read-Eval-Print Loop) that will allow us to compile and evaluate code in a single step.

Most of the heavy lifting today will be done in the lexing phase, reading numbers from the text input into tokens. In fact, the expanding and desugaring phases won't do anything at all; they'll just take their input data structure and return it straight away. I'm just including them now so we don't have to add them to the pipeline later.

Project Setup

I use Prettier for code formatting, with the most basic `.prettierrc` config possible:

``````{}
``````

You can run `npm init` if you want to initialize the project as a package. I have mine scoped to my name, `@jasonsbarr/wanda`. Note that I'm using Node v20.0.0, but you shouldn't need a brand new version of Node for the code to work.

Make sure that in your `package.json` file you have the directive `"type": "module"` so ES2015 modules work as specified.

All code is in the `src` directory unless specified otherwise.

Lexing

The lexer takes an input string of source code and transforms it into tokens that represent lexemes, which are valid bits of text that can be transformed into syntax objects as part of a tree. It takes in a string and outputs an array of token objects.

A token holds 3 bits of information: the token type, its text value, and an object that contains source location (srcloc) info: file, line, column, and position.

Source Location Info

Let's start by creating the `SrcLoc` class for the srcloc info. Create a new directory `lexer` in the `src` directory and put this in `SrcLoc.js`:

``````export class SrcLoc {
constructor(pos, line, col, file) {
this.pos = pos;
this.line = line;
this.col = col;
this.file = file;
}

static new(pos, line, col, file) {
return new SrcLoc(pos, line, col, file);
}
}
``````

Note that I add a static `new` method to all my classes as a personal preference so creating new objects is just a method call instead of a `new` expression. That's purely my preference; you can do the same with your classes if you like or leave it off. It's up to you.

The Token Type

In the `lexer` folder let's add `TokenTypes.js` for an enum-like object to represent token types. "Number" is the only token type for now:

``````export const TokenTypes = {
Number: "Number",
};
``````

Now let's add the `Token` class in `src/lexer/Token.js`:

``````export class Token {
constructor(type, value, srcloc) {
this.type = type;
this.value = value;
this.srcloc = srcloc;
}

static new(type, value, srcloc) {
return new Token(type, value, srcloc);
}
}
``````

Helper Functions

Before we get into the lexer proper, we need 2 more things for support:

1. A set of helper functions to detect certain characters
2. An object to manage the input string as we traverse it and chunk it into tokens

First, the helper functions (in `src/lexer/utils.js`). We need to detect dots, digits, whitespace, semicolons (comments begin with a semicolon and continue to the end of the line), newlines, and dashes (minus signs), plus a function for checking if a numeric string is a well-formed number:

``````// Character matchers
export const isDot = (ch) => /\./.test(ch);
export const isDigit = (ch) => /\d/.test(ch);
export const isWhitespace = (ch) => /\s/.test(ch);
export const isSemicolon = (ch) => /;/.test(ch);
export const isNewline = (ch) => /\n/.test(ch);
export const isDash = (ch) => /\-/.test(ch);

// String matchers
export const isNumber = (str) => /^[+-]?\d+(\.\d+)?\$/.test(str);
``````

The InputStream Class

Now a class for the `InputStream`. This class manages the internal state of the input as the lexer traverses it and breaks it up into tokens.

The `InputStream` class has 5 methods: `eof`, `lookahead`, `next`, `peek`, and `readWhile`.

• `eof` checks to see if the current position is at or past the end of the input string
• `lookahead` gets the character at the position n spots ahead of the current position
• `next` gets the character at the current position and advances the input by 1
• `peek` gets the character at the current position, but does not advance the input
• `readWhile` consumes every character in the input while a condition is true

`src/lexer/InputStream.js`:

``````import { isNewline } from "./utils.js";

export class InputStream {
constructor(input, file) {
this.input = input;
this.file = file;
this.pos = 0;
this.line = 1;
this.col = 1;
}

static new(input, file) {
return new InputStream(input, file);
}

get length() {
return this.input.length;
}

eof() {
return this.pos >= this.length;
}

return this.input[this.pos + n];
}

next() {
const ch = this.input[this.pos++];

if (isNewline(ch)) {
this.line++;
this.col = 1;
} else {
this.col++;
}

return ch;
}

peek() {
return this.input[this.pos];
}

let str = "";
while (pred(this.peek())) {
str += this.next();
}

return str;
}
}
``````

In-Language Exceptions

Let's add a `SyntaxException` class that inherits from a base `Exception` class in the language, which just inherits from the built-in JavaScript `Error` class. In `src/shared/exceptions.js`:

``````export class Exception extends Error {
constructor(msg) {
super(msg);
}
}

export class SyntaxException extends Exception {
constructor(value, srcloc) {
super(
`Syntax Exception: invalid syntax \${value} found at \${srcloc.file} (\${srcloc.line}:\${srcloc.col})`
);
}
}
``````

Now we have the machinery set up so we can create the lexer itself.

The Lexer

We'll create a Lexer class that holds an `InputStream` object and has methods for tokenizing the input.

The main `tokenize` method creates an array for tokens, then has a loop that continues as long as we're not at the end of the input. Inside the loop we `peek` at the current character then perform an action based on it.

We'll skip whitespace and comments, and the only token we're reading right now is a number token. We'll do this with the `readNumber` method.

First, we import these functions and types into `src/lexer/Lexer.js`:

``````import { SyntaxException } from "../shared/exceptions.js";
import { SrcLoc } from "./SrcLoc.js";
import { Token } from "./Token.js";
import { TokenTypes } from "./TokenTypes.js";
import {
isDash,
isDigit,
isDot,
isNewline,
isNumber,
isSemicolon,
isWhitespace,
} from "./utils.js";
``````

The `Lexer` class constructor is simple:

``````export class Lexer {
constructor(input) {
this.input = input;
}

static new(input) {
return new Lexer(input);
}
}
``````

For the `readNumber` method we first get the `srcpos` info off the input, then check to see if there's a minus sign. If there's a minus sign, add it to the number token. Then read any ensuing numbers and decimal points into the token value. We need to be able to tell if it's a valid or invalid number token, so we're just going to take all the numbers and dots in until there are no more left.

Then, if it's not a valid number, throw a `SyntaxException`. Otherwise, we create the number token.

In the `Lexer` class, beneath the `new` method:

``````  readNumber() {
let { pos, line, col, file } = this.input;
const srcloc = SrcLoc.new(pos, line, col, file);
let num = "";

if (isDash(this.input.peek())) {
num += this.input.next();
}

num += this.input.readWhile((ch) => isDigit(ch) || isDot(ch));

if (!isNumber(num)) {
throw new SyntaxException(num, srcloc);
}

}
``````

The `tokenize` method runs a loop and dispatches on the current character. Right now all we're doing is skipping whitespace and comments, and reading number tokens. Anything else will cause a `SyntaxException`.

In the `Lexer` class below the `readNumber` method:

``````  tokenize() {
let tokens = [];

while (!this.input.eof()) {
let ch = this.input.peek();
if (isWhitespace(ch)) {
} else if (isSemicolon(ch)) {
} else if (isDash(ch) && isDigit(this.input.lookahead(1))) {
} else if (isDigit(ch)) {
} else {
const { pos, line, col, file } = this.input;
throw new SyntaxException(ch, SrcLoc.new(pos, line, col, file));
}
}

}
``````

That's all for the `Lexer` class for now! Now we'll abstract it away into a `tokenize` function in `src/lexer/tokenize.js`:

``````import { InputStream } from "./InputStream.js";
import { Lexer } from "./Lexer.js";

export const tokenize = (input, file) =>
Lexer.new(InputStream.new(input, file)).tokenize();
``````

Next we'll move onto the reader.

"Reader" is the Lisp equivalent of a parser. A conventional reader dispatches based on the current character, similar to how our lexer works, and reads the input stream into a tree of s-expression data structures that are homoiconic with the source code. That means the code and data can essentially be substituted for one another in the programmer's imagination. The underlying data is the same as the outward representation in text.

Our compiler has both a reader and a parser for pedagogical reasons. We'll read the input into s-expressions so we can have all the power and elegance of Lisp macros, but we'll then parse that tree into a fuller AST so I can show you some additional compiler techniques that will help us make the language more interesting.

Technically we could have done all this with the s-expression tree produced by the reader and expander, but it's a little easier to do with a fuller AST and we're prioritizing understandability over efficiency here.

First we'll create a `Reader` class that manages the token stream in similar fashion to how the `InputStream` class manages the input string.

The `Reader` class has 4 methods:

• `eof`, which checks to see if we're at the end of the token stream
• `next`, which returns the current token and advances the stream by 1
• `peek`, which returns the token at the current position
• `skip`, which skips the current position without returning a token

In `src/reader/Reader.js`:

``````export class Reader {
constructor(tokens) {
this.tokens = tokens;
this.pos = 0;
}

static new(tokens) {
}

get length() {
return this.tokens.length;
}

eof() {
return this.pos >= this.length;
}

next() {
return this.tokens[this.pos++];
}

peek() {
return this.tokens[this.pos];
}

skip() {
this.pos++;
}
}
``````

The `read` functions take the `Reader` as a parameter and read syntactic forms from the token stream. Since we're only reading numbers right now, the `read` functions are very similar.

The main `read` function currently returns an array of forms. The only form we're reading right now is a number token. In the future, we'll change the data structure created by `read` from an array to an s-expression made up of linked lists using cons cells. If you don't know what that means, don't worry about it—I'll show you exactly what that means in a future post. We'll also add additional syntactic forms besides plain tokens and lists.

The `read` function dispatches to the `readForm` function, which reads forms from the current position in the token stream. Right now our only form is the number token, which is an atomic form, so `readForm` just passes the `Reader` through to `readAtom`. This will change soon.

In `src/reader/read.js`, import these types:

``````import { TokenTypes } from "../lexer/TokenTypes.js";
import { SyntaxException } from "../shared/exceptions.js";
``````

Here's the `readAtom` function in its current state:

``````const readAtom = (reader) => {

switch (tok.type) {
case TokenTypes.Number:
default:
throw new SyntaxException(tok.value, tok.srcloc);
}
};
``````

As mentioned above, `readForm` currently just delegates to `readAtom`:

``````const readForm = (reader) => {
};
``````

And `read` processes the whole token stream:

``````export const read = (tokens) => {
let parseTree = [];

}

return parseTree;
};
``````

The Expander

The expander is simple, since there's nothing yet to expand. It merely takes the output of the reader and returns it. In `src/expander/expand.js`:

``````export const expand = (parseTree) => parseTree;
``````

The Parser

Unlike most Lisp languages, this one has a separate parsing stage that converts the s-expression tree into a JSON-compatible AST made up of object nodes.

Since we're only compiling numbers for now, there are only 2 AST nodes: `Program` and `NumberLiteral`. I'm writing the AST module as an object of node constructor methods, rather than as a set of classes.

Every node has the properties `type` and `srcloc` that contain the node type and `srcloc` info, respectively. Different node types have different properties based on the data needed to represent the language form they signify.

The `Program` node has a `body` property that is an array of nodes that make up the program body. `NumberLiteral` has a `value` property that is the text value from the number token.

In `src/parser/ast.js`:

``````export const ASTTypes = {
Program: "Program",
NumberLiteral: "NumberLiteral",
};

export const AST = {
Program(exprs) {
return {
type: ASTTypes.Program,
body: exprs,
srcloc: exprs[0]?.srcloc ?? SrcLoc.new(0, 0, 0, "none"),
};
},
NumberLiteral(token) {
return {
type: ASTTypes.NumberLiteral,
value: token.value,
srcloc: token.srcloc,
};
},
};
``````

The parser also uses the `Reader` class from the reader to manage its input state. This will change when we edit the reader to output a cons data structure instead of a JavaScript array, but it's good enough for now.

In `src/parser/parse.js`, import the following:

``````import { TokenTypes } from "../lexer/TokenTypes.js";
import { SyntaxException } from "../shared/exceptions.js";
import { AST } from "./ast.js";
``````

Like with the reader, we have a main `parse` function that dispatches to the `parseExpr` function. Since our only expression type right now is numbers, all `parseExpr` currently does is dispatch to `parsePrimitive`.

The parser will change quite a bit when we start returning a linked list from the reader instead of an array, but this will do for now:

``````const parsePrimitive = (reader) => {

switch (token.type) {
case TokenTypes.Number:
return AST.NumberLiteral(token);
default:
throw new SyntaxException(token.value, token.srcloc);
}
};

const parseExpr = (reader) => {
};

export const parse = (readTree) => {
let body = [];

}

return AST.Program(body);
};
``````

That's it for the parser. In languages with more complicated syntax, parsing is a huge and complicated undertaking. Here, it's pretty simple. Even as we add more forms to the language that require parsing, it will still be much simpler for us than it would be in a language with more complicated syntax.

If there's interest, I may write a tutorial for a more complex parser in the future. Let me know in the comments if that's something you'd like to see!

The Desugarer

The desugarer is the first stage in the back end of the compiler. It takes higher, more complex forms in the language and converts them into simpler core forms. Sometimes you'll see this process called "lowering."

Like the expander above, the desugarer has nothing to do right now since we only have a single form in our language. That will change in the near future. For now, just put this in `src/desugarer/desugar.js`:

``````export const desugar = (ast) => ast;
``````

And with that, we're ready to build the code emitter!

The Emitter

The emitter for our language is going to be fairly simple as well. It simply walks the desugared AST and emits code based on the nodes.

It uses a design pattern called the Visitor Pattern to "visit" each node of the tree and perform an action based on the node type. In this case, the action is emitting the code.

The emitter uses postorder traversal to visit the nodes, meaning it acts on the children at the bottom of the hierarchy before it acts on their parent nodes.

The Dispatcher Method

The `emit` method dispatches node-specific emitter methods depending on the node type. It uses a `switch` statement on the `node.type` property.

Since the only node types we have right now are `Program` and `NumberLiteral`, those are the only nodes we have emitters for.

The `default` case in `emit` throws a `SyntaxException`, though this is only for completeness to make sure we remember to implement emitter methods for every core node type.

In `src/emitter/Emitter.js`:

``````import { ASTTypes } from "../parser/ast.js";
import { SyntaxException } from "../shared/exceptions.js";

export class Emitter {
constructor(program) {
this.program = program;
}

static new(program) {
return new Emitter(program);
}

emit(node = this.program) {
switch (node.type) {
case ASTTypes.Program:
return this.emitProgram(node);
case ASTTypes.NumberLiteral:
return this.emitNumber(node);
default:
throw new SyntaxException(node.type, node.srcloc);
}
}

emitNumber(node) {
return node.value;
}

emitProgram(node) {
let code = "";

for (let n of node.body) {
code += this.emit(n);
}

return code;
}
}
``````

Then in `src/emitter/emit.js`:

``````import { Emitter } from "./Emitter.js";

export const emit = (ast) => Emitter.new(ast).emit();
``````

With that, we can create the whole compilation pipeline from Wanda input to JavaScript output.

The CLI

We're not going to build out a whole big CLI application just yet. I'll dedicate a whole article to that in the future. It would be nice to have some streamlined functions to let us compile and run the code though, as well as a REPL so we can run test programs.

Let's create a simple compilation pipeline in `src/cli/compile.js`:

``````import { tokenize } from "../lexer/tokenize.js";
import { expand } from "../expander/expand.js";
import { parse } from "../parser/parse.js";
import { desugar } from "../desugarer/desugar.js";
import { emit } from "../emitter/emit.js";

export const compile = (input, file = "stdin") =>
``````

Then an evaluator function using the Node.js `vm` module in `src/cli/eval.js`:

``````import vm from "vm";
import { compile } from "./compile.js";

export const EVAL = (input) => vm.runInThisContext(compile(input));
``````

Note that `EVAL` is in all caps because `eval` is a reserved word in JavaScript.

For our simple REPL we'll need a way to get input from the console. The simplest way to do that is just to install the `readline-sync` package from NPM:

``````npm install readline-sync
``````

And here's the simplest of REPLs in `src/cli/repl.js`:

``````import readlineSync from "readline-sync";
import { EVAL } from "./eval.js";
import { print } from "../printer/print.js";

export const repl = () => {
let prompt = "wanda> ";

while (true) {
try {
let result = EVAL(read(prompt)) ?? "";

if (result === "") {
process.exit(0);
}

print(result);
} catch (e) {
console.error(e.message);
}
}
};
``````

Finally, we run the CLI by simply calling the `repl` function in `src/cli/run.js`:

``````import { repl } from "./repl.js";

repl();
``````

Then run the REPL with `node src/cli/run.js` and enter numbers to your heart's content!

Note that any input that is not a valid number (integer or floating point, exponential notation not supported) will cause a `SyntaxException`. We'll start fixing that in the next article in the series.

Conclusion

Whew! That was a lot. But you've come a long way – you now have a fully functional compiler and evaluator with REPL! Granted, you can only run the simplest of programs, numeric literals, but we're going to make things much more interesting as we progress through the series.

I hope you understand better now how programming languages and compilers work, and I'm looking forward to adding additional features to Wanda with you in the days ahead.

You can see the code from this article in the Wanda GitHub repo. I'll conveniently tag the code from each article so you can see how things evolve as we move along. The repo code also has JSDoc comments for most of the functions and types, though I can't promise I've remembered to document every single one. There are also tests. You can run the test suite with the `npm test` command.

See you next time!