Introduction
I write JavaScript nearly every day at the moment and love it, but I don’t really know what happens to my code once I run it. Normally I run it through Node, or the browser, and sometimes it works.
But I want to know what’s actually happening beneath the surface. How does my syntax become a working program? I’m no computer scientist, so I’m not going super deep into this, but I want to at least have a superficial understanding of my tools.
As part of this, I thought it would be a fun exercise to try and create a very, very basic programming language in JavaScript. I’m more or less following Steve Kinney’s path in this, so I’ve got a good example.
What does a programming language consist of?
So a programming language needs a way to understand the syntax it has been provided, and a way to interpret or compile these instructions into machine-readable code. Effectively we are turning our high level code into slightly lower level code.
I’m keeping this very simple and building a basic lexical analysis tool, or lexer, and a simple syntactic analysis tool, or parser, which produces an AST (Abstract Syntax Tree).
This will accept my string of syntax, tokenize it, and then run the logic.
The current post will focus on building our lexer, and a subsequent post will handle our AST and related tooling.
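To put that in concrete terms, here is the rough pipeline we’re aiming for across the two posts. This is only a sketch; the function names are placeholders for now and only the tokenizing step is built in this post:

```js
// Rough sketch of the pipeline, not real code yet:
// tokenize("(add 2 3)")    ->  a flat list of tokens       (this post)
// parse(tokens)            ->  an AST                      (next post)
// run / transpile the AST  ->  a result or JavaScript code (next post)
```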
Building a lexer in JavaScript
So a lexer basically takes a string of code and splits it into individual elements, or ‘tokens’. A token is just a small unit of the language. For example, look at this string in JavaScript:
sum(2, 1)
A lexer will split it into individual elements like this:
+ sum
+ (
+ 2
+ ,
+ 1
+ )
We effectively accept a string of code, iterate through each character, and check each character to see if it matches a pre-defined set of tokens. If so, we add it to our tokens collection and return them at the end to be interpreted.
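To make that concrete, here is the shape of output we’re aiming for from the tokenizer we’re about to build, using a small Lisp-style expression (the exact token types are the ones we’ll define below):

```js
// What we want tokenize to eventually give us.
// Note that whitespace is skipped entirely:
tokenize("(add 2 3)")
// [
//   { type: "Parenthesis", value: "(" },
//   { type: "Name", value: "add" },
//   { type: "Number", value: 2 },
//   { type: "Number", value: 3 },
//   { type: "Parenthesis", value: ")" },
// ]
```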
Getting started
Now we have a rough idea of how a lexer works, let’s get started building one! Firstly, we’ll make some helper functions to determine character types:
const LETTER = /[a-zA-Z]/
const WHITESPACE = /\s+/
const NUMBER = /^[0-9]+$/
const OPERATORS = ["+", "-", "*", "/", "%"]
const isLetter = character => LETTER.test(character)
const isWhitespace = character => WHITESPACE.test(character)
const isNumber = character => NUMBER.test(character)
const isOpeningParenthesis = character => character === "("
const isClosingParenthesis = character => character === ")"
const isParenthesis = character =>
isOpeningParenthesis(character) || isClosingParenthesis(character)
const isQuote = character => character === '"'
const isOperator = character => OPERATORS.includes(character)
const helpers = {
isLetter,
isWhitespace,
isNumber,
isOpeningParenthesis,
isClosingParenthesis,
isParenthesis,
isQuote,
isOperator,
}
export default helpers
As we can see here, we have a number of methods which accept a character and run a simple RegEx (regular expression) or a direct comparison on it to determine whether it matches one of the character types we defined at the top of the file. In particular, we are looking for letters, whitespace, numbers, and operators.
Because the language we are building is inspired by Lisp, we will definitely need to know about parentheses, so we create specific helpers for these.
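As a quick sanity check, this is how the helpers behave, assuming the file above is saved as helpers.js (which is what the import in the next section expects):

```js
import helpers from "./helpers"

helpers.isLetter("a") // true
helpers.isNumber("7") // true
helpers.isWhitespace(" ") // true
helpers.isParenthesis("(") // true
helpers.isQuote('"') // true
helpers.isOperator("+") // true
helpers.isLetter("3") // false
```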
Building our token parser
Now that we have some helpers to determine the characters we are working with, we want to put them to use! So let’s build a simple tokenizer:
import helpers from './helpers';
const tokenize = (input) => {
const tokens = [];
let cursor = 0;
while (cursor < input.length) {
const character = input[cursor];
if (helpers.isParenthesis(character)) {
tokens.push({
type: 'Parenthesis',
value: character,
});
cursor++;
continue;
}
throw new Error(`${character} is not valid.`);
}
return tokens;
};
export default tokenize;
Let’s walk through this. Firstly, we define our tokenize function, which accepts an input.
Next, we create an empty array for our tokens which we’ll fill later. We also create a cursor variable which we’ll use to track our position in the input.
With that initial setup done, let’s walk through the input. We’re using a while loop here since it’s fast and gives us a good deal of control over our cursor position. We could also use something like reduce, but in theory we could be working with some very big inputs, which could cause performance issues, and it also makes it harder to control exactly where the cursor is (but please get in touch if you have a cool way of doing this).
So we traverse the length of our input, which is the code, and we assign the character at the current position to our character variable for the sake of legibility.
Time to run our first check! We want to see if the character is an opening or closing parenthesis. To do this, we use our isParenthesis helper and, if it matches, we push an object to our tokens array providing the type and value. We could express this in a test:
it("should tokenize a pair of parentheses", () => {
const input = "()"
const result = [
{ type: "Parenthesis", value: "(" },
{ type: "Parenthesis", value: ")" },
]
expect(tokenize(input)).toEqual(result)
})
So now we’re capturing parentheses, we want to figure out the rest of our tokens:
if (helpers.isWhitespace(character)) {
cursor++;
continue;
}
if (helpers.isNumber(character)) {
let number = character;
/**
* We want to account for multi-digit numbers, so we
* look ahead in our string to see if the next character
* is a number. We assume white space is the end of a number.
*/
while (helpers.isNumber(input[++cursor])) {
number += input[cursor];
}
tokens.push({
type: 'Number',
value: parseInt(number, 10),
});
continue;
}
if (helpers.isLetter(character)) {
let symbol = character;
/**
* We want to account for words, so we look ahead in our
* string to see if the next character is a letter.
*
* We assume white space is the end of a word.
*/
while (helpers.isLetter(input[++cursor])) {
symbol += input[cursor];
}
tokens.push({
type: 'Name',
value: symbol,
});
continue;
}
if (helpers.isQuote(character)) {
let string = '';
while (!helpers.isQuote(input[++cursor])) {
string += input[cursor];
}
tokens.push({
type: 'String',
value: string,
});
cursor++;
continue;
}
Some of these are simple such as a check for whitespace, but others are more complex, so we’ll dig into these.
Tokenizing Digits
Tokenizing a single digit is pretty straightforward, but it becomes more complex with multi-digit numbers. If we didn’t take this into account, we could have 101 as an input but it would be split into 1, 0, and 1. This could be pretty disastrous for our tiny language!
So instead we need to look ahead of our current character and see if the next item is a number as well. If so, we can assume it is part of a continuous number. So we introduce a while loop and increment our cursor to check whether the next character is a number. If it is, we append it to our number variable, and keep going until we reach the end of the number.
As some example tests, we can do this:
it("should tokenize a single digit", () => {
const input = "3"
const result = [{ type: "Number", value: 3 }]
expect(tokenize(input)).toEqual(result)
})
it("should tokenize a continuous number", () => {
const input = "33"
const result = [{ type: "Number", value: 33 }]
expect(tokenize(input)).toEqual(result)
})
Tokenizing Words
The logic for tokenizing a word is more or less the same as for numbers, so you can refer to the logic above. As an example test:
it("should tokenize a continuous Name", () => {
const input = "abc"
const result = [{ type: "Name", value: "abc" }]
expect(tokenize(input)).toEqual(result)
})
Tokenizing Quotes
Finally, we want to be able to handle strings inside quotes. There are a few gotchas here which haven’t been implemented, like handling single as well as double quotes, or escaped characters, but for our purposes it works fine.
In this case we don’t really care about the quotation marks other than the fact that they act as boundaries for the beginning and end of a quoted string. To account for this, we invert the logic and, for every character which is not a quote mark, we add it to our string variable. When we hit the closing quote, the loop breaks and we continue iterating through the tokenizer.
As a simple test, we could run:
it("should handle a quoted string", () => {
const input = '"hello"'
const result = [{ type: "String", value: "hello" }]
expect(tokenize(input)).toEqual(result)
})
Finished Result
All in, your code should look something like this:
import helpers from "./helpers"
const tokenize = input => {
const tokens = []
let cursor = 0
while (cursor < input.length) {
const character = input[cursor]
if (helpers.isParenthesis(character)) {
tokens.push({
type: "Parenthesis",
value: character,
})
cursor++
continue
}
if (helpers.isWhitespace(character)) {
cursor++
continue
}
if (helpers.isNumber(character)) {
let number = character
/**
* We want to account for multi-digit numbers, so we
* look ahead in our string to see if the next character
* is a number. We assume white space is the end of a number.
*/
while (helpers.isNumber(input[++cursor])) {
number += input[cursor]
}
tokens.push({
type: "Number",
value: parseInt(number, 10),
})
continue
}
if (helpers.isLetter(character)) {
let symbol = character
/**
* We want to account for words, so we look ahead in our
* string to see if the next character is a letter.
*
* We assume white space is the end of a word.
*/
while (helpers.isLetter(input[++cursor])) {
symbol += input[cursor]
}
tokens.push({
type: "Name",
value: symbol,
})
continue
}
if (helpers.isQuote(character)) {
let string = ""
while (!helpers.isQuote(input[++cursor])) {
string += input[cursor]
}
tokens.push({
type: "String",
value: string,
})
cursor++
continue
}
throw new Error(`${character} is not valid.`)
}
return tokens
}
export default tokenize
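As a quick end-to-end check, we can feed the finished tokenizer a small expression that exercises parentheses, names, numbers, and strings. The file name tokenizer.js is just an assumption about where you saved the code above:

```js
// Assumes the finished tokenizer above is saved as tokenizer.js
import tokenize from "./tokenizer"

console.log(tokenize('(add 2 "hello")'))
// [
//   { type: "Parenthesis", value: "(" },
//   { type: "Name", value: "add" },
//   { type: "Number", value: 2 },
//   { type: "String", value: "hello" },
//   { type: "Parenthesis", value: ")" },
// ]
```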
Conclusion
And…that’s pretty much it! Although a lexical analysis tool sounds pretty tricky, the basic version is actually quite simple to make.
Don’t be fooled, though: building an actually useful one would take a lot of time and effort. Yes, JavaScript was famously written in ten days, but that’s more a testament to Brendan Eich’s skill than to the task being simple. This stuff really is hard!
With that said, we’ve done a good job today. Going from zero to a functional lexer is no mean feat and we got there!
The next step is to build an AST to break the code into a more meaningful structure, so that we can see what our tokens want to achieve, and then transpile this into JavaScript. We will do precisely that in another post.
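To give a flavour of where this is heading, the parser in the next post could turn the tokens for (add 2 3) into a structure along these lines. This is only a sketch of one possible shape, not the final design:

```js
// Purely illustrative: one possible AST shape for "(add 2 3)".
// The real structure will be worked out in the follow-up post.
const ast = {
  type: "CallExpression",
  name: "add",
  arguments: [
    { type: "NumberLiteral", value: 2 },
    { type: "NumberLiteral", value: 3 },
  ],
}
```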