Kerri Shotts

Posted on Jul 2, 2020 • Originally published at able.bio on Jul 2, 2020

Building a BASIC Interpreter, '80s style

#retrocomputing #javascript #assemblylanguage #basic

It’s funny the rabbit holes one ends up. One of my personal projects for several years has been the creation (exploration, really) of a “fake emulator” — that is, an emulator for a computer that never existed all written in JavaScript. Instead, the machine would pay homage to the eight and sixteen bit machines of the 1980s and ‘90s.

I like to do things the hard way, though: this machine would also be based on a novel instruction set as well. The instruction set would be similar to that of the era, but also be a little easier to use. And so, Retroputer was born. Over several years, the implementation has been built out and improved, although it’ll probably never be “complete” (it’s a personal exploration, after all).

Then @bbcmicrobot became a thing, and I wanted to be able to do a similar thing for Retroputer. My JS development skills are mostly in the arena of the front-end, and so this would be a cool way to get some more back-end skills. One problem: Retroputer could only understand its own assembly language. It had no BASIC support yet.

And so here I am, building a BASIC interpreter, ‘80s style — that is, entirely in assembly language, just like it used to be done. And I thought I’d share that journey, since it’s not often we delve into areas so far from our typical abstractions. My daily driver (JavaScript) makes a lot of things trivial and sometimes those things feel magical. Understanding the lowest levels of the process can often help with understanding those abstractions.

And so… let’s begin.

Parsing in Low-Level Assembly Language

When I wrote the assembler for Retroputer, I was able to use a really nice tool called Pegjs. This made quick work of the assembler’s custom syntax, but was unfortunately there’s nothing like it for Retroputer ASM.

Which means we have to do it the hard way.

Parsing actually occurs in multiple phases. A language that uses a compiler parses the code into an abstract syntax tree (or similar concept), and can then use that tree to generate the resulting native code. A consequence of this is that the program must be syntactically correct in order for the compilation to be successful.

Some interpreters today also have this concept because it’s often useful to generate the intermediate AST and execute from there than it is to execute from the original source.

But for a BASIC interpreter in a machine with limited resources, the most resource-effective way to parse is to do it in multiple phases — some of which occurs at run time. This means, however, that syntax errors often can’t be detected until the program is run and the area of code with the error is encountered.

The three phases of Retroputer BASIC parsing are as follows:

Line Transformation
Tokenization
Runtime Syntax Checking

The first two steps occur as the user enters a program (or loads one). The last one occurs while the program is running. Essentially, the first two build out the rough scaffolding of an airplane, but with no guarantee of flight. The last step is essentially acting as a test pilot — hoping you’ll get off the ground, but not knowing until you try.

Thankfully Retroputer BASIC doesn’t come with such dire consequences for raising an error during runtime.

Note: Source code (in progress) for Retroputer BASIC is available on GitHub.

Line Transformation

This is the easiest part of the entire process. Essentially, the line that the user enters is converted to uppercase so that later processes are easier (and faster). BASIC is not sensitive to case, and so we can use that to our advantage.

print 2+2
' becomes:
PRINT 2+2

Doing this in JavaScript is easy, right?

theLine = theLine.toUpperCase();

But in assembly language, we have to be more detailed about how things get done. We need to read in a character, convert it to uppercase, and then store it somewhere.

           ld y, 0 # y is our index
_loop: ld al, [d, x, y] # [d,x] is pointer to string
           cmp al, 97 # is al (char) in range?
           brs n _continue # Not a lowercase char; continue
           cmp al, 123 # high portion
           brs !n _continue # not a lowercase char; continue
           and al, 0b1101_1111 # uppercase!
           st [d, x, y], al # store back (we modify the original)
_continue: inc y # move our index along
           cmp al, 0 # check for NULL
           brs !z _loop # No? Go back for more.

The above doesn’t quite match the same semantics as the JavaScript version. One important difference is that we now use Unicode to work with text, and so converting input from lowercase to uppercase can often be more difficult — and maybe impossible (depending on the language). Retroputer lives in the world of ASCII (rather, it’s own variation, named RetSCII), meaning that all supported characters are encoded into eight bits. This is woefully inadequate for many languages, but also true to the period.

It also means that we can use a nice feature of ASCII to convert from lowercase to uppercase. It turns out that uppercase “A” is represented with 65 in ASCII, and lowercase “a” is represented with 97. If you’re familiar with your powers-of-two, that difference should catch your eye.

So it turns out that lowercase letters are represented with a number that’s exactly 32 above the uppercase letter. Once we know something’s in range, all we need to do is subtract 32!

That works, but we could just do some bit twiddling. For Retroputer this wouldn’t actually be any faster than subtraction, but avoiding subtraction means we don’t have to worry about the carry/borrow flag during arithmetic. It turns out we can use a bitwise and to turn off the bit for the 32 place value instead.

and al, 0b1101_1111 # turn off bit in 32-place
# versus
clr c # clear carry
sub al, 32 # subtract 32

But there’s a catch: not everything can be converted to uppercase. If the user has included a string literal, for example, we have to be more careful. After all, we don’t want Retroputer BASIC to scream at the user all the time, right? (Although many computers of the era didn’t have lowercase capability, Retroputer doesn’t share that same limitation.)

For example:

print "Hello, World!"
' should become:
PRINT "Hello, World!"
' and not
PRINT "HELLO, WORLD!"

This means we need to keep track of whether or not we’re in the middle of a string literal. In BASIC, there’s only one signifier for this: the double quote. If we check if a character is a double quote, we can set a flag, and depending on the flag’s value, we can perform an uppercase operation or leave things alone.

It turns out that in JavaScript there’s no built-in to accomplish this, but we can build one:

const len = theLine.length;
let insideString = false;
for (let i = 0; i < len; i++) {
    const ch = theLine[i];
    if (ch === `"`) insideString = !insideString;
    if (!insideString) {
        const newCh = ch.toUpperCase();
        if (ch !== newCh) theLine[i] = newCh;
    }
}

Now the logic of the JS more closely matches that of the assembly version, although we’re taking advantage of JS’s unicode support a bit more.

The assembly version looks like this:

           ld y, 0 # y is our index
           ld bl, 0 # === insideString (false)
_loop: ld al, [d, x, y] # [d,x] is pointer to string
           cmp al, 34 # is al a double quote?
           brs !z check_char # no? should we uppercase it?
           xor bl, 0xFF # yes? toggle insideString
_check_char:
           cmp bl, 0xFF # inside a string?
           brs z _continue # yes? don't modify it
           cmp al, 97 # is al (char) in range? "a"
           brs n _continue # Not a lowercase char; continue
           cmp al, 123 # high portion "z"
           brs !n _continue # not a lowercase char; continue
           and al, 0b1101_1111 # uppercase!
           st [d, x, y], al # store back (we modify the original)
_continue: inc y # move our index along
           cmp al, 0 # check for NULL
           brs !z _loop # No? Go back for more.

So far all we’ve done is transform the input text to uppercase, but there’s one extra benefit here in the way we’ve had to track if we’re inside a string. We can do one round of syntax checking here!

If, at the end of the process we find that inString is still true (bl = 0xFF), we can trigger an error, because it means that there’s an unterminated string literal somewhere in the line.

Side note: It turns out many BASICs are quite lenient when it comes to terminating quotes for strings. One of many things I learned while building my own interpreter. Even so, it doesn’t feel right to me, and so Retroputer BASIC doesn’t permit it.

Tokenization

The next phase of parsing involves converting an entered line into something more efficient for Retroputer BASIC to execute. This is as close to the concept of an abstract syntax tree that we’ll get here — the result will definitely not be a tree. But it will be something that we can quickly evaluate during runtime.

One common feature of early microcomputers was a very limited memory capacity. Retroputer has more memory than most machines of the time had by default, but it still has much less than modern machines. As such, long BASIC programs could easily consume far too much memory if they were stored as the user typed them.

To save space, keywords are tokenized as the program is entered into memory. This process converts keywords into single-byte tokens. Keywords are always at least two bytes long, and so this savings can add up. It also means that we can use a lookup table during execution to call the appropriate assembly language routines.

Retroputer BASIC goes a little further than most BASICs of the time, though. It will also convert numbers to binary representations, mark strings, calculate variable references, and more. This wastes some space, to be honest, but the performance benefits (and ease of execution) help outweigh this.

So, there’s a few steps involved here:

Tokenize numbers

Numbers are converted to their binary form to avoid having to convert them each time they are encountered. For numbers encountered only once, this isn’t a huge performance benefit, but in a tight loop, this is beneficial since the number is already in a form the computer can understand.
Mark strings

Because memory is limited, if there’s a string in the code that can be used as-is, it makes sense to do so. For example, PRINT “Hello, World” can print “Hello, World” directly from the program line, rather than allocating new space, copying the string, and then printing it.

To make it easy to skip strings during execution, we also store the length of the string itself.
Search keyword table

Anything that’s not a number or a string might be a keyword — so we need to take a look through the list of keywords. This is trivial in JavaScript, but it’s not so easy in assembly language!

Once a keyword is found, the associated token is stored in program memory (instead of the entire keyword itself). This can result in significant storage savings, especially when PRINT can be reduced to a single byte!
Calculate variable pointers

Retroputer BASIC variable names are only significant to the first two characters (currently). This makes it trivial to look up a variable in an array with a fairly simple mathematical expression. Even so, this calculation takes time, and so it’d be nice if we didn’t have to do it every time we encountered the variable.

Retroputer BASIC will calculate this index and store it alongside the variable name. In addition to the variable name, it also stores the length of the variable to speed up runtime execution. This consumes a good amount of space, and so wouldn’t have been a good solution on computers with limited memory, but it works for Retroputer BASIC.

I won’t go into the assembly language for this step in this post. I’ll save that for a future post. Rest assured, though, it takes a lot of code.

Runtime Syntax Checking

Last, but definitely not least, is checking syntax at runtime. This is reasonably trivial to do once you have a tokenized representation of the code.

First, as part of the execution phase, BASIC checks if it is currently looking at a token. All tokens have the high bit set (so they have a value of 128 or higher). If a token is found, we can determine what subroutine to call simply by looking it up in a vector table. This also makes it trivial to render syntax errors — some keywords make no sense as statements, and so the vector table just points to the routine that generates a syntax error.

Once a statement’s token handler is called, the handler takes over additional parsing responsibilities. It can use gettok, gettok-raw, peektok, etc., to get and advance past tokens. If the token is something the routine didn’t expect, the routine just returns an error code. This is where both syntax and type errors are caught.

Should a statement need to evaluate an expression, another phase of parsing is performed. During expression parsing another vector lookuptable is used, which means we can catch keywords that don’t make sense inside a mathematical expression and raise the appropriate errors. For example, if you tried to enter PRINT 2+CLS, you’d get a syntax error at the CLS portion (CLS is a keyword that is short for “clear screen”).

Note: We can also determine operator precedence and number of required parameters for functions from this table. This is important for actually evaluating the expression, but we also use these to catch cases where the user may not have supplied enough arguments.

Because the token directly maps to an entry in a vector lookup table, execution can proceed pretty quickly with minimal effort. The work of parsing each kind of statement is left to the handler itself, and generally this isn’t too much of a problem. PRINT and INPUT are probably the most complex to parse, but every step is taken a token at a time.

Because a lot of checking isn’t done until runtime, it does mean that you can have partial results before an error occurs. For example:

PRINT "Hello";CLS
Hello
?Syntax Error

It also means that should your program leave the screen in a state where you can’t actually see text, you could be up a tree in terms of recovering. The syntax error is printed, but if you can’t see it… well, what are you going to do?

There’s definitely downsides to this kind of syntax checking, but it also makes for a reasonably simple interpreter as well.

Next Time

Next time we’ll talk go into a little more detail about how the second parsing phase works, and how much easier it would be in JavaScript with modern abstractions and standard libraries. But every step in this process gives me an even greater appreciation for our modern conveniences, and just how much work is going on below the surface.

Top comments (1)

Sebastian Nozzi • Nov 20 '21 • Edited

Amazing