Understanding bytecode interpreters

#bytecode #vm

Have you ever wondered how Java or Python work? What is a "virtual machine"? If those questions stir your interest, you've come to the right place!

If everything goes according to the plan, this will be part of a serie of articles on understanding and making bytecode interpreters (we will even see a bit of how you could compile a language to a bytecode).

Why should I care about VMs?

A virtual machine enables high portability: compile once, run everywhere.

You can get performances nearly equal to native programs ones, compared to tree-walk interpreters which have to hop from a node to another, oftentimes not contiguous ones, leading to really poor performances.

Also, it's a lot of fun to make one, because we will be creating our own instructions set (which can be seen as an assembly language) and creating our own "virtual CPU" to run our instructions set. In a way, making a VM is the same as making a small specialized virtual computer!

Basic concepts

A virtual machine (or bytecode interpreter) works on a linear instruction's flow. A "program" being linear gives us way better performances compared to non-linear one, because we don't have to fetch data 5KiB after the current one, then 2MiB before... the CPU cache will love it, as it will be more predictible.

Also, with such type of programs, we have only two things we can do:

execute current instruction, jump to the next one ;
jump to an arbitrary position in the bytecode (useful for loops and conditions)

Instructions

In an high-level language, an instruction would be something like if (condition) then .... Way too verbose for a bytecode, we want to keep it small. In a bytecode, we would have something like PUSH 5; PUSH 4; EQ; POP_JUMP_IF_TRUE 15 (compare 4 and 5, jump to address 15 if the comparison was truthy), with each uppercase letters word being an instruction.

Nota bene: instructions take a fixed number of arguments, so that when we read an instruction like PUSH we know how much we should continue to read from the program before moving onto the next instruction.

To make that even smaller, we could write P5 P4 E PJIT15. We are still wasting space, and fetching an arbitrary length instruction (how do we know we need to read a single char like P or PJIT?) isn't efficient. We could write bytes, with a byte per instruction.

Instruction name	Value
PUSH	0x01
EQ	0x02
POP_JUMP_IF_TRUE	0x03

Our code will look like this now: \x01\x00\x05\x01\x00\x04\x02\x03\x00\x0f! (Assuming we write numbers on a fixed number of bytes, here 2)

Let's break this down:

\x01 \x00\x05   ->   PUSH 0x0005
\x01 \x00\x04   ->   PUSH 0x0004
\x02            ->   EQ
\x03 \x00\x0f   ->   POP_JUMP_IF_TRUE 0x000f

Now you may wonder how EQ works, and why we have PUSH and POP_... instructions. For this short example I assumed we were using a stack based virtual machine. Basically, we have a LIFO stack (last in, first out) on which we put values, and instructions can use this stack to retrieve values (pop them off), do stuff with them, and push back (or not) a result to the stack.

The stack

As specified before, it is a special finite space on which we put our values to operate on them later, for example:

adding two numbers
calling a user-defined function
calling a builtin function
...

We can use the stack to make temporary values, for example in a = 4 + 5 * 12, 5 * 12 would be a temporary, same goes for the 4 + 5 * 12 until it is popped and stored in the variable a.

To operate on this stack with more freedom, we can introduce special instructions such as DUP to duplicate the top of the stack value, and ROT to rotate the two values on top of the stack.

Source: craftinginterpreters.com

Registers?

Every CPU has a finite set of registers, and since our VM is inspired from the architecture of computers, could we (or should we) have registers in our implementation? What would be the benefits?

A stack based VM will need to pop and push very often to its stack in the context of recursive functions, while a register based one can be more easily optimized using a JIT (just in time) compiler, because our computers are register based, making the transformation quite "straightforward".

Generating code for a stack based VM is easier than for a register based one, because the stack is very large and we don't need to keep track of which register is occupied. Also, generated bytecode for stack based VM are smaller than register based ones because our instructions are often on a single byte, we don't need to add information about the registers used per instruction. However, even with larger instructions (leading to longer load time per instruction), register based VM are often faster than stack based ones (even without JIT compilers).

For a more in-depth analysis, go check this stackoverflow thread.

In this serie, we will only target stack based VM because they are easier to use, and while we aim for performance, top performance / near native performances aren't worth a shot here. Near native performances would take us years to achieve, and a lot of micro benchmarks and micro optimizations (that's still doable of course, but not our goal here).

From language to bytecode

Let's take this small snippet from an imaginary language:

let a = 4 + 5 * 12
let foo = (a, b, c) {
    let d = a + b
    print(d - c, c)
    if (d > 0)
        return 2 * a + b * b - 4 * a * c
    else
        return 0
}
foo(a, 2 * a, 3 * a)

If we break this down, we have:

arithmetic operations (+, -, *)
variable declarations
function declarations, with arguments
conditions
return operations
builtins calls (print)
user's functions calls

In term of instructions, we could go with this:

Name	Job	Value
ADD	Add 2 values from the stack and push the result	0x01
SUB	Substract 2 values from the stack and push the result	0x02
MUL	Multiply 2 values from the stack and push the result	0x03
STORE <name>	Store the top of the stack in the given name	0x04
LOAD <name>	Load the variable name from the current scope and push it on the stack	0x05
GT	Push the result of TS > TS1 on the stack (the 2 values are popped)	0x06
RET	Quit the current function with the last value on top of the stack	0x07
POP_JUMP_IF_TRUE <dest>	Pop the top of the stack, if it's truthy jump to the absolute bytecode address dest, otherwise do nothing	0x08
JUMP <dest>	Jump to the given absolute bytecode address	0x09
CALL <argc>	Call TS with argc arguments from the stack	0x0a

A lot is going on here:

we have STORE and LOAD both taking a variable name, implying the existence of a scoping system to keep track of variables. It also implies that we have a way to put a variable name after a fixed length instruction
we have RET to quit the context of a function and go back to previous one, implying that we have a way to keep track of function calls context/history (same goes for CALL)

To solve the problem of fixed length variable names we will introduce a symbol table at the top of our bytecode, keeping track of all the variable names declared in a program. This way, we can reference every variables by a fixed length number! Our symbol table for this code could look like this:

Symbol	ID
a	0
foo	1
b	2
c	3
d	4
print	5

A small problem arise with print: it's a builtin symbol, thus having it at a random symbol id isn't practical. A solution could be to have it with always the same identifier, and all the other user-defined symbols would start after. Now our symbol table will be like that:

Symbol	ID
a	1
foo	2
b	3
c	4
d	5

Notice that print isn't listed in it anymore! Our VM will have to know that every symbol identifier greater or equal to 1 is user-defined, and the ones before are builtins' ones.

As for the context tracking for functions' calls, we will see that in the next article. ;)

Originally from lexp.lt