Building A Virtual Machine

#cpp #virtualmachine #bytecode #language

Over the last few months I've been making a virtual machine I'm calling Nabla, and I wanted to share some of what I've done.

The Nabla VM grew out of wanting to make a programming language. I wanted it to be higher level, and I didn't want to just define a grammar and plug it into LLVM. I wanted to really MAKE the language. By "make" I mean that I wanted to write things from top to bottom, and I wanted that bottom to be special. What do I mean by special? I mean that I wanted the bottom-most layer of the language to be an executable byte code so I wouldn't have to target different platforms. I figured making a VM that executed byte code ought to be easy enough.

Stack Based, or Register Based?

As I started to investigate what the world was doing with VMs, I noticed that there were two primary types of VMs that languages leveraged to perform their operations: stack based and register based. There is an article here talking about the two, but basically it comes down to the means by which you move data around. Do you use a stack, or pre-defined locations for storage (registers)? I only thought about this for about 5 minutes before deciding that what I wanted was something... more similar to actual computer architecture. Now I'm not claiming that the Nabla VM represents physical computer architecture 1-for-1, but it is definitely more similar to that than a simple stack. Realistically, having designed the VM around this register idea, the distinction is kind of pointless when it comes to the Nabla VM. It does in fact use registers, but it also leverages stacks for computation... as do most things.
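To make the difference concrete, here is a minimal sketch of the same addition done both ways. This is not code from the Nabla VM; the function names are made up for illustration, and the pseudo-instructions in the comments are generic rather than Nabla ASM.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Stack based: operands are implicit and live on an operand stack.
// Computes a + b as "push a; push b; add".
std::int64_t add_on_stack_machine(std::int64_t a, std::int64_t b) {
    std::vector<std::int64_t> stack;
    stack.push_back(a);                       // push a
    stack.push_back(b);                       // push b
    const std::int64_t rhs = stack.back();    // add: pop two operands...
    stack.pop_back();
    const std::int64_t lhs = stack.back();
    stack.pop_back();
    stack.push_back(lhs + rhs);               // ...and push the result
    return stack.back();
}

// Register based: every instruction names its storage locations explicitly.
// Computes a + b as "load r0, a; load r1, b; add r2, r0, r1".
std::int64_t add_on_register_machine(std::int64_t a, std::int64_t b) {
    std::array<std::int64_t, 16> r{};         // 16 registers, all zeroed
    r[0] = a;
    r[1] = b;
    r[2] = r[0] + r[1];
    return r[2];
}
```

Both functions return the same sum; what changes is whether an instruction has to say where its operands live.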

So what does this Architecture look like?

I'm glad you asked. The Nabla VM was built around a concept I've wanted to play with for a while: a physical distinction between data and instructions. Usually when you run assembly code, instructions and data live in the same memory and are executed sequentially. Because they share an area, fun things like stack smashing (overflowing a buffer on the stack) can allow mischievous users to overwrite memory and execute their own routines. With a physical distinction between data and instructions, we can mitigate some (only some, the idea is not bulletproof) of the attack vectors that people could use to poke and prod around on a machine.

Now, being up front, the Nabla VM isn't meant to be secure or anything fancy like that. I just wanted to take that idea and have it influence my design. I realize that the VM exists on a regular machine and the distinction I'm making between instructions and data is just a facade.

I decided that there would be physical areas that separated out the functionality of code such that these areas' data could even be separated. Now, in order for these distinct things to communicate, they needed methods to transfer information. As discussed earlier, I decided to go with a register based VM, so obviously there are some registers, but what about big swaths of data? With this in mind I decided to add a global area for shared data storage. Wild! Making things even more crazy, I added stacks to each of the function units so they could process data without leveraging the storage stack for arithmetic calculations. This will make sense later on when I start to cover the pseudo-parallelism that the VM can do.

So what does this thing look like?

Stunning!

I gave the machine 16 registers, a global stack (GS), function units (FU), and stacks local to a function unit (LS).
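As a rough sketch, those pieces could be modelled like this. The type and function names are assumptions made for illustration, not the VM's real source, and placing the registers on the machine rather than on each function unit is also a guess.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout; names and types are assumptions, not Nabla's source.
struct FunctionUnit {
    std::vector<std::uint64_t> instructions;  // code, kept apart from data
    std::vector<std::int64_t> local_stack;    // LS: private to this unit
};

struct Machine {
    std::array<std::int64_t, 16> registers{};   // the 16 registers
    std::vector<std::int64_t> global_stack;     // GS: shared data storage
    std::vector<FunctionUnit> function_units;   // FU: one per function
};

// Copy a register onto the local stack of one function unit.
void push_local(Machine& m, std::size_t unit, std::size_t reg) {
    m.function_units.at(unit).local_stack.push_back(m.registers.at(reg));
}

// Copy a register onto the shared global stack.
void push_global(Machine& m, std::size_t reg) {
    m.global_stack.push_back(m.registers.at(reg));
}
```

The point of the layout is that a function unit's scratch work stays on its own LS, and only data meant to be shared goes through GS.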

Instruction Picking

So now we have a box. With boxes in it, and in some cases, another layer of boxes. Now we just need a stick to poke them with and make them do things. This stick of course is an instruction set. Now when it comes to instruction sets there are a few overall ideas called RISC, CISC, and MISC that I got to choose from. What a world! These stand for "Reduced Instruction Set Computing," "Complex Instruction Set Computing," and "Minimal Instruction Set Computing." The gist of what these mean is that I can either have a literal metric ton of instructions (CISC), a reasonable amount of instructions (RISC), or barely enough to do anything more than basic computation, so we'd have to do tons of execution to do literally anything (MISC). My descriptions here are clearly biased by my opinion, and it should be clear that I chose RISC. Mostly because I didn't want to write all of the instructions, but also because lots of types of instructions in a virtual environment would have a detrimental impact on the execution cycle (it would be slower).

Armed with the idea that I wanted a 'reasonable' amount of instructions, I started writing out the things I would need to facilitate computation in / with the boxes above. As I started the project I realized that I also needed to decide whether I would have fixed-width or variable-width instructions. There are pros and cons to both, but ultimately I decided to go with fixed-width 64-bit instructions (for execution). Why 64-bit? Primarily because I could stuff a whole lot more information directly into each instruction with 64 bits than, say, with 8, 16, or 32 bits. Why not 128 / 512? Because that's ridiculous, and because I didn't need that much space. A lot of that space would have been wasted.
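To show what "stuffing information into the instruction" can look like, here is a small sketch that packs an opcode, a register, and an immediate value into one 64-bit word and unpacks it again. The field layout is invented for this example and is NOT Nabla's actual encoding; see the instruction set documentation for the real one.

```cpp
#include <cstdint>

// Hypothetical field layout (an assumption, not Nabla's real encoding):
//   bits 63..56  opcode
//   bits 55..48  register
//   bits 31..0   immediate value
struct Decoded {
    std::uint8_t opcode;
    std::uint8_t reg;
    std::uint32_t immediate;
};

// Pack the three fields into a single fixed-width 64-bit instruction.
constexpr std::uint64_t encode(std::uint8_t opcode, std::uint8_t reg,
                               std::uint32_t immediate) {
    return (static_cast<std::uint64_t>(opcode) << 56) |
           (static_cast<std::uint64_t>(reg) << 48) |
           static_cast<std::uint64_t>(immediate);
}

// Recover the fields with shifts and masks; no variable-length parsing needed.
constexpr Decoded decode(std::uint64_t word) {
    return Decoded{static_cast<std::uint8_t>(word >> 56),
                   static_cast<std::uint8_t>((word >> 48) & 0xFF),
                   static_cast<std::uint32_t>(word & 0xFFFFFFFFu)};
}
```

With a fixed width, the decoder always knows where the next instruction starts, and a 32-bit immediate fits in the instruction itself instead of needing a second fetch.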

The Instruction Set

The instruction set documentation can be found here. I don't want to go into too much detail about the instruction set here, but this is an example of the ASM that the Nabla VM can run:

.file "Example"
.init main

.int8 my_gs_int 24 ; Special instruction to load a value into GS on init

; This 'function' represents one of the yellow boxes above
<main:

    mov r0 $42     ; Load 42 into register 0
    pushw ls r0    ; Push the entire register into the local stack
    pushw gs r0    ; Push the entire register into the global stack
    ret
>

As you can see, the instruction set specifies things like which of the boxes above to store data in, but what you don't see is that the instructions themselves reside outside of the data storage areas! You can't access the instructions at run time at all. Beautiful.
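Here is a toy interpreter that sketches that separation. It mirrors only the body of main from the listing above (the .int8 directive is left out), and every name in it (Op, Instruction, State, run, run_example) is hypothetical rather than taken from the Nabla VM source.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opcodes and structures; not NablaVM's real implementation.
enum class Op { Mov, PushLs, PushGs, Ret };

struct Instruction {
    Op op;
    std::size_t reg;      // register operand
    std::int64_t value;   // immediate, used by Mov only
};

// Everything a running program can touch. Note there is no code in here.
struct State {
    std::array<std::int64_t, 16> registers{};
    std::vector<std::int64_t> local_stack;    // LS
    std::vector<std::int64_t> global_stack;   // GS
};

// The program is passed as const and is not reachable through State,
// so no instruction can read or overwrite the code it belongs to.
void run(const std::vector<Instruction>& program, State& state) {
    for (const Instruction& ins : program) {
        switch (ins.op) {
            case Op::Mov:
                state.registers.at(ins.reg) = ins.value;
                break;
            case Op::PushLs:
                state.local_stack.push_back(state.registers.at(ins.reg));
                break;
            case Op::PushGs:
                state.global_stack.push_back(state.registers.at(ins.reg));
                break;
            case Op::Ret:
                return;
        }
    }
}

// mov r0 $42 / pushw ls r0 / pushw gs r0 / ret
State run_example() {
    const std::vector<Instruction> program = {
        {Op::Mov, 0, 42},
        {Op::PushLs, 0, 0},
        {Op::PushGs, 0, 0},
        {Op::Ret, 0, 0},
    };
    State state;
    run(program, state);
    return state;
}
```

After run_example() returns, register 0 and the top of both stacks hold 42, and at no point did the program have a way to address its own instructions.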

What about File IO and Networking?

Good question. There is no use in a programming language if you can't load data to/from disk, or chat online with it. For these things I made it possible to extend the VM using devices. The use of these gets a bit heady, but in brief, they have their own instruction sets (like drivers) that need to be formed within the ASM itself. What do I mean? The byte code that the devices actually execute has to be written in the Nabla ASM. Right now the devices support basic TCP/UDP and IO to disk / stdin and stdout. The networking portion is in line to be beefed up once I get the high level language developed to a workable state. WAIT. Did I just say high level language? Good transition time.
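One way to picture the device idea is a table of devices that each interpret their own command words, with the program building those words itself. The sketch below is purely illustrative: the Device interface, the TextOutDevice class, the dispatch function, and the command format are all assumptions, not the Nabla VM's actual device API.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Hypothetical device interface; not the Nabla VM's real API.
struct Device {
    virtual ~Device() = default;
    virtual void execute(std::uint64_t command) = 0;
};

// A stand-in for an output device. Its made-up command format is:
// bits 63..56 = operation, bits 7..0 = one character to write.
class TextOutDevice : public Device {
public:
    void execute(std::uint64_t command) override {
        const auto operation = static_cast<std::uint8_t>(command >> 56);
        if (operation == 0x01) {  // 0x01: "write one character"
            buffer_.push_back(static_cast<char>(command & 0xFF));
        }
    }
    const std::string& written() const { return buffer_; }

private:
    std::string buffer_;  // collects output instead of touching real stdout
};

// Route a command word to the device registered under `id`.
// Returns false when no such device exists.
bool dispatch(const std::map<std::uint8_t, std::shared_ptr<Device>>& devices,
              std::uint8_t id, std::uint64_t command) {
    const auto found = devices.find(id);
    if (found == devices.end()) {
        return false;
    }
    found->second->execute(command);
    return true;
}
```

The VM core only needs to know how to hand a word to a device; what the word means is the device's business, which is what makes it feel like writing a driver.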

High Level Language

As I mentioned at the start of this post, I wanted to make a programming language. I guess I kind of did by making the assembly language, but that was more of a means to an end. The end being the high level language I wanted to make. Now, I didn't have an idea of what I wanted the language to look like or feel like. I just wanted to do it. So that's what I'm doing. The language "Del" is currently being developed. It's pretty primitive right now, but here is an example of what it looks like:

def add_one(int some_int) -> int {

    return some_int + 1;
}

def main() -> int {

    int my_int = add_one(41);

    return 0;
}

It isn't too pretty yet, but it's getting there. As of the writing of this post, the language compiles to NablaVM's ASM code and supports ints, doubles, and chars for types, a few different types of loops, and some preprocessor directives to import other files.

Right now everything is under heavy development, but you can see the project at its project page here.

Top comments (2)

Josef Biehler • Jun 26 '20

I bought myself the books "Writing An Interpreter In Go" and "Writing A Compiler In Go" to get an insight into that stuff. I am pretty excited! Maybe we can read more of your work here on dev.to?

Bosley • Jun 27 '20

I've been wanting to get those books for a while, I've heard good things about them! I'll definitely update here on dev.to as the project progresses.