MixusMinimax

Posted on Feb 21, 2022

FALSE compiler in .NET

#esoteric #forth #false #dotnet

The Language False

While not really all that useful, esoteric languages are just fun to play around with, especially when you are part of a small group able to see meaning in a garbled mess of letters.

I'm sure most people have heard of Forth, a stack-oriented programming language from 1970. Operations would pop an amount of elements from the stack, execute some kind of computation, and push the result. Even control structures use the stack.

False is very much similar, except that the language is based on individual characters instead of control words. Even whitespaces can be omitted, making False incredibly compact and unreadable to the uninitiated.

The rules can be read on strlen.com.

As a short summary:

Integer literals push onto the stack. Arithmetic and logic operations pop and push from and to the stack.

Elements on the stack can be stored into and loaded from one of 26 variables.

Most importantly, instructions in square brackets push a lambda onto the stack, which can be stored and/or executed. These lambdas are also used for conditional execution and while loops.

Writing a Compiler

Since the language is so simple, it offers itself as the perfect starting point to dive into compilers. I chose to write it in .NET, as I wanted to play around with C# for a while.

The project consists of a library containing logic, and a Cli project. The Cli registers the services defined in the logic library, so the library itself does not have any dependencies, DI is done from the Cli.

The main components are:

CodeParser.cs
Interpreter.cs
Compiler.cs

Lexing and Parsing

Since False is character-based, the job of CodeParser.cs is incredibly simple, the only "challenges" (not really that hard) were string literals, character literals, comments, and lambdas.

The resulting object contains a dictionary of lambdas (which are just lists of commands), and a dictionary of string literals.

Interpreting

Interpreters of stack-based languages are incredibly simple, especially since most languages and frameworks already have implementations for the necessary data structures, namely the stack.

For example, the + command simply is

stack.Push(stack.Pop() + stack.Pop());

and other commands work in similar fashion.

Control state consists of two values: Program counter and current lambda.

One interesting thing about lambdas is, that there are no arguments or return values. Lambdas simply modify the stack. This means I can not use the same stack for values and for control structures, namely while.

Lambdas can also be executed in a nested fashion, requiring the interpreter to keep track of the lower lambdas and return addresses in the second stack.

While loops pop two lambdas, one for the condition and one for the body. Those are then stored inside a separate stack, since while loops can be nested.

Output needs to be buffered in C# directly, since control characters can not be printed one-by-one. UTF-8 control sequences need to be converted to a printable string at once, so buffering is done on a line-by-line basis.

Type Safety

My interpreter and compiler have configurable type safety, as any sort of safety comes with run-time and memory overhead. The possible types of stack elements are

Number
Reference
Lambda

And the safety levels are:

None
Lambda
Full

For the latter two options, the type of the stack element is pushed onto a separate stack. This is done to avoid messing up alignment of the data stack, since types are just a single byte each.

Full type safety is self-explanatory, the requested type when popping from the stack is verified at run-time, and an exception is thrown if the type is wrong.

However, this approach might be overkill, since using a number as a reference or reference as a number is harmless, especially since references are masked indices into the variable buffer. The important one is Lambdas, especially for the compiler. There, lambdas are just an address to jump to, so one could simply jump to any address if type safety was not enforced.

This is what the second safety level offers: Only validation when trying to pop a lambda from the stack. This often is a good balance between safety and efficiency.

Compiling

I set myself the challenge to create completely standalone assembly, without the use of glibc or llvm. Part of this project was the opportunity to learn about assembly, and Linux system calls.

Sure, transpiling everything to C and simply compiling it with gcc would result in a faster program, but doing that would defeat the entire purpose of this experience.

Compiling False in itself is not that hard, the difficulty comes from trying to make the resulting code fast. Regarding that, a couple of challenges arise:

Memory management
Avoiding redundant instructions, especially loads and stores
Type safety
Buffering I/O

Memory

As with the interpreter, the data stack needs to be kept separate from the actual stack. Lambdas are just function calls, and since lambdas expect the stack to be unmodified, the standard stack (rsp) can not be used.

The stack is allocated using the mmap syscall, with configurable size. The stack grows upwards in memory, since it does not need to approach the heap like the normal stack. Also, untouched memory pages are not actually allocated by the Linux kernel after mmap, making smaller False programs automatically use less memory.

This memory region is automatically freed on program exit, so an explicit unmap system call is unnecessary.

Redundant Instructions

Initially, the stack base and stack index were stored in .data and .bss, respectively. Every pop from the False stack would load these values, potentially modify and store the stack index. The plan was to remove these instructions automatically when writing an optimizer, but I simply decided to use the registers R12-R14 for these values (the type stack also has its own base, more on this later).

This essentially halved execution time. I have to say, seeing improvements like these is incredibly satisfying.

Type Safety

Much like the interpreter, the compiled code maintains a separate stack for types, which is also allocated using mmap. Depending on the safety level, the requested type is validated on pops, and an appropriate error message is printed on failure.

Especially here, where speed is more relevant, I would never use full type safety, with the exception of testing.

Buffering I/O

As we all know, system calls are incredibly expensive. Even fast-syscall takes its sweet time to finish. This is where the real optimization lies, I saw a run-time reduction of 11x in a program that mostly just prints characters.

The difficulty stems from the fact that false has three separate ways to print to the console:

,: Print the top of the stack as a character
.: Print the decimal representation
"foo": Print a string constant to the console

To keep myself short, stdout buffering simply maintains a string buffer that gets filled by all three sources, and flushed when full. It gets complicated when the string literal or decimal number doesn't fit into the remaining space of the buffer, or is longer than the buffer itself, but I won't get into that here. Look at the source code if you're interested.

Closing Thoughts

This project was a lot of fun, and there are still things left to be done, for example optimizing the false code itself. Code like 1 2+ could simply be translated to 3, but writing a general/flexible optimizer that works for all sorts of edge cases poses quite some effort to implement, so I'll leave that one for the future.

I think I have achieved my goal here, which was to learn about compilers, assembly language, and the Linux kernel; I should know enough to tackle a more complex language. Maybe one that has expression trees instead of being stack-based, we'll see.

If you've read this far, thank you! And I'm sure you're just procrastinating, I know it well. In fact, the time I spent on this project could have gone to Uni work.

Links

Syntax Highlighting: vscode/MixusMinimax.false

This also works in Jetbrains's products.

False Interpreter/Compiler:

MixusMinimax / falsedotnet

FALSE Interpreter and Compiler written in .NET

FALSE.NET

FALSE Interpreter and Compiler written in .NET. Compilation only works for Linux binaries, however, you can run the compiler on Windows (it uses bash.exe for assembling and linking).

If you don't have bash.exe, you can also just create the assembly file.

Both the interpreter and compiler have configurable type safety. There are three levels of safety: {NONE, LAMBDA, FULL}.

Read Wouter van Oortmerssen's (The creator of FALSE) website for more: strlen.com/false-language

Also, the esolang wiki: esolang.org/wiki/FALSE

Quick Start

cd into FalseDotNet.Cli.

$ dotnet run help
FalseDotNet.Cli 0.0.8
Copyright (C) 2022 FalseDotNet.Cli

  compile      compile FALSE code.

  interpret    interpret FALSE code.

  help         Display more information on a specific command.

  version      Display version information.

Interpret a program:

cd into FalseDotNet.Cli.

Example:

dotnet run interpret ..\samples\simple.f

Usage:

$ dotnet run help interpret
FalseDotNet.Cli 0.0.8
Copyright (C) 2022 FalseDotNet.Cli
  -i, --input               Read from file instead

…

View on GitHub

DEV Community

FALSE compiler in .NET

The Language False

As a short summary:

Writing a Compiler

Lexing and Parsing

Interpreting

Type Safety

Compiling

Memory

Redundant Instructions

Type Safety

Buffering I/O

Closing Thoughts

Links

MixusMinimax / falsedotnet

FALSE Interpreter and Compiler written in .NET

FALSE.NET

Quick Start

Interpret a program:

Top comments (0)

Read next

Exploring Azure Queue Storage in .NET

Managing Background Jobs with Hangfire in .NET

C# Design Pattern: Flyweight

Safe cleaning NuGet cache.