A while ago I posted a blog post mentioning that I was working on a programming language. Since then, some of you have been kind enough to reach out, so here is an update on what I've been implementing, followed by an overview of the tools.
An overview of the language
Rumi is a low-level language. It supports functions, structs, pointers, arrays, and most of the other basic types you would expect. A typical program might look like this:
import base
MyStruct: struct{
  id: int = 1;
}

MyStruct.print := () -> void{
  printf("MyStruct with id %d\n", self.id);
}

main := () -> int{
  printf("Hello world!\n");
  m: MyStruct;
  m.print();
  return 0;
}
(Sorry for the color scheme, it was the best I could achieve)
And there are two different executables: a compiler, rum, and an interpreter, rumi (rum interpreter!). So far it looks normal, but we have a few tricks up our sleeves:
import base
import compiler
rem := (a: int, b: int) -> int{
  // returns 0 if a is divisible by b, otherwise the remainder
  // Pretend this is hard to calculate!
  return a % b;
}

main := () -> int{
  printf("This would print in runtime\n");
  printf("8 % 4 == %d\n", rem(8,4));
  return 0;
}
@compile
rem_test := () -> int{
  printf("Running tests for rem\n"); // This will run at compile time
  if(rem(8,4)){
    printf("rem(8,4) should be 0\n");
    return 2; // returning a value higher than 1 results in a compile error
  }
  return 0; // This tells the compiler everything is fine
}

@compile
rem_test2 := () -> int{
  printf("Incomplete test, TODO\n"); // This will run at compile time
  return 1; // This will print a compile warning
}
Running the above code, I can get:
$ rum test.rum # this is compile only
Running tests for rem
Incomplete test, TODO
Compile directive compile exited with value 1 on line 35
$ rumi test.rum # this is compile and execution
Running tests for rem
Incomplete test, TODO
Compile directive compile exited with value 1 on line 35
This would print in runtime
8 % 4 == 0
$ ./a.out # this is execution only
This would print in runtime
8 % 4 == 0
As you can probably guess, the @compile directive tells the compiler to run that function at compile time. I demonstrated how to run tests during compilation, but many other things could be done too. Let's look at another example (this one is not pushed to GitHub yet, as I'm still experimenting with it):
@compile
make_verbose := (c: *compile) -> int {
  printf("making main verbose\n");
  // Get the main function
  main_f := c.getfunction("main");
  // A local function that takes a function call and inserts another function call before it
  add_runtime_print := (fc: *function_call) -> void{
    fc.insert_before(c.fcall(printf, "calling %s in main\n", fc.name));
  }
  // Apply the previous function to every function call in main
  main_f.function_calls.map(add_runtime_print);
  return 0;
}
The possibilities are endless! But enough about the language for now; let's look at the tools that made this possible...
Compile Chain
From an academic perspective, a compiler is just a set of mapping tools that converts a stream of characters into a stream of instructions. But digging deeper, this is the compile chain you see in most compilers (including Rumi):
Stream of characters to stream of tokens (lexing with flex)
Stream of characters here means the source code. If you think about it, any source code is just a stream of characters to a computer. But creating instructions from this stream is a hard job, so what should we do?
We use lexers to tokenize this stream. We convert payment = salary * work_hours into the token stream ID = ID * ID. We don't care whether the syntax is correct, or what the meaning of any of these tokens is at this stage; all we care about is converting characters into tokens. I used flex to get this done, and the source code of the lexer is available in this file.
flex converts the .l file into a C program that provides a few functions that are useful in the next part...
Stream of tokens to Abstract Syntax Tree (AST)
Now that we have our token stream, we can think about the syntax. This is where we define how functions are declared, how new variables are created, and so on. I used gnu bison, which takes rules like this:
while_stmt
: WHILE '(' expr ')' stmt {$$=new WhileStatement($3, $5);}
;
where WHILE is a token, expr and stmt are other rules defined in the parser (full source code is here), and finally we define how the AST is created.
The AST is a syntax tree: for each different kind of statement and expression we create a different structure (or class, in my case) where we just store the data. We are not doing any calculations here, just storing it for later. All this rule is saying is that a while statement is a while token, followed by an open parenthesis, an expression, a close parenthesis, and then another statement (which can be a code block!).
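To make that concrete, here is a rough sketch of what such a class could look like (the actual AST classes in the Rumi repo are likely named and laid out differently):

// Hypothetical AST node classes, just enough to mirror the bison rule above.
struct Expression { virtual ~Expression() = default; };
struct Statement  { virtual ~Statement()  = default; };

// WHILE '(' expr ')' stmt  ->  new WhileStatement($3, $5)
struct WhileStatement : Statement {
  Expression *condition;  // the expr between the parentheses
  Statement  *body;       // the stmt after them (possibly a code block)
  WhileStatement(Expression *c, Statement *b) : condition(c), body(b) {}
};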
Let's look at another example; this is how variable declarations are handled:
variable_decl
: ID DEFINE_AND_ASSIGN expr ';' {$$=new VariableDecl($1, NULL, $3);}
| ID ':' type ';' {$$=new VariableDecl($1, $3);}
| ID ':' type '=' expr ';' {$$=new VariableDecl($1,$3,$5);}
;
You can see that there are three different rules here: one for cases like id := 3;, one for cases like age : int;, and finally one for name : string = unnamed;. You can look at the full source code for the parser here.
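Continuing the same hypothetical sketch, all three rules can produce a single node type where any missing piece is simply null:

#include <string>

struct Expression;                                     // as in the sketch above
struct Statement { virtual ~Statement() = default; };
struct Type;                                           // resolved by a later pass

// ID DEFINE_AND_ASSIGN expr ';'  ->  new VariableDecl(name, NULL, expr)
// ID ':' type ';'                ->  new VariableDecl(name, type)
// ID ':' type '=' expr ';'       ->  new VariableDecl(name, type, expr)
struct VariableDecl : Statement {
  std::string name;
  Type       *type;    // null for  id := 3;
  Expression *value;   // null for  age : int;
  VariableDecl(std::string n, Type *t, Expression *v = nullptr)
      : name(std::move(n)), type(t), value(v) {}
};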
bison, similar to flex, generates C (or C++) code for us. The good news is that bison works flawlessly with flex, and just calls it whenever it is looking for the next token.
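For reference, the driver that ties the two generated pieces together is usually tiny. This is the generic flex/bison pattern, not necessarily how Rumi's driver looks:

#include <cstdio>

// These come from the generated lexer and parser. If they were generated as
// plain C and the driver is C++, they need to be wrapped in extern "C".
extern FILE *yyin;      // the input stream flex reads from
extern int   yyparse(); // bison's entry point; it calls flex's yylex() for tokens

int main(int argc, char **argv) {
  if (argc > 1) yyin = fopen(argv[1], "r");  // otherwise flex reads from stdin
  return yyparse();  // runs the grammar actions, which build the AST
}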
AST to code gen
After the parser has finished, we can walk the AST and start generating code. Some compilers, including Rumi, have a stage that goes through and resolves types, handles sizes, etc., to make code generation go smoother. I'm not going to cover that step here; you can look at the compiler.cpp code in the repo for more details.
Now we can finally get into code generation. Fortunately for us, llvm and gcc both have quite powerful, almost high-level, assembly-like languages. I chose llvm for various reasons, but gcc would have worked similarly. These languages are assembly-like, meaning they are not exactly assembly, and they are not limited to one machine. llvm provides utilities to optimize the llvm-ir code (the assembly we mentioned!), to generate object files, and even to run them on command. Our previous AST would be converted to something like this:
define private i64 @successor(i64 %a) {
entry:
  %returnval = alloca i64
  %a1 = alloca i64
  store i64 %a, i64* %a1
  %readtmp = load i64, i64* %a1
  %0 = add i64 %readtmp, 1
  store i64 %0, i64* %returnval
  br label %end

end:                                              ; preds = %entry
  %1 = load i64, i64* %returnval
  ret i64 %1
}
As you can imagine, this is a pretty unoptimized version of saying
successor := (a: int) -> int{
  return a + 1;
}
But the good news is, llvm handles optimization and can convert that to a really nice optimized assembly version (and then to even nicer machine code).
But how do you know what to generate? I use clang test.c -S -emit-llvm -c to get the llvm-ir of some C code, and then figure out how to output the same thing using llvm's IRBuilder.
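As an illustration, here is a small standalone sketch of the IRBuilder calls that produce roughly the IR above. It is simplified and is not the code generator from the Rumi repo:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  llvm::LLVMContext ctx;
  llvm::Module mod("test", ctx);
  llvm::IRBuilder<> builder(ctx);

  // define private i64 @successor(i64 %a)
  llvm::Type *i64 = builder.getInt64Ty();
  auto *fnType = llvm::FunctionType::get(i64, {i64}, false);
  auto *fn = llvm::Function::Create(fnType, llvm::Function::PrivateLinkage,
                                    "successor", mod);
  fn->getArg(0)->setName("a");
  builder.SetInsertPoint(llvm::BasicBlock::Create(ctx, "entry", fn));

  // %a1 = alloca i64 ; spill the argument so it can be read like a variable
  auto *a1 = builder.CreateAlloca(i64, nullptr, "a1");
  builder.CreateStore(fn->getArg(0), a1);

  // load it back, add 1, and return the result
  auto *readtmp = builder.CreateLoad(i64, a1, "readtmp");
  auto *sum = builder.CreateAdd(readtmp, builder.getInt64(1));
  builder.CreateRet(sum);

  mod.print(llvm::outs(), nullptr);  // dump the generated IR
  return 0;
}

Compiling this against llvm (llvm-config can supply the right flags) and running it prints a slightly simpler version of the IR above.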
Let me know what you think about all this; all comments and ideas are welcome!