Ashton Scott Snapp

Posted on Jan 2, 2021 • Edited on Jan 18, 2021

Writing an Assembler in Rust, and How I Got Stuck on the Lexer

#rust #assembler

So, I've been writing an assembler in Rust for an architecture that hasn't been implemented yet (because I made the architecture). You can find it here:

AshtonSnapp / hasm

The Official Cellia Cross-Assembler for Modern Computers

hasm

The Homebrew Assembler. Currently supporting the 16-bit Cellia architecture and the 8-bit ROCKET88 architecture.

Each architecture supported by hasm will be separated into its own module, although every architecture's assembler code will have the same general structure: you have a lexer which takes in files of assembly code and outputs streams of tokens which are fed into a parser which structures those tokens into a file syntax tree. Then the syntax trees are fed into a linker which tries to combine all of these trees into a single program tree, which is finally fed into a binary generator which does exactly what you think.

Right now I'm still trying to implement the assemblers for the two architectures I mentioned earlier, and I've only just now gotten to the parser. It's going to be a pain to write anything that's actually decently capable, but it'll be worth…

So far I have most of the lexer complete. The lexer is the first stage of a compiler, assembler, or interpreter. Your source code goes into the lexer, which walks through it (line by line, in my case) and turns it into tokens. These tokens are then passed on to whatever comes next, usually the parser, and the lexer's job is done.

Right now, the lexer is pretty much the only thing in the assembler. It has a token struct, which contains a content string and a TokenInfo enum with immediate, address, identifier, register, operation, directive, and tab variants. The first four variants also carry data (because enum variants can do that in Rust): ImmediateInfo, AddressInfo, IdentifierType, and RegisterType respectively. ImmediateInfo and AddressInfo are structs while IdentifierType and RegisterType are enums. And then it's just enums all the way down.
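As a rough sketch of how a layout like that can look in Rust: the type and variant names below follow the description above, but every field, the register names, and the describe() helper are made up for illustration. The real definitions are in the repository.

```rust
// Hypothetical payload structs: the `value` fields are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct ImmediateInfo {
    pub value: u16,
}

#[derive(Debug, Clone, PartialEq)]
pub struct AddressInfo {
    pub value: u16,
}

// An identifier is either a label or a symbol.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IdentifierType {
    Label,
    Symbol,
}

// Placeholder register names, not the real Cellia or ROCKET88 registers.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RegisterType {
    A,
    B,
}

// The first four variants carry data; the last three do not.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenInfo {
    Immediate(ImmediateInfo),
    Address(AddressInfo),
    Identifier(IdentifierType),
    Register(RegisterType),
    Operation,
    Directive,
    Tab,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub content: String,
    pub info: TokenInfo,
}

// Hypothetical helper: shows how a later stage can match on the variant
// and reach the data stored inside it.
pub fn describe(token: &Token) -> String {
    match &token.info {
        TokenInfo::Immediate(info) => format!("immediate {}", info.value),
        TokenInfo::Address(info) => format!("address {:#06x}", info.value),
        TokenInfo::Identifier(kind) => format!("identifier ({:?})", kind),
        TokenInfo::Register(reg) => format!("register {:?}", reg),
        TokenInfo::Operation => "operation".to_string(),
        TokenInfo::Directive => "directive".to_string(),
        TokenInfo::Tab => "tab".to_string(),
    }
}
```

The nice part of carrying data in the variant is that a match has to handle every variant, so the compiler complains if a new token kind is added and a match somewhere forgets about it.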

Besides the actual tokens we have a bunch of helper functions: immediate(), address(), identifier(), and register(), one for each TokenInfo variant with an argument. Operations, directives, and tabs are handled in the main lexer functions. But I won't bore ya with all the details here - I linked the GitHub repository above.

One problem(?) with how I've implemented the lexer is that, while lex() is the function called by main(), lex() calls a function called tokenize_line() for each line of the source file. This function only has the context of the line it is working on, which makes identifiers hard to classify. See, when an identifier is being defined, there is context in the current line that tells the lexer what kind of identifier it is (either a label or a symbol). This context does not exist when an identifier is merely being used (i.e. as an argument to an operation, or as an argument to a directive other than the symbol definition directive).
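To make the problem concrete, here is a stripped-down sketch of that structure. The syntax is entirely assumed for the example (a trailing `:` defines a label, a made-up `.sym` directive defines a symbol, identifiers start with a lowercase letter or underscore), and the simplified Kind enum is not the real TokenInfo.

```rust
// Simplified token kinds for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Label,
    Symbol,
    Unresolved, // an identifier used on a line that does not define it
    Other,      // operations, directives, numbers, ...
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub content: String,
    pub kind: Kind,
}

// Sees exactly one line, so it can only classify an identifier when the
// definition syntax happens to be on that same line.
pub fn tokenize_line(line: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut after_sym = false; // was the previous word the `.sym` directive?
    for word in line.split_whitespace() {
        let (content, kind) = if let Some(name) = word.strip_suffix(':') {
            (name, Kind::Label) // assumed syntax: `name:` defines a label
        } else if after_sym {
            (word, Kind::Symbol) // assumed syntax: `.sym name value`
        } else if word.starts_with(|c: char| c.is_ascii_lowercase() || c == '_') {
            (word, Kind::Unresolved) // label or symbol? This line cannot say.
        } else {
            (word, Kind::Other)
        };
        after_sym = word == ".sym";
        tokens.push(Token { content: content.to_string(), kind });
    }
    tokens
}

// One inner vector of tokens per source line.
pub fn lex(source: &str) -> Vec<Vec<Token>> {
    source.lines().map(tokenize_line).collect()
}
```

Lexing the two lines `JMP end` and `end:` with this sketch leaves the `end` on the first line as Kind::Unresolved, even though the very next line defines it as a label.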

I have been trying to implement a way for the lex() function to fix this, but so far my efforts have been in vain. As an alternative, I could give tokenize_line() more context in the form of a vector of known identifiers. That would work for symbols, because symbols have to be defined before they are first used. It would not be adequate for labels, though, because a label can be defined anywhere in the file, including after the line where it is first used.
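One common way around forward references like this is to let the per-line pass leave unknown identifiers unclassified, then patch them once the whole file has been tokenized. The sketch below is not what hasm does; the Kind enum, the Token fields, and resolve_identifiers() are all hypothetical. It also does not check that symbols are defined before use, which would need line numbers on top of this.

```rust
use std::collections::HashMap;

// Simplified token kinds for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Label,
    Symbol,
    Unresolved, // identifier whose kind the per-line pass could not determine
    Other,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub content: String,
    pub kind: Kind,
}

// Runs after every line has been tokenized. Returns the names of
// identifiers that are used somewhere but never defined.
pub fn resolve_identifiers(lines: &mut [Vec<Token>]) -> Vec<String> {
    // Pass 1: at this point only definitions are tagged Label or Symbol,
    // so collect them from the whole file, wherever they appear.
    let mut known: HashMap<String, Kind> = HashMap::new();
    for token in lines.iter().flatten() {
        if matches!(token.kind, Kind::Label | Kind::Symbol) {
            known.insert(token.content.clone(), token.kind);
        }
    }

    // Pass 2: give every unresolved use the kind of its definition.
    let mut undefined = Vec::new();
    for token in lines.iter_mut().flatten() {
        if token.kind == Kind::Unresolved {
            match known.get(&token.content) {
                Some(kind) => token.kind = *kind,
                None => undefined.push(token.content.clone()),
            }
        }
    }
    undefined
}
```

Because the table of definitions is built from the whole file before any use is patched, a label defined on the last line can still resolve a jump on the first line.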

I want to finish up with the lexer and work on something else, but if I can't get this working I can't finish the lexer. Which is why, as I said in the title, I'm stuck on the lexer.

P.S. While looking up stuff about lexers and Rust I came across a crate called Logos that makes really fast lexers. Considering I already have most of the lexer implemented I'm probably not going to refactor my code to work with this crate (because it would be a lot of code to refactor and it would make a lot of time spent working on this lexer meaningless).

Le Edit: Making some progress. When indexing a multi-dimensional array/vector, is the first index for the main array/vector or the sub-array/sub-vector?
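For what it's worth, with a nested vector in Rust the first index picks an element of the outer (main) vector and the second index picks an element inside that sub-vector, because `v[i][j]` is evaluated as `(v[i])[j]`. A small sketch, assuming one inner vector of tokens per source line (the token_at() helper is hypothetical):

```rust
// Look up one token by (line, position). Uses get() instead of [] so a
// bad index returns None rather than panicking.
pub fn token_at(lines: &[Vec<String>], line: usize, position: usize) -> Option<&str> {
    lines
        .get(line)?        // first index: which inner vector (which line)
        .get(position)     // second index: which element of that line
        .map(String::as_str)
}
```

So with `lines = [["JMP", "end"], ["end:"]]`, `lines[0][1]` is `"end"`, `lines[1][0]` is `"end:"`, and `lines[1][1]` would panic because the second line only has one token.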