DEV Community


Writing an Assembler in Rust, and How I Got Stuck on the Lexer

Ashton Scott Snapp

So, I've been writing an assembler in Rust for an architecture that hasn't been implemented yet (because I made the architecture). You can find it here:

AshtonSnapp / chasm

The Official Cellia Cross-Assembler for Modern Computers



So far I have most of the lexer complete. The lexer is the first stage of a compiler, assembler, interpreter, or pretty much anything else that reads source code. Your source code goes into the lexer, which goes through it (line by line, in my case) and turns it into tokens. These tokens are then passed on to whatever comes next, usually the parser, and the lexer's job is done.

Right now, the lexer is pretty much the only thing in the assembler. It has a Token struct, which contains a content string and a TokenInfo enum with immediate, address, identifier, register, operation, directive, and tab variants. The first four variants also carry data (because enum variants can do that in Rust): ImmediateInfo, AddressInfo, IdentifierType, and RegisterType respectively. ImmediateInfo and AddressInfo are structs, while IdentifierType and RegisterType are enums. And then it's just enums all the way down.
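To make that layout concrete, here is a rough sketch of what such a data model might look like. Only the type names (Token, TokenInfo, ImmediateInfo, AddressInfo, IdentifierType, RegisterType) come from the description above. Every field name, the RegisterType variants, and the describe() function are assumptions made for illustration, and the real chasm source may look quite different.

```rust
// Sketch only: field names and RegisterType variants are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum IdentifierType {
    Label,
    Symbol,
}

#[derive(Debug, Clone, PartialEq)]
pub enum RegisterType {
    General, // hypothetical variant
    Special, // hypothetical variant
}

#[derive(Debug, Clone, PartialEq)]
pub struct ImmediateInfo {
    pub value: u64, // hypothetical field
}

#[derive(Debug, Clone, PartialEq)]
pub struct AddressInfo {
    pub value: u64, // hypothetical field
}

// Four variants carry data, three do not.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenInfo {
    Immediate(ImmediateInfo),
    Address(AddressInfo),
    Identifier(IdentifierType),
    Register(RegisterType),
    Operation,
    Directive,
    Tab,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub content: String,
    pub info: TokenInfo,
}

/// Hypothetical helper: describes a token by matching on the data its variant carries.
pub fn describe(token: &Token) -> String {
    match &token.info {
        TokenInfo::Immediate(imm) => format!("immediate {}", imm.value),
        TokenInfo::Address(addr) => format!("address {:#x}", addr.value),
        TokenInfo::Identifier(IdentifierType::Label) => format!("label {}", token.content),
        TokenInfo::Identifier(IdentifierType::Symbol) => format!("symbol {}", token.content),
        TokenInfo::Register(_) => format!("register {}", token.content),
        TokenInfo::Operation => format!("operation {}", token.content),
        TokenInfo::Directive => format!("directive {}", token.content),
        TokenInfo::Tab => "tab".to_string(),
    }
}
```

Because the match has to cover every variant, adding a new kind of token later makes the compiler point out each place that needs updating.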

Besides the actual tokens we have a bunch of helper functions: immediate(), address(), identifier(), and register(), one for each TokenInfo variant that carries data. Operations, directives, and tabs are handled in the main lexer functions. But I won't bore ya with all the details here - I linked the GitHub repository above.

One problem(?) with how I've implemented the lexer is that, while lex() is the function called by main(), lex() calls a function called tokenize_line() for each line of the source file. This function only has the context of the line it is working on, which makes identifiers a bit difficult to handle. See, when an identifier is being defined, there is context in the current line that tells the lexer what kind of identifier it is (either a label or a symbol). This context does not exist when an identifier is being used (i.e. as an argument to a directive other than the symbol definition directive, or as an argument to an operation).
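The problem can be reproduced in a few lines. This is a minimal sketch, not the actual chasm code: the function names lex() and tokenize_line() come from the post, but the Ident type, the IdentKind::Unknown placeholder, and the assembly syntax (a trailing `:` defines a label, `.sym NAME` defines a symbol, a leading `@` marks a use) are all invented for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum IdentKind {
    Label,
    Symbol,
    Unknown, // hypothetical placeholder: used here, but not defined on this line
}

#[derive(Debug, Clone, PartialEq)]
pub struct Ident {
    pub name: String,
    pub kind: IdentKind,
}

/// Classifies the identifiers on one line using only that line's text.
/// Hypothetical syntax: `name:` defines a label, `.sym NAME` defines a symbol,
/// and a word starting with `@` is an identifier being used.
pub fn tokenize_line(line: &str) -> Vec<Ident> {
    let words: Vec<&str> = line.split_whitespace().collect();
    let mut idents = Vec::new();
    for (i, word) in words.iter().enumerate() {
        if let Some(name) = word.strip_suffix(':') {
            // The colon on this line tells us it is a label definition.
            idents.push(Ident { name: name.to_string(), kind: IdentKind::Label });
        } else if i > 0 && words[i - 1] == ".sym" {
            // The preceding directive tells us it is a symbol definition.
            idents.push(Ident { name: word.to_string(), kind: IdentKind::Symbol });
        } else if let Some(name) = word.strip_prefix('@') {
            // A use: nothing on this line says whether it is a label or a symbol.
            idents.push(Ident { name: name.to_string(), kind: IdentKind::Unknown });
        }
    }
    idents
}

/// Tokenizes a whole source file one line at a time.
pub fn lex(source: &str) -> Vec<Vec<Ident>> {
    source.lines().map(tokenize_line).collect()
}
```

With the input `"jmp @end\nend:"`, the first line yields `end` with kind Unknown and the second yields `end` with kind Label, which is exactly the gap described above.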

I have been trying to implement a way for the lex() function to fix this, but so far my efforts have been in vain. As an alternative I could give tokenize_line() more context, in the form of a vector of known identifiers. Symbols have to be defined before they are first used, so this would work for symbols. But a label can be defined anywhere in the file, even after the line where it is first used, so a list of identifiers seen so far would be inadequate for labels.
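One common way around the forward-reference problem is to leave uses unresolved while tokenizing and then make a second pass once every line has been seen. The following is only a sketch of that idea under stated assumptions, not something taken from the chasm repository: the Ident type, the IdentKind::Unknown placeholder, and resolve_identifiers() are all hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum IdentKind {
    Label,
    Symbol,
    Unknown, // hypothetical placeholder for a use that is not yet resolved
}

#[derive(Debug, Clone, PartialEq)]
pub struct Ident {
    pub name: String,
    pub kind: IdentKind,
}

/// Resolves `Unknown` identifiers after every line has been tokenized.
/// Returns the names that were used but never defined anywhere.
pub fn resolve_identifiers(lines: &mut [Vec<Ident>]) -> Vec<String> {
    // Pass 1: record every definition, wherever in the file it appears.
    let mut known: HashMap<String, IdentKind> = HashMap::new();
    for ident in lines.iter().flatten() {
        if ident.kind != IdentKind::Unknown {
            known.insert(ident.name.clone(), ident.kind);
        }
    }

    // Pass 2: fill in each use from the table built in pass 1.
    let mut undefined = Vec::new();
    for ident in lines.iter_mut().flatten() {
        if ident.kind == IdentKind::Unknown {
            match known.get(&ident.name) {
                Some(kind) => ident.kind = *kind,
                None => undefined.push(ident.name.clone()),
            }
        }
    }
    undefined
}
```

This sketch does not enforce the rule that symbols must be defined before their first use. That check would need the line number of each definition and use, which is left out here to keep the example short.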

I want to finish up with the lexer and work on something else, but if I can't get this working I can't finish the lexer. Which is why, as I said in the title, I'm stuck on the lexer.

P.S. While looking up stuff about lexers and Rust I came across a crate called Logos that makes really fast lexers. Considering I already have most of the lexer implemented I'm probably not going to refactor my code to work with this crate (because it would be a lot of code to refactor and it would make a lot of time spent working on this lexer meaningless).

Le Edit: Making some progress. When indexing a multi-dimensional array/vector, is the first index for the main array/vector or the sub-array/sub-vector?
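For a nested vector in Rust, the first index selects an element of the outer vector and the second index selects within that inner vector. A small sketch, assuming the outer vector holds one inner vector of tokens per source line (sample_lines() and token_at() are names made up for this example):

```rust
/// Hypothetical sample data: one inner vector of tokens per source line.
pub fn sample_lines() -> Vec<Vec<&'static str>> {
    vec![
        vec!["ld", "a", "#1"], // line 0
        vec!["jmp", "loop"],   // line 1
    ]
}

/// Returns the token at `token_index` within the line at `line_index`,
/// or `None` if either index is out of range.
pub fn token_at<'a>(
    lines: &[Vec<&'a str>],
    line_index: usize,
    token_index: usize,
) -> Option<&'a str> {
    // The outer vector is indexed first, then the inner one.
    lines.get(line_index)?.get(token_index).copied()
}
```

So `lines[1][0]` is the first token of the second line (`"jmp"` in the sample data). Direct indexing like that panics when an index is out of range, while the `get()` chain in token_at() returns `None` instead.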
