DEV Community

Ashton Scott Snapp

Writing an Assembler in Rust, and How I'm Reworking the Lexer (again)

Hello people! It's been a bit. Two months or so, to be specific. But I'm back, and I'm still working on this project. Obligatory GitHub reference:

AshtonSnapp / chasm

The Official Cellia Cross-Assembler for Modern Computers


Building chasm

Clone this repository to your local machine, cd into the chasm directory, and run cargo build. Simple!

And now for something you've probably heard before: I'm reworking the lexer. I'm not switching away from the logos crate, don't worry. I just realized that there's a better way to implement some of the token variants and their callbacks, and also I need to figure out how to handle in-assembly operators.

First, let's have an example - addresses. The way addresses used to work in the lexer is that there was an AddressInfo struct that contained the address type, the number base, and the value as a signed 32-bit integer. However, I realized two things: first, we don't need to remember what base the number was in, and second, we can pair the number type with the address type. So relative addresses can use an i16 while absolute addresses can use a u32 (because there's no u24 type).

This has resulted in the creation of the AddressType enum, in which each variant carries an integer whose size depends on the address type. This is what it looks like in the code:

pub enum AddressType {
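Since the snippet above is cut off, here is a minimal sketch of what an enum along these lines could look like. Only the i16 for relative addresses and the u32 for absolute addresses come from the description above. The variant names, the port and direct page widths, and the `absolute` helper are assumptions made for illustration, not the actual chasm code.

```rust
/// Hypothetical sketch: each variant carries an integer sized for its address kind.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AddressType {
    /// Absolute address, stored in a u32 because Rust has no u24 type.
    Absolute(u32),
    /// Signed offset from the instruction pointer (variant name assumed).
    InstructionRelative(i16),
    /// Signed offset from the stack pointer (variant name assumed).
    StackRelative(i16),
    /// Port address (the u16 width is an assumption).
    Port(u16),
    /// Direct page address (the u8 width is an assumption).
    DirectPage(u8),
}

impl AddressType {
    /// Largest value that fits in 24 bits.
    pub const MAX_ABSOLUTE: u32 = 0x00FF_FFFF;

    /// Builds an absolute address, rejecting values that would not fit in a u24.
    pub fn absolute(value: u32) -> Option<Self> {
        if value <= Self::MAX_ABSOLUTE {
            Some(AddressType::Absolute(value))
        } else {
            None
        }
    }
}
```

Because a u32 can hold more than 24 bits, a range check like the one in `absolute` is one way to keep the stand-in type honest.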

Next we have the work-in-progress that is the address callback. Now that I know logos only returns the text that triggered the callback, the code can be a lot simpler. Also, strip_prefix and strip_suffix are now in stable Rust, so I don't have to use replacen. Yay!
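To show why strip_prefix and strip_suffix are nicer here, this is a small standard-library-only comparison. Both function names are made up for the example, and the replacen version is only a guess at how such code might have looked before, not the old chasm code.

```rust
/// Returns the text between a leading '(' and a trailing ')',
/// or None if either parenthesis is missing. No allocation needed.
pub fn unwrap_parens(slice: &str) -> Option<&str> {
    slice.strip_prefix('(')?.strip_suffix(')')
}

/// A replacen-based version for comparison (hypothetical).
/// It always allocates a new String and cannot report whether
/// anything was actually removed.
pub fn unwrap_parens_replacen(slice: &str) -> String {
    slice.replacen('(', "", 1).replacen(')', "", 1)
}
```

The strip_* version returns an Option, so "was this wrapped in parentheses?" and "what was inside?" are answered in one step. The replacen version quietly accepts an unbalanced input like `($1F`.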

The callback works quite simply. It takes the token slice and checks it for certain starting or ending characters. A pair of parentheses indicates an indirect address, for example. If it starts with either IP or SP, that's a relative address, with the letter before the P indicating which pointer it's relative to. An ending p indicates a port address, and an ending d indicates a direct page address. Then, a $ indicates hexadecimal and a % indicates binary. Simple.
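The checks described above could be sketched roughly like this, as a plain function over the token slice rather than a real logos callback. Everything here is an assumption for illustration: the `Address` enum (repeated in compact form so the block compiles on its own), the integer widths other than i16 and u32, the `IP+$10` style of signed offset, unprefixed numbers being decimal, and `%` being treated as binary in the usual assembler convention (swap the radix if chasm differs).

```rust
use std::convert::TryFrom;

/// Hypothetical address kinds, only for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Address {
    Absolute(u32),
    Indirect(u32),
    InstructionRelative(i16),
    StackRelative(i16),
    Port(u16),
    DirectPage(u8),
}

/// Parses an unsigned number: `$` = hex, `%` = binary (assumed), no prefix = decimal (assumed).
pub fn parse_number(text: &str) -> Option<u32> {
    let (radix, digits) = if let Some(hex) = text.strip_prefix('$') {
        (16, hex)
    } else if let Some(bin) = text.strip_prefix('%') {
        (2, bin)
    } else {
        (10, text)
    };
    u32::from_str_radix(digits, radix).ok()
}

/// Parses a signed offset such as "+$10" or "-4" and checks that it fits in an i16.
pub fn parse_offset(text: &str) -> Option<i16> {
    let (negative, magnitude) = match text.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, text.strip_prefix('+').unwrap_or(text)),
    };
    let value = i64::from(parse_number(magnitude)?);
    i16::try_from(if negative { -value } else { value }).ok()
}

/// Classifies a token slice by its leading and trailing characters.
pub fn parse_address(slice: &str) -> Option<Address> {
    // Parentheses around the whole token mean an indirect address.
    if let Some(inner) = slice.strip_prefix('(').and_then(|s| s.strip_suffix(')')) {
        return parse_number(inner).map(Address::Indirect);
    }
    // IP or SP at the start means a pointer-relative address.
    if let Some(rest) = slice.strip_prefix("IP") {
        return parse_offset(rest).map(Address::InstructionRelative);
    }
    if let Some(rest) = slice.strip_prefix("SP") {
        return parse_offset(rest).map(Address::StackRelative);
    }
    // A trailing p means a port address.
    if let Some(rest) = slice.strip_suffix('p') {
        return parse_number(rest)
            .and_then(|v| u16::try_from(v).ok())
            .map(Address::Port);
    }
    // A trailing lowercase d means a direct page address. In this sketch the
    // suffix wins even after hex digits, so "$1d" is direct page 1, not 0x1D.
    if let Some(rest) = slice.strip_suffix('d') {
        return parse_number(rest)
            .and_then(|v| u8::try_from(v).ok())
            .map(Address::DirectPage);
    }
    // Anything else is an absolute address, which must fit in 24 bits.
    parse_number(slice)
        .filter(|v| *v <= 0x00FF_FFFF)
        .map(Address::Absolute)
}
```

In a real logos callback the slice would come from the lexer itself, and returning None on a malformed token is one simple way to signal an error.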

This code is still majorly a work in progress, but I plan to commit to the GitHub repo once I get it to a certain point. Until then, y'all have an awesome day!
