Ashton Scott Snapp

Writing an Assembler in Rust, and How I'm Redoing the Lexer

I've kept working on my assembler. I finished the lexer, got it to compile, ran and tested it, and found that it simply did not work. Luckily, I found a crate called logos that helps you build a fast lexer, so I'm using it to redo the lexer.

AshtonSnapp / hasm

The Official Cellia Cross-Assembler for Modern Computers

hasm

The Homebrew Assembler. Currently supporting the 16-bit Cellia architecture and the 8-bit ROCKET88 architecture.

Each architecture supported by hasm will be separated into its own module, although every architecture's assembler code will have the same general structure: you have a lexer which takes in files of assembly code and outputs streams of tokens which are fed into a parser which structures those tokens into a file syntax tree. Then the syntax trees are fed into a linker which tries to combine all of these trees into a single program tree, which is finally fed into a binary generator which does exactly what you think.

Right now I'm still trying to implement the assemblers for the two architectures I mentioned earlier, and I've only just now gotten to the parser. It's going to be a pain to write anything that's actually decently capable, but it'll be worth…

Right now I'm writing some callback functions. Specifically, I'm writing the one responsible for handling character immediates (immediate in the sense that the processor doesn't have to fetch an address and then fetch the value from that address; it just fetches the value). This is being done via a giant match statement. Here's basically what the code for this function looks like right now:

fn char(lex: &mut Lexer<Token>) -> Result<u8, ()> {
    let slice: &str = lex.slice();

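    // Take the one-byte substring just before the closing quote. Indexing a
    // &str with a single usize does not compile, so use a checked byte range.
    let poss_char: &str = slice
        .len()
        .checked_sub(2)
        .and_then(|start| slice.get(start..start + 1))
        .ok_or(())?;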

    // Welcome to hell.

    match poss_char {
        "\x00" => Ok(0),
        "\x01" => Ok(1),
        "\x02" => Ok(2),
        "\x03" => Ok(3),
        "\x04" => Ok(4),
        "\x05" => Ok(5),
        "\x06" => Ok(6),
        "\x07" => Ok(7),
        "\x08" => Ok(8),
        "\x09" => Ok(9),
        "\x0A" => Ok(10),
        "\x0B" => Ok(11),
        "\x0C" => Ok(12),
        "\x0D" => Ok(13),
        "\x0E" => Ok(14),
        "\x0F" => Ok(15),
        "\x10" => Ok(16),
        "\x11" => Ok(17),
        "\x12" => Ok(18),
        "\x13" => Ok(19),
        "\x14" => Ok(20),
        "\x15" => Ok(21),
        "\x16" => Ok(22),
        "\x17" => Ok(23),
        "\x18" => Ok(24),
        "\x19" => Ok(25),
        "\x1A" => Ok(26),
        "\x1B" => Ok(27),
        "\x1C" => Ok(28),
        "\x1D" => Ok(29),
        "\x1E" => Ok(30),
        "\x1F" => Ok(31),
        "\x20" => Ok(32),
        "\x21" => Ok(33),
        "\x22" => Ok(34),
        "\x23" => Ok(35),
        "\x24" => Ok(36),
        "\x25" => Ok(37),
        "\x26" => Ok(38),
        "\x27" => Ok(39),
        "\x28" => Ok(40),
        "\x29" => Ok(41),
        "\x2A" => Ok(42),
        "\x2B" => Ok(43),
        "\x2C" => Ok(44),
        "\x2D" => Ok(45),
        "\x2E" => Ok(46),
        "\x2F" => Ok(47),
        "\x30" => Ok(48),
        "\x31" => Ok(49),
        "\x32" => Ok(50),
        "\x33" => Ok(51),
        "\x34" => Ok(52),
        "\x35" => Ok(53),
        "\x36" => Ok(54),
        "\x37" => Ok(55),
        "\x38" => Ok(56),
        "\x39" => Ok(57),
        "\x3A" => Ok(58),
        "\x3B" => Ok(59),
        "\x3C" => Ok(60),
        "\x3D" => Ok(61),
        "\x3E" => Ok(62),
        "\x3F" => Ok(63),
        "\x40" => Ok(64),
        "\x41" => Ok(65),
        "\x42" => Ok(66),
        "\x43" => Ok(67),
        "\x44" => Ok(68),
        "\x45" => Ok(69),
        "\x46" => Ok(70),
        "\x47" => Ok(71),
        "\x48" => Ok(72),
        "\x49" => Ok(73),
        "\x4A" => Ok(74),
        "\x4B" => Ok(75),
        "\x4C" => Ok(76),
        "\x4D" => Ok(77),
        "\x4E" => Ok(78),
        "\x4F" => Ok(79),
        "\x50" => Ok(80),
        "\x51" => Ok(81),
        "\x52" => Ok(82),
        "\x53" => Ok(83),
        "\x54" => Ok(84),
        "\x55" => Ok(85),
        "\x56" => Ok(86),
        "\x57" => Ok(87),
        "\x58" => Ok(88),
        "\x59" => Ok(89),
        "\x5A" => Ok(90),
        "\x5B" => Ok(91),
        "\x5C" => Ok(92),
        "\x5D" => Ok(93),
        "\x5E" => Ok(94),
        "\x5F" => Ok(95),
        "\x60" => Ok(96),
        "\x61" => Ok(97),
        "\x62" => Ok(98),
        "\x63" => Ok(99),
        "\x64" => Ok(100),
        "\x65" => Ok(101),
        "\x66" => Ok(102),
        "\x67" => Ok(103),
        "\x68" => Ok(104),
        "\x69" => Ok(105),
        "\x6A" => Ok(106),
        "\x6B" => Ok(107),
        "\x6C" => Ok(108),
        "\x6D" => Ok(109),
        "\x6E" => Ok(110),
        "\x6F" => Ok(111),
        "\x70" => Ok(112),
        "\x71" => Ok(113),
        "\x72" => Ok(114),
        "\x73" => Ok(115),
        "\x74" => Ok(116),
        "\x75" => Ok(117),
        "\x76" => Ok(118),
        "\x77" => Ok(119),
        "\x78" => Ok(120),
        "\x79" => Ok(121),
        "\x7A" => Ok(122),
        "\x7B" => Ok(123),
        "\x7C" => Ok(124),
        "\x7D" => Ok(125),
        "\x7E" => Ok(126),
        "\x7F" => Ok(127),
        _ => Err(())
    }
}

Yes, I had to write all of that, because I can't guarantee that whoever is using the assembler has Unicode support in their program. That whole function was painful to write. At least now, the only callback functions I have left to write are the ones for character escape sequences, strings, addresses, immediates, identifiers (labels and symbols), and actual instruction mnemonics. (Also, I need to stop trying to Ctrl+S while using a browser.)
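For comparison, the same ASCII-only check can probably be written without the table. This is a minimal sketch and not the code in hasm: ascii_char_value is a hypothetical helper that takes the matched text directly rather than a logos Lexer, and it assumes the slice ends with the character followed by a closing quote, as in 'A'.

```rust
/// Hypothetical helper (not part of hasm): returns the value of a character
/// immediate such as `'A'`, rejecting anything outside 7-bit ASCII.
/// Assumes the matched slice ends with the character and a closing quote.
fn ascii_char_value(slice: &str) -> Result<u8, ()> {
    // Walk the slice backwards: first the closing quote, then the character.
    let mut rev = slice.chars().rev();
    rev.next().ok_or(())?; // skip the closing quote
    let c = rev.next().ok_or(())?;

    // `is_ascii` is true exactly for U+0000..=U+007F, the same range the
    // 128-arm match covers.
    if c.is_ascii() {
        Ok(c as u8)
    } else {
        Err(())
    }
}

fn main() {
    println!("{:?}", ascii_char_value("'A'"));
    println!("{:?}", ascii_char_value("'é'"));
}
```

Inside a logos callback, the body would stay the same with lex.slice() passed in as the argument.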

Top comments (1)

Ashton Scott Snapp

Quick edit notice: all character-related callbacks are complete. The char function in this post was edited to reflect its current state, and char_esc_hex is now hell 2: electric boogaloo, since it contains all 256 different char values.
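A 256-arm table for hex escapes could likely be replaced by parsing the two hex digits directly. The sketch below is an assumption-heavy illustration, not the actual char_esc_hex: hex_escape_value is a hypothetical name, and the \xNN escape syntax (backslash, x, exactly two hex digits) is assumed rather than taken from hasm.

```rust
/// Hypothetical helper (not hasm's char_esc_hex): parses an escape written
/// as `\xNN`, where NN is exactly two hex digits, into its byte value.
/// The `\xNN` syntax itself is an assumption made for this sketch.
fn hex_escape_value(escape: &str) -> Result<u8, ()> {
    // Require the literal two-character prefix `\x`.
    let digits = escape.strip_prefix("\\x").ok_or(())?;

    // Exactly two hex digits; this also rules out a leading `+`, which
    // `from_str_radix` would otherwise accept.
    if digits.len() != 2 || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(());
    }

    u8::from_str_radix(digits, 16).map_err(|_| ())
}

fn main() {
    println!("{:?}", hex_escape_value("\\x41"));
    println!("{:?}", hex_escape_value("\\xZZ"));
}
```

Two hex digits cover all 256 byte values, so no separate range check is needed after parsing.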