Opium ver. K

Simple Lexer in Rust

Let's Get Started

This is a simple lexer written in Rust that can tokenize arithmetic expressions containing numbers and the +, -, *, and / operators.

Token

The Token enum represents the different types of tokens that can be produced by the lexer. It has five variants: Number(i32), Plus, Minus, Multiply, and Divide.

#[derive(Debug, PartialEq)]
pub enum Token {
    Number(i32),
    Plus,
    Minus,
    Multiply,
    Divide,
}

The Number(i32) variant represents a number token and contains an integer value. The other variants represent the four arithmetic operators.
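To see how the variants are consumed, here is a small sketch that matches on a token and renders it back as source text. The helper name token_text is hypothetical and not part of the lexer. The derived Debug and PartialEq are what allow tokens to be printed and compared in assertions.

```rust
#[derive(Debug, PartialEq)]
pub enum Token {
    Number(i32),
    Plus,
    Minus,
    Multiply,
    Divide,
}

/// Hypothetical helper (not part of the article's lexer):
/// renders a token back into the text it was lexed from.
pub fn token_text(token: &Token) -> String {
    match token {
        // The payload of Number is bound to `n` by the pattern.
        Token::Number(n) => n.to_string(),
        Token::Plus => "+".to_string(),
        Token::Minus => "-".to_string(),
        Token::Multiply => "*".to_string(),
        Token::Divide => "/".to_string(),
    }
}
```

Because the match lists every variant, the compiler will flag this function if a new variant is added to Token later.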

Lexer

The Lexer struct represents the lexer itself. It has one field, chars, which is an iterator over the characters of the input string.

use std::str::Chars;

pub struct Lexer<'a> {
    // Iterator over the characters of the borrowed input string.
    chars: Chars<'a>,
}

The lifetime parameter 'a ties the lexer to the input string it borrows: the Chars iterator holds a reference into that string, so the lexer cannot outlive it.
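The borrow can be made visible with a short sketch. The rest method and the borrow_demo function are hypothetical additions for illustration only. Chars::as_str is a standard library method that returns the unread part of the input with the same lifetime 'a.

```rust
use std::str::Chars;

pub struct Lexer<'a> {
    chars: Chars<'a>,
}

impl<'a> Lexer<'a> {
    pub fn new(input: &'a str) -> Self {
        Lexer { chars: input.chars() }
    }

    /// Hypothetical helper: the not-yet-consumed remainder of the input.
    /// It is a slice of the original string, so it carries lifetime 'a.
    pub fn rest(&self) -> &'a str {
        self.chars.as_str()
    }
}

/// Hypothetical demo: `source` owns the text, the lexer only borrows it.
pub fn borrow_demo() -> usize {
    let source = String::from("12+3");
    let lexer = Lexer::new(&source);
    // `source` must stay alive for as long as `lexer` is used.
    lexer.rest().len()
}
```

Moving or dropping source while lexer is still in use would be rejected by the borrow checker, which is the guarantee the 'a parameter expresses.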

new

The new method creates a new instance of the lexer with a given input string.

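impl<'a> Lexer<'a> {
    pub fn new(input: &'a str) -> Self {
        Lexer { chars: input.chars() }
    }
    // tokenize and next_token, described in the following sections,
    // belong to this same impl block.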

tokenize

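The tokenize method tokenizes the input string and returns a vector of tokens. It repeatedly calls the private method next_token and collects each token until next_token returns None. Note that None is returned both at the end of the input and at the first unrecognized character, so anything after an unrecognized character is silently dropped.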

    pub fn tokenize(&mut self) -> Vec<Token> {
        let mut tokens = Vec::new();
        while let Some(token) = self.next_token() {
            tokens.push(token);
        }
        tokens
    }
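Since tokenize cannot tell the caller whether it stopped at the end of the input or at a character it did not understand, a stricter variant may be useful. The following is only a sketch of one possible design: try_tokenize and LexError are hypothetical names, it is written as a free function for brevity, and it treats whitespace as a separator, which is an assumption.

```rust
#[derive(Debug, PartialEq)]
pub enum Token {
    Number(i32),
    Plus,
    Minus,
    Multiply,
    Divide,
}

/// Hypothetical error type: the offending character and its byte offset.
#[derive(Debug, PartialEq)]
pub struct LexError {
    pub ch: char,
    pub offset: usize,
}

/// Hypothetical strict variant of tokenize: returns an error instead of
/// silently stopping at the first unrecognized character.
pub fn try_tokenize(input: &str) -> Result<Vec<Token>, LexError> {
    let mut tokens = Vec::new();
    let mut chars = input.char_indices();
    while let Some((offset, ch)) = chars.next() {
        let token = match ch {
            '+' => Token::Plus,
            '-' => Token::Minus,
            '*' => Token::Multiply,
            '/' => Token::Divide,
            c if c.is_ascii_digit() => {
                let mut number = (c as u8 - b'0') as i32;
                // Peek by cloning the iterator, as the article's lexer does.
                while let Some((_, next)) = chars.clone().next() {
                    match next.to_digit(10) {
                        Some(digit) => {
                            number = number * 10 + digit as i32;
                            chars.next();
                        }
                        None => break,
                    }
                }
                Token::Number(number)
            }
            // Assumption: whitespace only separates tokens.
            c if c.is_whitespace() => continue,
            other => return Err(LexError { ch: other, offset }),
        };
        tokens.push(token);
    }
    Ok(tokens)
}
```

With this sketch, try_tokenize("1 + x") returns Err(LexError { ch: 'x', offset: 4 }) rather than a truncated token list.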

next_token

The private method next_token returns the next token from the input string, or None when the input is exhausted or the next character is not recognized. It uses pattern matching on characters to determine which type of token to return.

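    fn next_token(&mut self) -> Option<Token> {
        // Skip whitespace and take the first significant character;
        // `?` returns None once the input is exhausted.
        let next_char = self.chars.find(|c| !c.is_whitespace())?;
        match next_char {
            '+' => Some(Token::Plus),
            '-' => Some(Token::Minus),
            '*' => Some(Token::Multiply),
            '/' => Some(Token::Divide),
            '0'..='9' => {
                let mut number = next_char.to_digit(10)? as i32;
                // Peek at the next character by cloning the iterator, and
                // consume it only if it is another digit.
                while let Some(next_char) = self.chars.clone().next() {
                    if let Some(digit) = next_char.to_digit(10) {
                        number = number * 10 + digit as i32;
                        self.chars.next();
                    } else {
                        break;
                    }
                }
                Some(Token::Number(number))
            }
            // Any other character ends tokenization.
            _ => None,
        }
    }
} // closes impl<'a> Lexer<'a>

The method first skips any whitespace, which is what lets an input such as "1 + 2" produce all of its tokens instead of stopping at the first space. If it then encounters one of the four arithmetic operators (+, -, *, /), it returns the corresponding Token variant. If it encounters a digit character (0 to 9), it reads all subsequent digit characters to form a number and returns a Number token with that value. A run of digits too large for an i32 overflows, which panics in debug builds. If it encounters any other character, it returns None.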
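The digit loop needs to look at the next character without consuming it. The lexer above does this by cloning the iterator. A common alternative is the standard library's Peekable adapter, sketched below. PeekLexer is a hypothetical name, and this variant is an illustration of the technique rather than the article's implementation. It does not skip whitespace.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, PartialEq)]
pub enum Token {
    Number(i32),
    Plus,
    Minus,
    Multiply,
    Divide,
}

/// Hypothetical variant of the lexer built on Peekable.
pub struct PeekLexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> PeekLexer<'a> {
    pub fn new(input: &'a str) -> Self {
        PeekLexer { chars: input.chars().peekable() }
    }

    pub fn next_token(&mut self) -> Option<Token> {
        let first = self.chars.next()?;
        match first {
            '+' => Some(Token::Plus),
            '-' => Some(Token::Minus),
            '*' => Some(Token::Multiply),
            '/' => Some(Token::Divide),
            '0'..='9' => {
                let mut number = first.to_digit(10)? as i32;
                // peek() looks at the next character without consuming it;
                // the loop ends at the first non-digit or at end of input.
                while let Some(digit) = self.chars.peek().and_then(|c| c.to_digit(10)) {
                    number = number * 10 + digit as i32;
                    self.chars.next(); // now actually consume the digit
                }
                Some(Token::Number(number))
            }
            _ => None,
        }
    }
}
```

For the input "42+7", successive calls to next_token yield Number(42), Plus, Number(7), and then None.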

Example

Here's an example that shows how to use the lexer to tokenize an arithmetic expression:

let mut lexer = Lexer::new("1 + 2 * 3 - 4 / 5");
let tokens = lexer.tokenize();
assert_eq!(
    tokens,
    vec![
        Token::Number(1),
        Token::Plus,
        Token::Number(2),
        Token::Multiply,
        Token::Number(3),
        Token::Minus,
        Token::Number(4),
        Token::Divide,
        Token::Number(5)
    ]
);

This code creates a new instance of the Lexer with the input string "1 + 2 * 3 - 4 / 5", calls its tokenize method to obtain a vector of tokens, and then asserts that the resulting vector of tokens is equal to the expected value.
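To try the example locally, the fragments can be assembled into a single file. The sketch below does that under one assumption: whitespace is skipped before each token, which the spaced example input needs in order to produce all nine tokens. The lex function is a hypothetical convenience wrapper.

```rust
use std::str::Chars;

#[derive(Debug, PartialEq)]
pub enum Token {
    Number(i32),
    Plus,
    Minus,
    Multiply,
    Divide,
}

pub struct Lexer<'a> {
    chars: Chars<'a>,
}

impl<'a> Lexer<'a> {
    pub fn new(input: &'a str) -> Self {
        Lexer { chars: input.chars() }
    }

    pub fn tokenize(&mut self) -> Vec<Token> {
        let mut tokens = Vec::new();
        while let Some(token) = self.next_token() {
            tokens.push(token);
        }
        tokens
    }

    fn next_token(&mut self) -> Option<Token> {
        // Assumption: whitespace between tokens is skipped.
        let next_char = self.chars.find(|c| !c.is_whitespace())?;
        match next_char {
            '+' => Some(Token::Plus),
            '-' => Some(Token::Minus),
            '*' => Some(Token::Multiply),
            '/' => Some(Token::Divide),
            '0'..='9' => {
                let mut number = next_char.to_digit(10)? as i32;
                // Peek via a clone; consume only further digits.
                while let Some(digit) =
                    self.chars.clone().next().and_then(|c| c.to_digit(10))
                {
                    number = number * 10 + digit as i32;
                    self.chars.next();
                }
                Some(Token::Number(number))
            }
            _ => None,
        }
    }
}

/// Hypothetical convenience wrapper around Lexer::new and tokenize.
pub fn lex(input: &str) -> Vec<Token> {
    Lexer::new(input).tokenize()
}
```

With this sketch, lex("12*34") yields [Number(12), Multiply, Number(34)], an empty string yields an empty vector, and lex("1 + a + 2") yields only [Number(1), Plus] because tokenization stops at the unrecognized character a.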

If there is anything else you would like to know, contact me at @SensoryKopi.