Pavel Kutáč
Lexer - Expression calculator

Smart calculators are everywhere. You can type a whole expression and it evaluates it with proper operator priorities and handles parentheses. This article, and the series it belongs to, describes how to build your own smart calculator.

🇨🇿 You can read this article in Czech at kutac.cz


There are many ways to solve this problem. I will focus on an implementation with a Lexer, which tokenizes the input, and a parser, which converts the tokens into an Abstract Syntax Tree.

🔗 The whole code can be found on GitHub in the arxeiss/go-expression-calculator repository.

Lexer and tokens

The Lexer performs lexical analysis, which is the process of converting an input string into a list of tokens. Each token has its own meaning and can carry a value. The first step is to define all the tokens we will need.

type TokenType uint8

const (
    EOL TokenType = iota // End Of Line
    Unknown
    LPar
    RPar
    Exponent
    Multiplication
    Division
    FloorDiv
    Modulus
    Addition
    Substraction
    Number
    Identifier
    Whitespace
)

type Token struct {
    tType  TokenType
    value  float64
    idName string
}
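To make these token types concrete, here is an illustration (not taken from the repository itself) of the token sequence the lexer described below should produce for a simple expression:

// Illustration only: for the input "2 + 3 * (4 - 1)" the lexer below
// should emit tokens in this order (number values in parentheses):
//
//   Number(2) Addition Number(3) Multiplication LPar
//   Number(4) Substraction Number(1) RPar EOL
//
// EOL is always appended as the last token, so the parser later knows
// where the expression ends.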

"Cheating" with regular expression

The lexer can be implemented with a finite automaton, but that can get quite complicated, so I used a regular expression instead. The final regex is in the code and also on regex101.com. The main reason I decided to use a regular expression is to support multiple number formats, like 123, 5.48, .13, 71.41e2, 9.731e+3, 4.43e-4, and all other combinations not mentioned here.
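As a quick sanity check, the number-matching part of that regex can be tried on its own. The snippet below is only a demonstration, not part of the lexer; it anchors the pattern with ^ and $ so each sample string has to match completely:

package main

import (
    "fmt"
    "regexp"
)

// Only the number part of the lexer regex, anchored for this demo.
var numberRe = regexp.MustCompile(`^(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:e[+-]?[0-9]+)?$`)

func main() {
    for _, s := range []string{"123", "5.48", ".13", "71.41e2", "9.731e+3", "4.43e-4"} {
        fmt.Printf("%-10s -> %v\n", s, numberRe.MatchString(s)) // every sample prints true
    }
}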

Simplified Lexer

The code below is just a simplified version of the Lexer that is actually used. Some edge cases are not handled by this code. The full code, with all tests, can be found on GitHub.

With the regular expression, the input string is split into parts, which are then converted into tokens.

// Requires the standard "regexp" and "strconv" packages.
// Capture groups: 1 = number literal, 2 = identifier, 3 = whitespace;
// operators and parentheses are matched without any capture group.
var tokenRegexp = regexp.MustCompile(`\(|\)|\*\*|\^|//|%|\+|\-|\*|/|(?P<num>(?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)(?:e[+-]?[0-9]+)?)|(?P<id>(?i)[a-z_][a-z0-9_]*)|(?P<ws>\s+)`)

type Lexer struct {
    expr string
}

func NewLexer(expression string) *Lexer {
    return &Lexer{expr: expression}
}

func (l *Lexer) Tokenize() ([]*Token, error) {
    expr := make([]*Token, 0)

    for _, match := range tokenRegexp.FindAllStringSubmatch(l.expr, -1) {
        t := &Token{}
        switch {
        case match[2] != "":
            t.tType = Identifier
            t.idName = match[2]
        case match[1] != "":
            t.tType = Number
            var err error
            // TokenError and ErrInvalidNumber are defined in the full package on GitHub.
            if t.value, err = strconv.ParseFloat(match[1], 64); err != nil {
                return nil, TokenError(t, ErrInvalidNumber)
            }
        case match[3] != "": // Ignore whitespace for now and skip it
            continue
        default:
            t.tType = operatorTokenType(match[0])
        }
        expr = append(expr, t)
    }
    expr = append(expr, &Token{tType: EOL})
    return expr, nil
}

func operatorTokenType(operator string) TokenType {
    switch operator {
    case "(":
        return LPar
    case ")":
        return RPar
    case "^", "**":
        return Exponent
    case "*":
        return Multiplication
    case "/":
        return Division
    case "//":
        return FloorDiv
    case "%":
        return Modulus
    case "+":
        return Addition
    case "-":
        return Substraction
    }
    return Unknown
}
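A minimal usage sketch, assuming it lives in the same package as the Lexer above and that "fmt" is imported:

func exampleTokenize() {
    tokens, err := NewLexer("2 + 3 * (4 - 1)").Tokenize()
    if err != nil {
        fmt.Println("lexing failed:", err)
        return
    }
    // tokens holds the sequence sketched earlier: 9 tokens for the
    // expression itself plus the trailing EOL token.
    fmt.Println("token count:", len(tokens)) // token count: 10
}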

Why do we need tokens?

It would be possible to write a single module that parses the expression and computes the result in one pass. But splitting the logic into multiple modules is one of the best practices. Check the Single-responsibility principle and, optionally, the other design principles from SOLID.

If there were a requirement to support inputs like 5 PLUS 8 TIMES 4, only the lexer would have to change, not the rest of the program. The output of the Lexer would stay the same, as the hypothetical sketch below suggests.
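A sketch of that idea, not part of the actual repository: a word-based lexer could map keywords onto the very same TokenType values, and everything consuming the tokens would keep working unchanged. It assumes the standard "strings" package.

// Hypothetical helper, not in the real lexer: maps word operators
// to the same token types the symbol operators use.
func wordOperatorTokenType(word string) TokenType {
    switch strings.ToUpper(word) {
    case "PLUS":
        return Addition
    case "MINUS":
        return Substraction
    case "TIMES":
        return Multiplication
    case "DIVIDED", "OVER":
        return Division
    }
    return Unknown
}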
