DEV Community

loading...

Lets build a simple interpreter from scratch in python, pt.10: Tokenizer

smadev profile image sma Updated on ・2 min read

Coding a program using only recursive lists is very difficult and tedious. This kind of hard work must be done by computers. The first challenge we have to overcome is that we somehow transform human-readable program code into a structure that can be easily processed by the interpreter. Symbols, words, and values ​​must be separated into small meaningful structures, which are called tokens. In this post we will convert this:

a=1;
Enter fullscreen mode Exit fullscreen mode

to this:

[["name","a"], ["eq","="], ["number",1], ["semi",";"]]
Enter fullscreen mode Exit fullscreen mode

["name","a"] is a token like [type, value] . We currently don't need to create a Token class for simplicity's sake.

Warning: The code below does not check for edge cases. It was written to reach the result as soon as possible instead of wasting time to trying to be idiomatic.
Lets try:

# We need a generator function that yields items of strings or lists:
def gen(xs):
    for x in xs:
        yield x

def tokenize(xs):
    isalpha=lambda x: (x>="a" and x<="z") or (x>="A" and x<="Z") or x=="_"
    reserved="function,set,get,print,return,if,else,for,while,break,continue".split(",")
    symbols={"(":"lparen",")":"rparen","{":"lcurly","}":"rcurly",",":"comma",";":"semi",
             "=":"equal"}
    g=gen(xs)
    i=next(g)
    tokens=[]
    buf=""
    try:
        while True:
            # Skip whitespace;
            while i in " \r\n\t":
                i=next(g);

            # Token starts with alphas ;
            if isalpha(i):
                # and continues with alphanumeric
                while isalpha(i) or i in "0123456789":
                    buf+=i
                    i=next(g)
                if buf in reserved:
                    tokens.append( [buf,buf] )
                else:
                    tokens.append( ["name",buf] )
                buf=""

            # We found a string;
            if i in "\"'":
                delimiter=i
                i=next(g)
                while i!=delimiter:
                    buf+=i
                    i=next(g)
                # Skip delimiter;
                i=next(g) 
                tokens.append( ["string",buf] )
                buf=""

            # We found a symbol;
            if i in "(){};=,":
                tokens.append( [symbols[i],i] )
                i=next(g)

            # We found a number;
            if i in "+-0123456789":
                while i in "+-0123456789.e":
                    buf+=i
                    i=next(g)
                if "e" in buf or "." in buf:
                    tokens.append( ["number",float(buf)] )
                else:
                    tokens.append( ["number",int(buf)] )
                buf=""

    except StopIteration:
        pass
    return tokens

# test it;
print(tokenize("if (gt(a,b)) return a; else return b;"))
Enter fullscreen mode Exit fullscreen mode

output:

[['if', 'if'], ['lparen', '('], ['name', 'gt'], ['lparen', '('], ['name', 'a'], ['comma', ','], ['name', 'b'], ['rparen', ')'], ['rparen', ')'], ['return', 'return'], ['name', 'a'], ['semi', ';'], ['else', 'else'], ['return', 'return'], ['name', 'b'], ['semi', ';']]
Enter fullscreen mode Exit fullscreen mode

It works :P

In the next post we will write a parser class that converts 'a=1;' to '["Set","a",1]'

Links: Twitter,Patreon

Discussion (0)

pic
Editor guide