Lets build a simple interpreter from scratch in python, pt.10: Tokenizer

#interpreter #fromstratch #python

Coding a program using only recursive lists is very difficult and tedious. This kind of hard work must be done by computers. The first challenge we have to overcome is that we somehow transform human-readable program code into a structure that can be easily processed by the interpreter. Symbols, words, and values must be separated into small meaningful structures, which are called tokens. In this post we will convert this:

a=1;

to this:

[["name","a"], ["eq","="], ["number",1], ["semi",";"]]

["name","a"] is a token like [type, value] . We currently don't need to create a Token class for simplicity's sake.

Warning: The code below does not check for edge cases. It was written to reach the result as soon as possible instead of wasting time to trying to be idiomatic.
Lets try:

# We need a generator function that yields items of strings or lists:
def gen(xs):
    for x in xs:
        yield x

def tokenize(xs):
    isalpha=lambda x: (x>="a" and x<="z") or (x>="A" and x<="Z") or x=="_"
    reserved="function,set,get,print,return,if,else,for,while,break,continue".split(",")
    symbols={"(":"lparen",")":"rparen","{":"lcurly","}":"rcurly",",":"comma",";":"semi",
             "=":"equal"}
    g=gen(xs)
    i=next(g)
    tokens=[]
    buf=""
    try:
        while True:
            # Skip whitespace;
            while i in " \r\n\t":
                i=next(g);

            # Token starts with alphas ;
            if isalpha(i):
                # and continues with alphanumeric
                while isalpha(i) or i in "0123456789":
                    buf+=i
                    i=next(g)
                if buf in reserved:
                    tokens.append( [buf,buf] )
                else:
                    tokens.append( ["name",buf] )
                buf=""

            # We found a string;
            if i in "\"'":
                delimiter=i
                i=next(g)
                while i!=delimiter:
                    buf+=i
                    i=next(g)
                # Skip delimiter;
                i=next(g) 
                tokens.append( ["string",buf] )
                buf=""

            # We found a symbol;
            if i in "(){};=,":
                tokens.append( [symbols[i],i] )
                i=next(g)

            # We found a number;
            if i in "+-0123456789":
                while i in "+-0123456789.e":
                    buf+=i
                    i=next(g)
                if "e" in buf or "." in buf:
                    tokens.append( ["number",float(buf)] )
                else:
                    tokens.append( ["number",int(buf)] )
                buf=""

    except StopIteration:
        pass
    return tokens

# test it;
print(tokenize("if (gt(a,b)) return a; else return b;"))

output:

[['if', 'if'], ['lparen', '('], ['name', 'gt'], ['lparen', '('], ['name', 'a'], ['comma', ','], ['name', 'b'], ['rparen', ')'], ['rparen', ')'], ['return', 'return'], ['name', 'a'], ['semi', ';'], ['else', 'else'], ['return', 'return'], ['name', 'b'], ['semi', ';']]