Chig Beef

Parsing Python (Pogo Pt: 6)

Intro

In this series I am creating a transpiler from Python to Golang called Pogo. In the last post we created an object called a Structure, which we are going to use in this post to create an abstract syntax tree through the use of a parser.

What is a parser

A parser in most simple compilers ends up being the largest part of the project, because it has to check for many different syntax cases. The parser turns the slice of tokens that we received from the lexer into a tree structure. We do this because code is not linear like an array; everything works in groups. For example, an if statement would be a parent node to multiple print statements, and inside that if statement we could have a loop. In this way we can create complicated tree structures by parsing the code.
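
To picture what we are aiming for, here is a rough, hypothetical illustration of the tree that an if statement wrapping two print calls might become. The structureCode keys "IF" and "CALL" are made up for this example, and the field order follows the positional Structure literals used later in this post.

// Hypothetical shape only: an if node acting as the parent of two print calls.
// The keys "IF" and "CALL" are illustrative, not necessarily Pogo's real keys.
var ifBlock = Structure{
    []Structure{
        {[]Structure{}, structureCode["CALL"], "print"}, // print("hello")
        {[]Structure{}, structureCode["CALL"], "print"}, // print("world")
    },
    structureCode["IF"],
    "if",
}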

The Parser Struct

The Parser has to keep track of 4 values. First, the source, which is the slice of Tokens we got from the lexer. We also need to keep track of our current position in the source, very much like in the lexer. We also need the current token we are looking at, and lastly we need a slice of ints for markers. Markers are saved positions that we can jump back to in case we need to return to an earlier point in the source.

type Parser struct {
    curPos   int     // current position in the source
    curToken Token   // token at curPos
    source   []Token // the tokens produced by the lexer
    markers  []int   // saved positions we can jump back to
}
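
As a quick sketch of how this might be wired up (newParser is a hypothetical helper, not something defined in the earlier posts), we hand the parser the lexer's output and prime curToken with the first token so the helper methods below always have something to look at.

// Hypothetical constructor: take the lexer's tokens and prime the parser.
func newParser(tokens []Token) Parser {
    p := Parser{
        curPos:  0,
        source:  tokens,
        markers: []int{},
    }
    if len(tokens) > 0 {
        p.curToken = tokens[0]
    }
    return p
}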

Creating Helper Functions

We need to effectively be able to move around the source freely, so we will make some functions to help us with this.
Firstly, we need a function to move us onto the next token in the source.

func (p *Parser) nextToken() {
    p.curPos++
    if p.curPos >= len(p.source) {
        p.curToken = Token{} // Nil
    } else {
        p.curToken = p.source[p.curPos]
    }
}

This should look familiar from the lexer: it simply moves on to the next token, and if there isn't one, sets the current token to an empty value.
Usually, once we move to the next token we want something useful, and comments (and newlines) get in the way of that, so we need a way to step over them without deleting them.

func (p *Parser) nextTokenNoNotes() []Structure {
    p.nextToken()
    sts := []Structure{}

    for p.curToken.code == tokenCode["COMMENT_ONE"] || p.curToken.code == tokenCode["NEWLINE"] {
        sc := structureCode["COMMENT_ONE"]
        if p.curToken.code == tokenCode["NEWLINE"] {
            sc = structureCode["NEWLINE"]
        }
        sts = append(sts, Structure{[]Structure{}, sc, p.curToken.text})
        p.nextToken()
    }
    return sts
}

This function creates a slice of Structures to hold all the newlines and comments. We keep appending to this slice as long as we keep seeing comments and newlines. Since structure codes and token codes don't align perfectly, we need to check within the loop which Token we got and create the Structure accordingly.
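
As a rough usage sketch (this caller method is hypothetical), whatever parse method calls nextTokenNoNotes can splice the returned comment and newline Structures straight into the slice it is building, so nothing from the original file is lost:

// Hypothetical caller: keep the comments/newlines we skipped over,
// then carry on with the next useful token.
func (p *Parser) exampleSkipNotes() []Structure {
    statements := []Structure{}
    statements = append(statements, p.nextTokenNoNotes()...)
    // p.curToken is now the first token after the comments/newlines.
    return statements
}
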
Next, we have a function that can tell us what the future token is going to be.

func (p *Parser) peek() Token {
    if p.curPos >= len(p.source)-1 {
        return Token{}
    }
    return p.source[p.curPos+1]
}

Similarly, it would be good to move back just one token, in case we had a choice and we chose the wrong path.

func (p *Parser) rollBack() {
    p.curPos--
    if p.curPos < 0 {
        p.curToken = Token{} // Nil
    } else {
        p.curToken = p.source[p.curPos]
    }
}

There are quite a few similarities between these functions: find the token we want, check whether it exists, and handle it accordingly.

func (p *Parser) setMarker() {
    p.markers = append(p.markers, p.curPos)
}

func (p *Parser) gotoMarker() {
    p.curPos = p.markers[len(p.markers)-1]
    p.curToken = p.source[p.curPos]
    p.markers = p.markers[:len(p.markers)-1]
}

These 2 functions are used for when we are deep in a parsing branch and need to be pulled out. Before we get into the branch, we set a marker as a safety net. As soon as we realize that we are in the wrong branch, we call gotoMarker and it pulls us back to the most recently placed marker. gotoMarker also removes that marker.
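
As a sketch of how this might look in practice (parseAssignment and parseExpression are hypothetical stand-ins for the parsing methods coming in the next post): set a marker, try one branch, and if it fails, roll all the way back and try the other.

// Hypothetical example: try one parse branch, fall back to another on failure.
func (p *Parser) exampleEither() (Structure, error) {
    p.setMarker()
    st, err := p.parseAssignment() // hypothetical first branch
    if err == nil {
        p.markers = p.markers[:len(p.markers)-1] // branch worked, drop the marker
        return st, nil
    }
    p.gotoMarker()              // wrong branch, jump back to where we started
    return p.parseExpression() // hypothetical fallback branch
}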

Indentation

The part of Python's syntax that makes our lives slightly harder is the fact that it uses indentation to group code rather than literally anything else. In Python we don't get a clear character that marks the end of a block, as opposed to Go, which tells us with a }, or even Ruby, which uses end. So now we have to figure out when groups end by ourselves.
All we know is that when the indentation of this line is lower than the previous line's indentation, we have a block end.

func (p *Parser) replaceIndents(input []Token) []Token {
    // First pass: count the indentation level of every line
    indents := []int{0}
    curIndex := 0
    for i := 0; i < len(input); i++ {
        if input[i].code == tokenCode["NEWLINE"] {
            curIndex++
            indents = append(indents, 0)
        } else if input[i].code == tokenCode["INDENT"] {
            indents[curIndex]++
        }
    }

    // Second pass: emit anti-colons wherever the indentation drops,
    // and strip the indent tokens themselves
    output := []Token{}
    curIndex = 0
    for i := 0; i < len(input); i++ {
        if input[i].code == tokenCode["NEWLINE"] {
            curIndex++
            if indents[curIndex] < indents[curIndex-1] {
                for j := 0; j < indents[curIndex-1]-indents[curIndex]; j++ {
                    output = append(output, Token{tokenCode["ANTI_COLON"], ":"})
                }
            }
        }
        if input[i].code == tokenCode["INDENT"] {
            continue
        }
        output = append(output, input[i])
    }
    return output
}

To start off, this code counts the indentation level of each line. We do this by storing the indentation levels in a slice of ints. We loop over the source, incrementing the indentation level every time we come across an indent, and when we find a newline we add another entry to the slice to store the next line's information.

Now we need to start placing the block ends, which I have called anti-colons because they do the opposite of a colon in Python.
Firstly, let's look at the second if statement in the loop. This if statement simply checks whether we have an indent, and by using continue we don't add it to output, which is the slice of Tokens that will replace our source. This way, our final source should contain no indents.

The first if statement, similar to the if statement in the first loop, checks whether we are starting a new line, and if we are, we increment our counter so we are looking at the indentation level of the correct line.
Now we need to check whether we should add an anti-colon here, and we only need to if the indentation level of our current line is less than the indentation level of the previous line, implying we have finished a block.
If this is the case, we add not one anti-colon, but as many as we need to match up the indentation levels. We need to do this in case the last line was a nested structure, which in Go usually ends with 2 }s.

This all works until you end the Python source file with indentation, because then there's no next line to check for lower indentation than the previous line.
To fix this, we count up all of the colons and anti-colons in the code. If there is a difference, we add more anti-colons to the end, which is a simple process.

We end the function by returning output with an extra newline, which helps keep us from leaving the bounds of the source (adds a margin of error).
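
As a rough sketch (assuming a "COLON" entry exists in tokenCode alongside "ANTI_COLON"), the tail of replaceIndents could look something like this:

    // Sketch only (assumes tokenCode["COLON"] exists): balance the colons
    // against the anti-colons, then pad the output with a final newline.
    colons, antiColons := 0, 0
    for i := 0; i < len(output); i++ {
        if output[i].code == tokenCode["COLON"] {
            colons++
        } else if output[i].code == tokenCode["ANTI_COLON"] {
            antiColons++
        }
    }
    for i := 0; i < colons-antiColons; i++ {
        output = append(output, Token{tokenCode["ANTI_COLON"], ":"})
    }
    output = append(output, Token{tokenCode["NEWLINE"], "\n"})
    return output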

Checking the Source

Now that we have a way of moving around the source, and have made the source easier to work with, we can now implement more methods to simplify the checking process.

Firstly, we need a way to check whether the current token we are looking at is what we expect.

func (p *Parser) checkToken(tokenKey string) (Structure, error) {
    if p.curToken.code == tokenCode[tokenKey] {
        return Structure{[]Structure{}, structureCode[tokenKey], p.curToken.text}, nil
    }
    return Structure{}, errors.New("[Parse (checkToken)] Expected " + tokenKey + ", got " + p.curToken.text)
}

This is a very simple method that we can use all the time, and we check whether the returned error is nil to decide whether we got what we wanted.
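
For example, a typical call site inside a parse method might look like this (the "IDENTIFIER" key is an assumed token key for illustration):

// Hypothetical call site inside a parse method: bail out on a mismatch.
func (p *Parser) exampleExpectIdentifier() (Structure, error) {
    st, err := p.checkToken("IDENTIFIER") // "IDENTIFIER" is an assumed key
    if err != nil {
        return Structure{}, err
    }
    p.nextToken()
    return st, nil
}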

What if we have a bunch of tokens in a row that we want to check? We could call checkToken repeatedly, but then we would be error checking after every single call, which bloats the code quickly. Instead we will create another method that checks whether the next range of tokens is correct.

func (p *Parser) checkTokenRange(tokenKeys []string) ([]Structure, error) {
    structures := []Structure{}
    for i := 0; i < len(tokenKeys); i++ {
        temp, err := p.checkToken(tokenKeys[i])
        if err != nil {
            return structures, err
        }
        structures = append(structures, temp)
        p.nextToken()
    }
    return structures, nil
}

This method returns the first error it comes across, or a slice of Structures which should correlate with what we were checking for.
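
As a rough example of what this saves us (the token keys here are assumptions, roughly what the start of a Python for loop might look like), one call now replaces several checkToken calls and error checks:

// Hypothetical: check several expected tokens in one go.
// The keys "FOR", "IDENTIFIER" and "IN" are assumed for illustration.
func (p *Parser) exampleForHeader() ([]Structure, error) {
    return p.checkTokenRange([]string{"FOR", "IDENTIFIER", "IN"})
}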

Another useful method would simplify a simple choice between possible tokens. For example, if we are assigning a value to a variable, that value could be an identifier or any literal. And since this isn't the only place where this occurs, it would be nice to keep the process simple.

func (p *Parser) checkTokenChoices(tokenKeys []string) (Structure, error) {
    for i := 0; i < len(tokenKeys); i++ {
        if p.curToken.code == tokenCode[tokenKeys[i]] {
            return Structure{[]Structure{}, structureCode[tokenKeys[i]], p.curToken.text}, nil
        }
    }
    errText := ""
    for i := 0; i < len(tokenKeys); i++ {
        errText += tokenKeys[i]
        errText += " or "
    }
    errText = errText[:len(errText)-4]
    return Structure{}, errors.New("[Parse (checkTokenChoices)] Expected " + errText + ", got " + p.curToken.text)
}

Similar to checkTokenRange, we decide whether our token is valid by comparing it against a slice of keys. The main section we are interested in is the first loop, which simply returns as soon as it finds a valid option.
The rest of the code just builds the error text, which lists all of the values we were looking for.
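
So the assignment example from earlier might boil down to a single call (again, the keys here are assumptions standing in for whatever Pogo's real token keys are):

// Hypothetical: the right-hand side of an assignment can be an identifier
// or a literal. The keys below are assumed for illustration.
func (p *Parser) exampleAssignValue() (Structure, error) {
    return p.checkTokenChoices([]string{"IDENTIFIER", "INT_LITERAL", "STRING_LITERAL", "BOOL_LITERAL"})
}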

Now, we have gone through a lot of code, so I'll explain the rest of the parser in the next post. The parser did end up being quite large, even after the generalizations we just made with these methods.

Next

In the next post we will create the parse method on the parser, which, similar to lex on the lexer, will be the main method that returns our output, which in this case is an abstract syntax tree. This method will use many more methods to create the tree, with each method specializing in blocks, statements, comparisons, etc.
