DEV Community

Cover image for How to Build a RegEx Engine in Python (Part 5: Parser conclusion)
Lorenzo Felletti
Lorenzo Felletti

Posted on • Originally published at python.plainenglish.io

How to Build a RegEx Engine in Python (Part 5: Parser conclusion)

In this article, we’ll dig rapidly through the code of the parser, which will produce the AST we discussed in the previous article in output.

This will hopefully be the last piece about the parser, and then we’ll move on to the engine and the juicy part, the backtracking system.

So, the parser is contained in the Pyrser class, as we already saw in Part 3, which has the method parse, that takes as input a regular expression in the form of a string, and gives us the corresponding AST in output.

Pyrser structure

 raw `Pyrser` endraw  structure.

Pyrser structure.

We already spoke about the various functions and what is their purpose in previous articles.

Now we will look closer to the code of some of them. I will not explain what each line does, this isn’t the purpose of this series of articles, but I will rather express some final thoughts after.

If you’re not super interested in the code, feel free to skip reading through it.

Parse Regex Sequence

 raw `parse_re_seq` endraw  code.

parse_re_seq code.

Parse Group

 raw `parse_group` endraw  code.

parse_group code.

Parse Element

 raw `parse_el` endraw  code.

parse_el code.

Final Words

In conclusion, the parser was not as hard to code as hard to debug.

When I first coded it I did it quickly and without having a perfectly clear idea in mind of what I wanted in output.

In fact, the parse_re function is indeed a leftover of an error:

  • in the beginning, the ^ and $ symbols were meant to be “absolute”, meaning that once typed, they would have been applied to every single regex in the sequence
  • This meant that a regex such as ^a|b$ returned false on test string cb, while it should return true because that b must be at the beginning isn’t specified in the second regex, but this original implementation maintained the “match start” from the other one
  • This happened because at the beginning I thought that should be the “right” behavior, and I created the RE node of the AST just for this purpose.

I later realized that this behavior was confusing and didn’t make any sense, so I had to figure out a way to change it, possibly maintaining the RE node, because the engine was already coded and using it.

In the end, parse_re remained, completely emptied of any behavior, useless, for the sake of compatibility with the — already working — engine.

Apart from this, the parser and the AST does a really good job overall, and coding a TDRDP parser isn’t so hard once you figured out a good grammar, which is no joke for non-trivial projects.

Top comments (0)