DEV Community

Cover image for How to Build a RegEx Engine in Python (Part 4: The AST)
Lorenzo Felletti
Lorenzo Felletti

Posted on • Originally published at Medium

How to Build a RegEx Engine in Python (Part 4: The AST)

In this article, I will present you the process I followed to end up with the final AST of the solution, and which are its strength and flaws (because yes, following my articles you won’t create a perfect solution, sorry).

So, let’s start with the logic blocks I wanted to have to build my AST.

I felt I needed:

  • a Group block (to group various elements)
  • an Element block, or more in general a Leaf block (used for elements, wildcards, match start/end, range elements, …)
  • an Or block, to express the alternative
  • a Regex block, to express the “single point of entry” (kinda useless, it will turn out)

So, each regex will have a root node called RE, the fancy name I chose for the top-level block acting as a single point of entry.

This node is actually quite useless, but also harmless, and in the next article, I will briefly discuss why it stayed in the final solution.

Having these blocks clear in mind, I created this hierarchy of Python classes:

Hierarchy of AST nodes.

The base class is the ASTNode, and the RE, Group, and Or nodes directly inherit from it.

All the other nodes will be leaves of our AST, meaning they will not have children nodes, so I created the intermediate class LeafNode (inheriting from ASTNode) from which they all inherit, and that serves as a base class to group them all.

Oops, I made a mistake, and then another (maybe)

I decided on the fly that it would have been good for the RE node to only have one child, so I called this property child.

Following the same idea, the GroupNode has a children property named children, and the OrNode has two properties left and right for its children.

This was making total sense in my mind, until I had to code the engine, in which, as we will see, we want to iterate through the children of these nodes.

You can guess the error now: having all these different names for (basically) the same property is not a good design choice.

You cannot write node.children and move on, you have to check the type of the node, and then access the children using the right field name.

This would make the code more error-prone, uselessly longer, and unnecessarily leads to WET code.

We need to DRY this code, but how? We need to call all these properties with the same name: children.

But the parser is already using the child, children, left, and right properties, and it would be painful to rewrite the code using the newborn property.

This is when I had the “eureka!”:

Just add the children property to every class and handle the mapping under the hood.

A (mostly harmless) flaw

The problem is that, as it is implemented, if someone changes, for example, the right child using the node.right property after the node is created the change is not propagated to the children property, and vice-versa.

This is an error, actually, but I was too lazy to figure out a different solution, and after all, changing a node after its creation doesn’t make much sense, so we can supersede this error.

Conclusion

To conclude, the AST hierarchy we created is good enough to don’t end up being an obstacle to our project, even if it has a couple flaws, not very relevant in our context.

Anyway, it is crucial to notice these flaws, and trying to avoid making the mistakes leading to them next time would be good for sure.


Cover image by Jr Korpa on Unsplash

Top comments (0)