DEV Community

Lorenzo Felletti
Lorenzo Felletti

Posted on • Originally published at python.plainenglish.io

How to Build a RegEx Engine in Python (Part 7: The Backtracking System)

In this article, we will complete our regex engine with the backtracking system.

This article is the seventh part of my series on how to build a regular expression engine, so if this is the first article you read, I advise you to read the previous ones and then come back to this one.

The backtracking feature was one of the most difficult to implement and is by far the most interesting feature of this project.

Before coming up with the final and (hopefully) correct version of the backtracking system, I implemented a couple of “partially correct” solutions, each of them not working in some edge cases.

To find out all these bugs, testing was crucial, and mainly unit testing. If you don’t already use unit tests in your projects, I strongly suggest you start using them.

When to Backtrack?

To figure out when it is proper to backtrack, and when it is not, is not a trivial task, but it is the first thing you have to sort out when implementing backtracking.

First, you have to understand what your engine can or cannot already do, to not implement the same feature twice.

For example, our engine so far can already do some backtracking: in case of an or (OrNode), if the left branch fails it tries to match the right branch, so we don’t have to implement this in our backtracking system.

When trying to match a regular expression to a given string, let’s call it test string, and you have an engine requiring backtracking (as we saw in part 6) you have to operate these steps:

  • Start from index 0 of the test string, and the leftmost leaf of the AST representing the regex
  • Read the first character of the test string and the leftmost leaf of the AST
  • Do they match? If yes, move on (index 1 or the test string, next leaf in the AST), if not try to backtrack
  • If backtracking fails, then you have a complete fail: the test string does not match the regex.

Backtracking is possible if:

  • I matched an element more times than the minimum number of times it has to be matched(e.g. we have an A to be matched 1+ times, and we matched it more than once)
  • An element B is matched the minimum number of times, but a previous element A is matched more than the minimum number of times.

Groups Sweet Groups!

Our match_group function walks the AST going down a level every time a non-leaf node is encountered, by calling itself recursively.

So, from a match_group function call point of view, “the world is flat” — what it does is just to iterate through its children, children that are all at the same depth in the AST.

The only difference is that a leaf node is encountered there isn’t another function call to get to the new state, while when a non-leaf is found, another match_group is called, and at the end of its execution, the upper level gets the new state.

For a single match_group there is no tree, just a list of elements to iterate through.

And we can exploit this to make our backtracking system also “flat”, meaning that it takes care of only one level.

So, we will define a “flat” backtracking system, and every “level” (i.e. match_group) will have its own.

In doing this, we greatly simplify the backtracking system responsibility, but without losing generality, because the call stack will take care of executing the operations in the right order. In a way, the “depth management” part of our system comes “for free”, thanks to the call stack.

How to Backtrack?

Thanks to the fact that our engine works with groups, we don’t have to create “global” backtracking, but just a system capable of handling backtracking inside a group.

To backtrack, we have to store somewhere the information about the explored path(s), and we also need a strategy to explore all the possible paths.

The information to store is pretty simple:

  • An array (that we will use as a stack — called backtrack_stack) containing one element for every visited node.

Each element of the array, representing a visited node, must contain the following information:

  • the visited node(you don’t even need the whole node, but only its index in the children array — the aforementioned array on which match_groupoperates)
  • how many times have you matched a node
  • the minimum amount of times it must be matched
  • an array of how many characters were consumed each time it was matched.

That’s it, this is all the information we need to store in the backtrack stack every time we visit a node.

A backtrack stack will be created for each match_group invocation, tracking only the path explored for sibling nodes.

Let’s Backtrack

Now, with the backtrack stack so defined and the backtracking rules I described earlier, we can state that:

  1. When there is a request for backtracking and the element on top of the stack is matched more times than the minimum, the last consumption must be removed from its consumption list and the consumed characters of the test string updated as follows: test_str_consumed_characters -= last_consumption
  2. When there is a request for backtracking and the element on top of the stack is already matched the minimum number of times, the element should be removed and the consumed characters of the test string updated as follows: foreach consumption in consumed_list_of_the_element: test_str_consumed_characters -= consumption
  3. When there is a request for backtracking on an empty backtrack stack, all the possible paths have been explored and the request must cause the failure of the match (i.e. regex and test string doesn’t match).

By applying these simple three rules in our code, and exploiting the recursion of match_group, we cover all the possible backtracking cases we can encounter, and we obtain the full-fledged RegEx Engine that was our goal to build.

Conclusion

With this article we finished building the RegEx Engine we wanted to build, thus this article concludes this series.

However, if one day I will want to deepen aspects of this project I didn’t so far, I cannot rule out that I will write other articles about this topic, but for now, the series is complete enough in my opinion.

Let me know if you liked this series of articles, or if you have any suggestions or doubts about the topics discussed here.

Top comments (0)