loading...

My favourite computer science text (OR "how regex works")

hkrogstie profile image Håvard Krogstie ・2 min read

One of my favourite things in CS is reading about algorithms not as a tutorial, but as more of a historical presentation. My favourite example is https://swtch.com/~rsc/regexp/regexp2.html,
which discusses how Nondeterministic Finite Automata modelled as virtual machine bytecode can be used instead of traditional backtracking algorithms when regex matching. All these terms are explained in the text.

The reason I like the text so much is that the algorithms are thoroughly presented and explained, with references to original papers from the 1960s by names like Rob Pike and Ken Thompson. It's nice to recognize names "in the wild", and then see examples of just what those people did. Much better than reading their Wikipedia pages imo.

And more familiar (and unfamiliar) names appear as the history is discussed. The algorithm by Thompson was used in the original Unix grep. Alfred Aho (the A in awk) is one of the authors of the infamous Dragon Book, where the technique is discussed, but without clear attribution, as several people have made DFAs and NFAs from regular expressions, without necessarily saying they do (Thompson didn't, he just made the machine code).

Aho is of course the guy from the Aho-Corasick string searching algorithm, which I randomly had to learn at some point to solve a hackerrank-task, but which I mainly remember specifically because the name Aho suddenly appeared everywhere else afterward (I had held the Dragon book, but didn't bother remembering any of the authors at first).

Anyways, the article I shared is part 2/3 of a series, which is my favourite part for reasons discussed above, but you should absolutely read the other parts if you like it.

Posted on by:

hkrogstie profile

Håvard Krogstie

@hkrogstie

Played around with programming in different languages over the years. Trying to unlearn old habits while keeping those worth keeping. Doing algorithms contests, bronze medal from IOI.

Discussion

markdown guide
 

It would be nice to know the exact limitations of Regex. Like, why cannot it parse XML?

 

The article doesn't discuss much of the theory behind regex, but it does talk about DFA and NFAs, which are finite automata, and how regex can be turned into either, and how NFAs and DFAs can be turned into regex. The fact that they can be converted to each other means they are capable of recognizing exactly the same languages, known as regular languages.

Now what's important about "finite automata" is that they have a finite amount of states. Each letter parsed is equivalent to a state transition. Let's say you had a regex capable of recognizing all valid XML. If the corresponding finite automata has N states, and I give N+1 starting tags <mytag>, then XML requires N+1 </mytag>, but the finite automata can at most distinguish between N different amounts of starting tags. That means it won't be capable of counting down the N+1 ending tags that are required. It just doesn't have enough different states to track the progress from N+1 to 0. This argument holds for any regex you could ever try to recognize XML with. XML simply isn't a regular language. My argument closely resembles what's called the pumping lemma for regular languages.