DEV Community

Håvard Krogstie
Håvard Krogstie

Posted on

My favourite computer science text (OR "how regex works")

One of my favourite things in CS is reading about algorithms not as a tutorial, but as more of a historical presentation. My favourite example is https://swtch.com/~rsc/regexp/regexp2.html,
which discusses how Nondeterministic Finite Automata modelled as virtual machine bytecode can be used instead of traditional backtracking algorithms when regex matching. All these terms are explained in the text.

The reason I like the text so much is that the algorithms are thoroughly presented and explained, with references to original papers from the 1960s by names like Rob Pike and Ken Thompson. It's nice to recognize names "in the wild", and then see examples of just what those people did. Much better than reading their Wikipedia pages imo.

And more familiar (and unfamiliar) names appear as the history is discussed. The algorithm by Thompson was used in the original Unix grep. Alfred Aho (the A in awk) is one of the authors of the infamous Dragon Book, where the technique is discussed, but without clear attribution, as several people have made DFAs and NFAs from regular expressions, without necessarily saying they do (Thompson didn't, he just made the machine code).

Aho is of course the guy from the Aho-Corasick string searching algorithm, which I randomly had to learn at some point to solve a hackerrank-task, but which I mainly remember specifically because the name Aho suddenly appeared everywhere else afterward (I had held the Dragon book, but didn't bother remembering any of the authors at first).

Anyways, the article I shared is part 2/3 of a series, which is my favourite part for reasons discussed above, but you should absolutely read the other parts if you like it.

Image of AssemblyAI tool

Transforming Interviews into Publishable Stories with AssemblyAI

Insightview is a modern web application that streamlines the interview workflow for journalists. By leveraging AssemblyAI's LeMUR and Universal-2 technology, it transforms raw interview recordings into structured, actionable content, dramatically reducing the time from recording to publication.

Key Features:
🎥 Audio/video file upload with real-time preview
🗣️ Advanced transcription with speaker identification
⭐ Automatic highlight extraction of key moments
✍️ AI-powered article draft generation
📤 Export interview's subtitles in VTT format

Read full post

Top comments (2)

Collapse
 
patarapolw profile image
Pacharapol Withayasakpunt

It would be nice to know the exact limitations of Regex. Like, why cannot it parse XML?

Collapse
 
hkrogstie profile image
Håvard Krogstie • Edited

The article doesn't discuss much of the theory behind regex, but it does talk about DFA and NFAs, which are finite automata, and how regex can be turned into either, and how NFAs and DFAs can be turned into regex. The fact that they can be converted to each other means they are capable of recognizing exactly the same languages, known as regular languages.

Now what's important about "finite automata" is that they have a finite amount of states. Each letter parsed is equivalent to a state transition. Let's say you had a regex capable of recognizing all valid XML. If the corresponding finite automata has N states, and I give N+1 starting tags <mytag>, then XML requires N+1 </mytag>, but the finite automata can at most distinguish between N different amounts of starting tags. That means it won't be capable of counting down the N+1 ending tags that are required. It just doesn't have enough different states to track the progress from N+1 to 0. This argument holds for any regex you could ever try to recognize XML with. XML simply isn't a regular language. My argument closely resembles what's called the pumping lemma for regular languages.

Billboard image

The Next Generation Developer Platform

Coherence is the first Platform-as-a-Service you can control. Unlike "black-box" platforms that are opinionated about the infra you can deploy, Coherence is powered by CNC, the open-source IaC framework, which offers limitless customization.

Learn more

👋 Kindness is contagious

Discover a treasure trove of wisdom within this insightful piece, highly respected in the nurturing DEV Community enviroment. Developers, whether novice or expert, are encouraged to participate and add to our shared knowledge basin.

A simple "thank you" can illuminate someone's day. Express your appreciation in the comments section!

On DEV, sharing ideas smoothens our journey and strengthens our community ties. Learn something useful? Offering a quick thanks to the author is deeply appreciated.

Okay