DEV Community

Exploring the Linguistics Behind Regular Expressions

Alaina Kafkes on November 20, 2017

Regular expressions inspire fear in new and experienced programmers alike. When I first saw a regular expression — often abbreviated as “regex” — I...

Max Cerrina • Nov 21 '17 • Edited

I love it :D I'm a regex fiend and I never get tired of them.

The one thing I wanted to mention (and it might just be that I'm misreading) is something that took me a long time to figure out when I first learned about it, so I always like to clarify it: my understanding and experience thus far is that, assuming no flags, /+/ and /*/ are both greedy and NOT possessive in isolation; it is specifically appending + to itself or a /*/ that creates the non-backtracking.

i.e.:
/r*r/ will match "narwhal" and "narrative". It will also match the contrived "narrrrrrrp". In each case the greedy r* first grabs every "r" it can, then gives one back so that the final r in the pattern still has something to match.

Similarly, as expected, /r+r/ will not match "narwhal" but will match "narrative". Same for "narrrrrrrp". (I feel like I missed a pirate joke opportunity...) If you try "narrrp" you can see that it is not possessive: it matches, whereas a possessive quantifier would keep every "r" and leave nothing for that last "r" in the pattern.

/r*+r/ will match none of those, no matter how long a pirate noise you make.

I played around with it some on Regex101, adding a capture group just to make it clearer. With /*/ and /+/, capture group 1 always holds one character fewer than the full match, until the match flat-out fails with /*+/. The same goes for "abcorgi": /.*corgi/ and /.+corgi/ both match "abcorgi", and the lazy /.*?corgi/ matches "abcorgi" as well, because the match still starts at the first character; /.*+corgi/ fails. That final version says "take any character as an option, as many as possible, and don't give any back", which leaves nothing for the rest of the pattern to match.
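
For anyone who wants to reproduce this outside Regex101, here is a rough Python sketch of the same experiments. It is only an illustration, not the exact patterns used above. Python's re module accepts possessive quantifiers like /r*+/ only from version 3.11, so the sketch emulates them with a lookahead plus a backreference. That emulation and the helper name matched_text are assumptions of this example.

```python
import re

def matched_text(pattern, text, group=0):
    """Return the text of the requested group, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(group) if m else None

# Greedy quantifiers backtrack: r* / r+ take every "r", then give one back.
GREEDY_STAR = r"(r*)r"
GREEDY_PLUS = r"(r+)r"

# Emulated possessive r*+ : the lookahead captures the whole run of "r",
# \1 consumes exactly that run, and the engine cannot give any of it back.
POSSESSIVE_STAR = r"(?=(r*))\1r"

for word in ["narwhal", "narrative", "narrrp"]:
    print(
        word,
        matched_text(GREEDY_STAR, word),      # full match
        matched_text(GREEDY_STAR, word, 1),   # capture group 1: one "r" shorter
        matched_text(GREEDY_PLUS, word),
        matched_text(POSSESSIVE_STAR, word),  # never matches
    )

# The "abcorgi" variants: greedy, lazy, and emulated possessive.
print(matched_text(r".*corgi", "abcorgi"))
print(matched_text(r".*?corgi", "abcorgi"))
print(matched_text(r"(?=(.*))\1corgi", "abcorgi"))
```

On Python 3.11 or newer, writing the possessive patterns directly as r"r*+r" and r".*+corgi" should give the same results as the emulation.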

Alaina Kafkes • Nov 21 '17 • Edited

Thank you for taking the time to write out these examples! You're correct that I have typos in two of the string examples I gave (which I'll update now).

my understanding and experience thus far is that, assuming no flags, /+/ and /*/ are both greedy and NOT possessive in isolation; it is specifically appending + to itself or a /*/ that creates the non-backtracking

You taught me something new today, thank you. :)

Bob van Hoove • Nov 20 '17

Thanks for writing :) I really loved this subject when I was in college and this brings back some of that excitement.

I found it amazing what Chomsky had achieved. Especially how his theory affected the Behaviorist school of psychology.

The third reason for rejecting behaviorism is connected with Noam Chomsky.
Chomsky has been one of behaviorism's most successful and damaging critics. In a
review of Skinner's book on verbal behavior (see above), Chomsky (1959) charged
that behaviorist models of language learning cannot explain various facts about
language acquisition, such as the rapid acquisition of language by young children,
which is sometimes referred to as the phenomenon of “lexical explosion.”

source

Another takeaway I got from the Chomsky hierarchy is that if you are looking for patterns in a text that are described by a grammar with recursive replacement rules (nested brackets, for example), regexes are not going to cut it. You'll need a parser.
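
A small sketch of that point, using balanced parentheses as the example language. The pattern and function names are made up for illustration. Python's re module has no recursion, so a pattern can only hard-code a fixed nesting depth, while a few lines of recursive descent handle any depth.

```python
import re

# A regex has to spell out each nesting level; this one allows at most two.
TWO_LEVELS = re.compile(r"(?:\((?:\(\))*\))*")

def balanced_regex(text):
    """True if text is balanced parentheses nested at most two levels deep."""
    return TWO_LEVELS.fullmatch(text) is not None

def balanced_parser(text):
    """Recursive-descent recognizer for the grammar S -> "(" S ")" S | empty."""
    def parse(pos):
        # Parse one S starting at pos and return the position just after it.
        while pos < len(text) and text[pos] == "(":
            pos = parse(pos + 1)              # the nested S
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
        return pos

    try:
        return parse(0) == len(text)
    except ValueError:
        return False

for sample in ["()", "(())", "((()))", "(()"]:
    print(sample, balanced_regex(sample), balanced_parser(sample))
```

The two functions agree until the third level of nesting, where only the parser keeps up. Adding another level to the regex just moves the limit one step further out.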

Daniel Golant • Dec 1 '17

"This affirmed my belief that all fields — even those that appear disparate from computer science — have something to offer to computing and the tech industry." -YES! Innovation and progress happen at the intersection of science/technology and other disciplines, it's important to stay well-rounded and have outside interests

Alaina Kafkes • Dec 3 '17

Couldn't agree more! 🎉🎉🎉

daveclarke • Nov 27 '17

You missed the best book on regexes, Mastering Regular Expressions (shop.oreilly.com/product/978059652...). Still one of the best tech books I’ve read.

hachi8833 • Dec 3 '17

I'd like to translate the article dev.to/alainakafkes/exploring-the-... into Japanese and publish it on our tech blog techracho.bpsinc.jp/ to share it, if that's OK with you.

If so, I will make sure to include a link to the original, along with the title and the author's name.

Best regards,

Alaina Kafkes • Dec 3 '17

Hi Shozo! You can translate this blog post into Japanese. Please do give me credit and share a link with me when you're finished. ☺️

hachi8833 • Dec 27 '17

I published the Japanese translation: techracho.bpsinc.jp/hachi8833/2017...
Thank you for your kindness!

hachi8833 • Dec 3 '17

Thank you for the permission! Sure, I will do that.

Lavinia • Nov 28 '17

It's weird, I was just talking to a friend about how I simply don't understand regex. This post is really helpful, but still confusing: Spanish is my native language, so the grammar is really different. Still, this post gave me the idea to investigate it more. Thank you so much!

Michael Minshew • Mar 2 '18

Awesome example of the value of cross-disciplinary study, both for practical use and for fun. Loved the article; I thought it was well written, and it is possibly the first linguistics-based look at a CS concept I've read. Awesome!

Alaina Kafkes • Mar 2 '18

Thank you! ☺️

Andrea Chiarelli • Nov 21 '17

Great article!
I love "interdisciplinary contaminations". Chomsky is a wonderful example!

By the way, for those that are uncomfortable with standard regex syntax I suggest VerbalExpressions (github.com/VerbalExpressions)

Donald Merand • Mar 6 '18

Thanks for writing this! I've got folks on my team who code, but don't have an academic computer science background. Approachable articles that cover the theoretical (especially linguistic!) underpinnings of CS are very hard to come by. I shared this with my team, and they all loved learning about it.

Alaina Kafkes • Mar 8 '18

Thank you! I strive to write without assuming a lot of CS knowledge so this is very high praise. I'm happy that your team enjoyed it ☺️

Lauri Elias • Nov 27 '17

Had all this in Theoretical Computer Science.

Maria Michou • Nov 27 '17

It's always fascinating to dig out memories/knowledge from the past (i.e. university course) and trigger one to read more about the subject. Great post!

Jérémy Lal • Jan 19 '18

And what about the xkcd copyright?

Alaina Kafkes • Jan 21 '18

Thanks for pointing this out! I couldn't figure out how to caption a cover image, so I added image credits at the end of my post.

doug descombaz • Jan 20 '18

Great info. Inspires me to follow up with more reading on the connection with linguistics.

dave-shawley • Jan 23 '18

Thank you for writing this. It reminded me what I miss about theoretical computer science after 15+ years in the field. It is well written and you didn’t even mention DFAs or Thompson’s construction 😜

Alaina Kafkes • Nov 29 '17

Wow, I do like this article! Thanks 🎉