DEV Community

Exploring the Linguistics Behind Regular Expressions

Alaina Kafkes on November 20, 2017

Regular expressions inspire fear in new and experienced programmers alike. When I first saw a regular expression — often abbreviated as “regex” —...
 
Max Cerrina • Edited

I love it :D I'm a regex fiend and I never get tired of them.

The one thing I wanted to mention (and I might just be misreading), because it took me a long time to figure out when I first learned about it: as far as I understand, assuming no flags, /+/ and /*/ are both greedy and NOT possessive in isolation; it is specifically appending a + to /+/ or /*/ (giving /++/ or /*+/) that creates the non-backtracking behavior.

i.e.:
/r*r/ will match "narwhal" and "narrative". It will also match the contrived "narrrrrrrp". In each case the r* grabs every "r" it can and then gives one back so that the final r has something to match.

Similarly, as expected, /r+r/ will not match "narwhal" but will match "narrative". Same for "narrrrrrrp". (I feel like I missed a pirate joke opportunity...) If you try "narrrp" you can see that it is not possessive: it matches, whereas a possessive quantifier would keep every "r" and never backtrack to free up that last "r".

/r*+r/ will match none of those, no matter how long a pirate noise you make.

I played around with it some on Regex101, adding a capture group just to make it clearer. With /*/ and /+/, capture group 1 always holds one fewer "r" than the full match, until the match flat-out fails with /*+/. Same for "abcorgi": /.*corgi/ and /.+corgi/ both match "abcorgi", and so does the lazy /.*?corgi/, because matching still starts at the leftmost position; I believe /.?corgi/ matches only "bcorgi", and /.*+corgi/ fails. That final version says "take any character as an option, as many as possible, and don't give any back", which is going to stop just about everything that comes after it.
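
For anyone who wants to check these cases outside Regex101, here is a minimal Python sketch. The helper name matched_text and the word list are made up for illustration, and the possessive part assumes Python 3.11 or newer, since older versions of the re module reject /*+/ as a syntax error.

```python
import re
import sys


def matched_text(pattern, text):
    """Return the substring that pattern matches in text, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Hypothetical sample words, taken from the examples in the comment above.
WORDS = ["narwhal", "narrative", "narrrp"]

# Greedy r* takes every "r" it can, then backtracks one so the final r can match.
greedy_star = {w: matched_text(r"r*r", w) for w in WORDS}
# → {'narwhal': 'r', 'narrative': 'rr', 'narrrp': 'rrr'}

# Greedy r+ must keep at least one "r", so the lone "r" in "narwhal" is not enough.
greedy_plus = {w: matched_text(r"r+r", w) for w in WORDS}
# → {'narwhal': None, 'narrative': 'rr', 'narrrp': 'rrr'}

# Lazy .*? still matches from the leftmost position; .? can skip the leading "a".
lazy_corgi = matched_text(r".*?corgi", "abcorgi")      # → 'abcorgi'
optional_corgi = matched_text(r".?corgi", "abcorgi")   # → 'bcorgi'

# Possessive quantifiers: only attempted on Python 3.11+, otherwise left empty.
possessive = {}
if sys.version_info >= (3, 11):
    for w in WORDS:
        # r*+ keeps all the r's and never gives one back, so the final r cannot match.
        possessive[w] = matched_text(r"r*+r", w)
    # .*+ swallows the whole string, leaving nothing for "corgi".
    possessive["abcorgi"] = matched_text(r".*+corgi", "abcorgi")
    # → every value in possessive is None
```

Other flavors differ: JavaScript's built-in regexes, for example, have no possessive quantifiers at all, so it is worth checking which flavor is selected before comparing results.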

Alaina Kafkes • Edited

Thank you for taking the time to write out these examples! You're correct that I have typos in two of the string examples I gave (which I'll update now).

my understanding and experience thus far is that, assuming no flags, /+/ and /*/ are both greedy and NOT possessive in isolation; it is specifically appending + to itself or a /*/ that creates the non-backtracking

You taught me something new today, thank you. :)

Bob van Hoove

Thanks for writing :) I really loved this subject when I was in college and this brings back some of that excitement.

I found it amazing what Chomsky had achieved, especially how his theory affected the Behaviorist school of psychology.

The third reason for rejecting behaviorism is connected with Noam Chomsky.
Chomsky has been one of behaviorism's most successful and damaging critics. In a
review of Skinner's book on verbal behavior (see above), Chomsky (1959) charged
that behaviorist models of language learning cannot explain various facts about
language acquisition, such as the rapid acquisition of language by young children,
which is sometimes referred to as the phenomenon of “lexical explosion.”

source

Another takeaway I got from the Chomsky hierarchy is that if you are looking for patterns in a text that are described by a grammar (replacement rules), regexes are not going to cut it. You'll need a parser.
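
A minimal sketch of that point, assuming the textbook grammar S -> ( S ) S | empty for balanced parentheses. Regular expressions in the formal, Chomsky-hierarchy sense cannot describe this language because it needs unbounded nesting, while a few lines of recursive descent can. The function names parse_s and is_balanced are hypothetical, chosen only for this illustration.

```python
def parse_s(text, pos=0):
    """Parse S -> '(' S ')' S | empty starting at pos.

    Returns the position just after the parsed S.
    Raises ValueError when an opening parenthesis has no matching close.
    """
    while pos < len(text) and text[pos] == "(":
        pos = parse_s(text, pos + 1)  # the S nested inside the parentheses
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        pos += 1  # consume ')' and loop to handle the trailing S
    return pos


def is_balanced(text):
    """Return True if text consists solely of balanced parentheses."""
    try:
        return parse_s(text) == len(text)
    except ValueError:
        return False


print(is_balanced("(()())"))  # → True
print(is_balanced("(()"))     # → False
print(is_balanced(")("))      # → False
```

The recursion mirrors the replacement rule directly: each call handles one S, and the nesting depth of the calls is the memory that a finite automaton lacks.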

daveclarke

You missed the best book on regexes, Mastering Regular Expressions shop.oreilly.com/product/978059652.... Still one of the best tech books I’ve read.

Daniel Golant

"This affirmed my belief that all fields — even those that appear disparate from computer science — have something to offer to computing and the tech industry." -YES! Innovation and progress happen at the intersection of science/technology and other disciplines. It's important to stay well-rounded and have outside interests.

Alaina Kafkes

Couldn't agree more! 🎉🎉🎉

hachi8833

I'd like to translate the article dev.to/alainakafkes/exploring-the-... into Japanese and publish it on our tech blog techracho.bpsinc.jp/ to share it, if that's OK with you.

In that case, I will make sure to include a link to the original, the title, and the author's name.

Best regards,

Alaina Kafkes

Hi Shozo! You can translate this blog post into Japanese. Please do give me credit and share a link with me when you're finished. ☺️

hachi8833

Published the JP translation techracho.bpsinc.jp/hachi8833/2017...
Thank you for your kindness!

hachi8833

Thank you for the permission! Sure, I will do that.

Michael Minshew

Awesome example of the value of cross-disciplinary study, both practical and fun. Loved the article; I thought it was well written, and it's possibly the first linguistics-based look at a CS concept I've read. Awesome!

Alaina Kafkes

Thank you! ☺️

Donald Merand

Thanks for writing this! I've got folks on my team who code, but don't have an academic computer science background. Approachable articles that cover the theoretical (especially linguistic!) underpinnings of CS are very very hard to come by. I shared this with my team + they all loved learning about it.

Alaina Kafkes

Thank you! I strive to write without assuming a lot of CS knowledge so this is very high praise. I'm happy that your team enjoyed it ☺️

Andrea Chiarelli

Great article!
I love "interdisciplinary contaminations". Chomsky is a wonderful example!

By the way, for those that are uncomfortable with standard regex syntax I suggest VerbalExpressions (github.com/VerbalExpressions)

Lavinia

It's weird: I was just talking to a friend about how I simply don't understand regex. This post is really helpful but still confusing; Spanish is my native language, so the grammar is really different. But this post gave me the idea to investigate it more. Thank you so much!

Lauri Elias

Had all this in Theoretical Computer Science.

Maria Michou

It's always fascinating to dig out memories/knowledge from the past (e.g. a university course) and be prompted to read more about the subject. Great post!

Jérémy Lal

And what about the xkcd copyright?

Alaina Kafkes

Thanks for pointing this out! I couldn't figure out how to caption a cover image, so I added image credits at the end of my post.

doug descombaz

Great info. Inspires me to follow up with more reading on the connection with linguistics.

dave-shawley

Thank you for writing this. It reminded me what I miss about theoretical computer science after 15+ years in the field. It is well written and you didn’t even mention DFAs or Thompson’s construction 😜

Alaina Kafkes

Wow, I do like this article! Thanks 🎉