Exploring the Linguistics Behind Regular Expressions

Alaina Kafkes on November 20, 2017

Regular expressions inspire fear in new and experienced programmers alike. When I first saw a regular expression — often abbreviated as “regex” —...
 

I love it :D I'm a regex fiend and I never get tired of them.

The one thing I wanted to mention, and it might just be I'm misreading, but I know it took me a long time to figure out when I first learned about it so I always like to clarify it: my understanding and experience thus far is that, assuming no flags, /+/ and /*/ are both greedy and NOT possessive in isolation; it is specifically appending + to itself or a /*/ that creates the non-backtracking.

i.e.:
/r*r/ will match "narwhal" and "narrative". It will also match the contrived "narrrrrrrp". In each case /r*/ first grabs every "r" in the run and then gives one back so that the final /r/ can match.

Similarly, as expected, /r+r/ will not match "narwhal" but will match "narrative". Same for "narrrrrrrp". (I feel like I missed a pirate joke opportunity...) If you try "narrrp", you can see that /r+/ is not possessive: the pattern matches, which it could not do if /r+/ kept every "r" and refused to backtrack for that last "r".

/r*+r/ will match none of those, no matter how long a pirate noise you make.
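For anyone who wants to try the three patterns outside Regex101, here is a minimal Python sketch. It rests on a few assumptions: Python's `re` module only accepts possessive quantifiers such as `*+` from version 3.11 onward, so the hypothetical helper `compile_possessive` falls back to a lookahead-plus-backreference emulation on older versions. The helper names and the word list are illustrative, not anything from the article.

```python
import re


def compile_possessive(pattern, fallback):
    """Compile `pattern`; if this Python's re rejects the possessive syntax
    (versions before 3.11), compile the lookahead-based `fallback` instead."""
    try:
        return re.compile(pattern)
    except re.error:
        return re.compile(fallback)


PATTERNS = {
    "r*r": re.compile(r"r*r"),   # greedy, may backtrack
    "r+r": re.compile(r"r+r"),   # greedy, may backtrack
    # Possessive: r*+ keeps every "r" it consumed. The fallback captures the
    # run inside a lookahead (which is never re-entered) and replays it.
    "r*+r": compile_possessive(r"r*+r", r"(?=(r*))\1r"),
}

WORDS = ["narwhal", "narrative", "narrrp", "narrrrrrrp"]


def matched_text(regex, word):
    """Return the substring the pattern matched, or None if it failed."""
    m = regex.search(word)
    return m.group(0) if m else None


results = {
    word: {name: matched_text(regex, word) for name, regex in PATTERNS.items()}
    for word in WORDS
}

for word, row in results.items():
    print(word, row)
```

The greedy patterns succeed by handing one "r" back to the trailing /r/, while the possessive pattern comes back as `None` for every word, because nothing is left for that trailing /r/.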

I played around with it some on Regex101--I added a capture group just to make it clearer. With /*/ and /+/, capture group 1 always holds one "r" fewer than the full match--until the match flat-out fails with /*+/. Same for "abcorgi": /.*corgi/ and /.+corgi/ both match "abcorgi", the lazy /.*?corgi/ also matches "abcorgi" (a lazy quantifier still starts at the leftmost position, so it cannot skip the "a"), and /.*+corgi/ fails. The final version is "take any character as an option, as many as possible, and don't give any back"--which is going to stop just about everything else.
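The capture-group experiment and the "abcorgi" comparison can be reproduced with a short Python sketch. As assumptions: the helper names are hypothetical, and because Python's `re` only accepts `.*+` from version 3.11 onward, the possessive pattern falls back to a lookahead-based emulation on older versions.

```python
import re


def compile_possessive(pattern, fallback):
    """Use native possessive syntax when available (Python 3.11+),
    otherwise an equivalent lookahead-plus-backreference pattern."""
    try:
        return re.compile(pattern)
    except re.error:
        return re.compile(fallback)


def matched_text(regex, text):
    """Return the full matched substring, or None when there is no match."""
    m = regex.search(text)
    return m.group(0) if m else None


# Capture group 1 holds what the quantifier kept after giving one "r" back.
m = re.search(r"(r*)r", "narrative")
group_one, full_match = m.group(1), m.group(0)

CORGI_PATTERNS = {
    "greedy .*corgi": re.compile(r".*corgi"),
    "greedy .+corgi": re.compile(r".+corgi"),
    "lazy .*?corgi": re.compile(r".*?corgi"),
    "possessive .*+corgi": compile_possessive(r".*+corgi", r"(?=(.*))\1corgi"),
}

corgi_results = {
    name: matched_text(regex, "abcorgi") for name, regex in CORGI_PATTERNS.items()
}

print(repr(group_one), repr(full_match))
for name, text in corgi_results.items():
    print(name, "->", text)
```

In this sketch the lazy form returns the same substring as the greedy forms, since the search still begins at the first character. Only the possessive form fails, because `.*+` consumes "corgi" as well and never releases it.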

 

Thank you for taking the time to write out these examples! You're correct that I have typos in two of the string examples I gave (which I'll update now).

my understanding and experience thus far is that, assuming no flags, /+/ and /*/ are both greedy and NOT possessive in isolation; it is specifically appending + to itself or a /*/ that creates the non-backtracking

You taught me something new today, thank you. :)

 

Thanks for writing :) I really loved this subject when I was in college and this brings back some of that excitement.

I found it amazing what Chomsky had achieved, especially how his theory affected the Behaviorist school of psychology.

The third reason for rejecting behaviorism is connected with Noam Chomsky.
Chomsky has been one of behaviorism's most successful and damaging critics. In a
review of Skinner's book on verbal behavior (see above), Chomsky (1959) charged
that behaviorist models of language learning cannot explain various facts about
language acquisition, such as the rapid acquisition of language by young children,
which is sometimes referred to as the phenomenon of “lexical explosion.”

source

Another takeaway I got from the Chomsky hierarchy is that if you are looking for patterns in a text that are described by a grammar (replacement rules), regexes are not going to cut it. You'll need a parser.
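A small illustration of that point, as a sketch: nested parentheses can be described by the replacement rules S -> "(" S ")" S | empty, which is a context-free grammar and not a regular one. The grammar and the function name `is_nested` are assumptions picked for the example, not something from the article. A few lines of recursive descent are enough to recognize it:

```python
def is_nested(text):
    """Return True if `text` is derivable from S -> "(" S ")" S | empty."""

    def parse_s(pos):
        # Consume as many "( S )" groups as possible starting at `pos`
        # and return the index just past the last one.
        while pos < len(text) and text[pos] == "(":
            pos = parse_s(pos + 1)          # the inner S
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at index {pos}")
            pos += 1                        # the closing ")"
        return pos

    try:
        # The whole input must be consumed for the parse to succeed.
        return parse_s(0) == len(text)
    except ValueError:
        return False


for sample in ["", "()", "(()())", "(()", ")("]:
    print(repr(sample), is_nested(sample))
```

The recursion is what lets the parser remember how many parentheses are still open at any depth, and that unbounded memory is what a classical regular expression lacks.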

 
 

You missed the best book on regexes, Mastering Regular Expressions shop.oreilly.com/product/978059652.... Still one of the best tech books I’ve read.

 

"This affirmed my belief that all fields — even those that appear disparate from computer science — have something to offer to computing and the tech industry." -YES! Innovation and progress happen at the intersection of science/technology and other disciplines, it's important to stay well-rounded and have outside interests

 
 

I'd like to translate the article dev.to/alainakafkes/exploring-the-... into Japanese and publish it on our tech blog techracho.bpsinc.jp/ to share it, if that's OK with you.

I will make sure to include the link to the original, the title, and the author's name.

Best regards,

 

Hi Shozo! You can translate this blog post into Japanese. Please do give me credit and share a link with me when you're finished. ☺️

 
 
 

It's weird, I was just talking to a friend about how I simply don't understand regex. This post is really helpful, but still confusing: Spanish is my native language, so the grammar is really different. Still, this post gave me the idea to invest more time in it. Thank you so much!

 

Thanks for writing this! I've got folks on my team who code, but don't have an academic computer science background. Approachable articles that cover the theoretical (especially linguistic!) underpinnings of CS are very very hard to come by. I shared this with my team + they all loved learning about it.

 

Thank you! I strive to write without assuming a lot of CS knowledge so this is very high praise. I'm happy that your team enjoyed it ☺️

 

Awesome example of the value of cross-disciplinary study, both practical and fun. Loved the article; I thought it was well written, and it's possibly the first linguistics-based look at a CS concept I've read. Awesome!

 
 

Great article!
I love "interdisciplinary contaminations". Chomsky is a wonderful example!

By the way, for those that are uncomfortable with standard regex syntax I suggest VerbalExpressions (github.com/VerbalExpressions)

 

Great info. Inspires me to follow up with more reading on the connection with linguistics.

 
 

It's always fascinating to dig out memories/knowledge from the past (i.e. university course) and trigger one to read more about the subject. Great post!

 

Thank you for writing this. It reminded me what I miss about theoretical computer science after 15+ years in the field. It is well written and you didn’t even mention DFAs or Thompson’s construction 😜

 
 

Thanks for pointing this out! I couldn't figure out how to caption a cover image, so I added image credits at the end of my post.
