Kelly Stannard

Posted on Jul 30, 2020

Why not regexp?

#beginners #design

I have been using regexp daily for 8 years now and I have opinions. The most controversial one is that writing regexp to do things in production code is like using dynamite to take care of an ant hill.

I often see regexp used because it is so versatile and can perform very complex operations. But regexp is hard to reason about and therefore dangerous. You need a clear head, mental focus, and years and years of experience to use it, and even then you may still blow off your arm.

Bottom line: in prod, regexp should be a tool of last resort rather than the first thing you reach for.

Given that, where exactly am I using regexp daily? I just use it locally for development in the safety of source control where I can easily revert my mistakes. :)

Top comments (5)

mattother • Aug 4 '20

I would both agree and disagree. I think regex is the right tool for the job in a lot of cases. However, I think it's common to abuse regex, as you alluded to, and it's also common to misunderstand what regex is suited for. I think it's the perfect example of a leaky abstraction.

To me regex is inherently bad at representing sub-structures in complex cases. And the more conditional permutations that exist within the matching domain, the harder it becomes to write a valid regex statement (and even harder to write a clear regex statement).

Part of the issue is there's no clarity of the business logic handled by a regex statement, so if it does too much, it becomes unclear what cases it handles. Tests can help, but personally I find handling a simpler domain is usually a better solution.

Also, like most programming examples, regex examples are usually not production ready and are not really representative of reality. And since most regex resources are out of date, they generally don't address Unicode (or emoji).

URLs are a good example. Matching general URLs with regex is a terrible idea in my opinion, for two reasons: one, what constitutes a valid URL is often misunderstood, and two, there are generally already better resources available for doing this.

Case in point: the pattern from urlregex.com, 'The Perfect URL Regular Expression', doesn't even function properly.

Here's the Python version of the regex:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

And here's a simple example where it breaks (with http:// in front, the match stops at the ö, since the pattern does not handle Unicode):

test.ca/so/önicode
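To make the failure concrete, here is a minimal Go sketch that runs the quoted pattern through the standard regexp package. The http:// prefix on the input and the helper name matchURL are assumptions added for illustration; the pattern itself is copied unchanged.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlPattern is the pattern quoted above, copied unchanged.
var urlPattern = regexp.MustCompile(`http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+`)

// matchURL (hypothetical helper) returns the leftmost match, or "" if none.
func matchURL(s string) string {
	return urlPattern.FindString(s)
}

func main() {
	// The http:// prefix is an assumption; the pattern requires a scheme.
	fmt.Println(matchURL("http://test.ca/so/önicode"))
}
```

The match comes back as http://test.ca/so/, so the pattern does not reject the URL; it quietly hands back a truncated one, which is arguably worse.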

Here's another example I've seen come up quite a bit in projects. Somebody will write something like the following to match the domain:

/^https?:\/\/([a-z.]+)/

This is obviously a terrible regex statement, but hopefully you get the idea. However, in the case below (again with http:// in front) you will get the wrong domain.

test.com@supertest.com/something

The actual domain for this should be supertest.com, because everything before the @ is user info rather than the host.
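A rough Go sketch of that mistake, assuming the input carries an http:// scheme. The helper name naiveDomain is made up for illustration, and the slashes are left unescaped because Go patterns have no / delimiters.

```go
package main

import (
	"fmt"
	"regexp"
)

// domainPattern is the naive pattern quoted above.
var domainPattern = regexp.MustCompile(`^https?://([a-z.]+)`)

// naiveDomain (hypothetical helper) returns the first capture group,
// or "" when the pattern does not match at all.
func naiveDomain(rawURL string) string {
	m := domainPattern.FindStringSubmatch(rawURL)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	// The http:// prefix is an assumption; the pattern requires a scheme.
	fmt.Println(naiveDomain("http://test.com@supertest.com/something"))
}
```

The capture stops at the @ and yields test.com, which is exactly the part of the URL an attacker gets to choose freely.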

But again, why not just use the resources already available to you? Pretty much every language has a means for parsing urls correctly.

Ex.

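// Assumes imports: fmt, log, net/url.
u, err := url.Parse("http://bing.com@another.com/search?q=dotnet")
if err != nil {
    log.Fatal(err)
}
// Everything before the @ is user info, so Host is the part after it.
host := u.Host
fmt.Println(host) // another.com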

It's also much trickier to determine the runtime cost of a regex statement, since different inputs can take drastically different amounts of time to match, and this can also lead to security issues (regex DoS, etc.).

Personally, I don't see regex as any different than many other aspects of programming. It's just another area littered with simple examples that don't prepare people for real-world cases. And because of its flexibility it's a perfect candidate for golden hammer bias.

But it's no different than other common areas of programming that are handled incorrectly (ex. service request calls without circuit breakers, etc.).

There are also fewer resources available for alternatives. One book that I personally found incredibly useful for better understanding string tokenization, etc. was Language Implementation Patterns (pragprog.com/titles/tpdsl/). It's geared towards language parsing, but I found the techniques it outlines very useful for better understanding string parsing in general. I've come to use the strategies it outlines more often than I would have expected.

So to summarize, I don't think regex should be avoided. It's a useful tool and the right tool in a lot of cases. But I agree you should actively question whether it's the right tool and whether the problem would be best broken down into smaller parts that can then be regexed.
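As one possible illustration of breaking a problem into smaller parts, the Go sketch below lets net/url handle the URL structure and only then applies a small anchored regex to the path. The /users/<digits> route, the example.com URLs, and the helper name userID are all hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// userPath matches a hypothetical route shape: /users/<digits>.
var userPath = regexp.MustCompile(`^/users/([0-9]+)$`)

// userID (hypothetical helper) parses the URL first, then applies a
// small anchored regex to the path only.
func userID(rawURL string) (string, bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", false
	}
	m := userPath.FindStringSubmatch(u.Path)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	id, ok := userID("https://example.com/users/42?tab=posts")
	fmt.Println(id, ok)

	_, ok = userID("https://example.com/users/42/edit")
	fmt.Println(ok)
}
```

The regex never sees the scheme, host, or query string, so the set of inputs it has to get right is far smaller and easier to test.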

And as a sub-note, the language used can be a factor too. Languages like Elixir have awesome pattern matching abilities that can go a long way to solving problems that could also be solved with regex. Ex.

defmodule MyMod do

    def thing("this" <> second), do: thing_valid(second) 
    def thing("that" <> second), do: thing_valid(second) 
    def thing(_)               , do: IO.puts "did not have this or that."

    defp thing_valid(rest) do
        IO.puts "yay! #{rest}"
    end

end

MyMod.thing("this thing") # yay!  thing
MyMod.thing("ducks") # did not have this or that.
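In a language without that kind of pattern matching, plain prefix checks can cover the same case. Here is a rough Go equivalent using only the strings package; it returns the message instead of printing it so it is easier to test, which is a deliberate difference from the Elixir version.

```go
package main

import (
	"fmt"
	"strings"
)

// thing mirrors the Elixir clauses above using plain prefix checks.
func thing(s string) string {
	for _, prefix := range []string{"this", "that"} {
		if strings.HasPrefix(s, prefix) {
			// Keep whatever follows the prefix, like `second` in Elixir.
			return "yay! " + strings.TrimPrefix(s, prefix)
		}
	}
	return "did not have this or that."
}

func main() {
	fmt.Println(thing("this thing"))
	fmt.Println(thing("ducks"))
}
```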

Anyways, thanks for sharing your thoughts! Definitely a question worth pondering.

Kelly Stannard • Aug 13 '20

I really appreciate your distinction of use cases. That was very clarifying for me.

The reason I say it should be a last resort in prod is that you want to do the safe thing in prod. Because strings can be arbitrarily complex, it is nearly impossible to consider all possible inputs to your regexp, and the inputs you failed to consider are bugs in waiting.

mattother • Aug 24 '20

Yeah, that's very true, I do agree.

My only issue is that I don't think avoidance really solves the problem. Personally I feel the development process (ex. things like code review, testing) should protect developers from making these mistakes, since eventually you are going to be in a situation where they occur.

But it's definitely situationally dependent, so I can absolutely understand where you're coming from.

And, like you said, if there's another option available that's safer, why not just do that instead?

Ben Sinclair • Jul 30 '20

Hot take: regexp isn't more or less difficult than async and callbacks in JavaScript.
You can do it the "too clever for your own good" way and make terse, unreadable code, or you can lay it out in a more straightforward manner if you choose to.

Kelly Stannard • Aug 1 '20

Thanks Ben. I don't think I can agree with that assessment. I may be under-informed in this area, but I have never heard of JavaScript callback hell creating a security vulnerability the way regexp regularly does. I have also seen plenty of devs (myself included) mess up what should have been simple and straightforward text matching cases.
