Discussion on: Why not regexp?

I would both agree and disagree. I think regex is the right tool for the job in a lot of cases. However, I think it's common to abuse regex, as you alluded to, and it's also common to misunderstand what regex is suited for. I think it's the perfect example of a leaky abstraction.

To me regex is inherently bad at representing sub-structures in complex cases. And the more conditional permutations that exist within the matching domain, the harder it becomes to write a valid regex statement (and even harder to write a clear regex statement).

Part of the issue is that a regex statement doesn't make the business logic it handles clear, so if it does too much, it becomes unclear which cases it covers. Tests can help, but personally I find that handling a simpler domain is usually a better solution.

Also, like most programming examples, regex examples are usually not production ready and are really not representative of reality. And since most regex resources are out of date, few of them address Unicode (or emoji).

URLs are a good example. Matching general URLs with regex is a terrible idea in my opinion, for two reasons: one, what constitutes a valid URL is often misunderstood, and two, there are generally already better resources available for doing this.

Case in point: 'The Perfect URL Regular Expression' from urlregex.com doesn't even function properly.

Here's the Python matching regex:

http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+

And here's a simple example where it breaks (since it does not handle Unicode):

test.ca/so/önicode
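To make the failure concrete, here is a minimal Python sketch that runs the regex above against that URL. The `http://` scheme in front of the URL is my assumption (the regex requires one), and the names `URL_RE` and `regex_urls` are just for illustration.

```python
import re
from urllib.parse import urlsplit

# The pattern quoted above, copied verbatim.
URL_RE = re.compile(
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)


def regex_urls(text):
    """Return every substring the pattern considers a URL."""
    return URL_RE.findall(text)


# Assumption: the example URL is given with an explicit http:// scheme.
url = "http://test.ca/so/önicode"

# The match stops right before the first non-ASCII character.
print(regex_urls(url))     # ['http://test.ca/so/']

# The standard library parser keeps the whole path.
print(urlsplit(url).path)  # /so/önicode
```

The regex does not reject the URL, it silently truncates it, which is arguably worse.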

Here's another example I've seen come up quite a bit in projects. Somebody will write something like below to match the domain:

/^https?:\/\/([a-z.]+)/

This is obviously a terrible regex statement, but hopefully you get the idea. With the URL below, though, you will get the wrong domain: the regex captures test.com, but everything before the @ is user info, not the host.

test.com@supertest.com/something

The actual domain for this should be supertest.com.

But again, why not just use the resources already available to you? Pretty much every language has a means for parsing URLs correctly.

For example, in Go with the standard net/url package:

u, err := url.Parse("http://bing.com@another.com/search?q=dotnet")
if err != nil {
    log.Fatal(err)
}
host := u.Host
fmt.Println(host) // another.com
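The same comparison can be sketched in Python with `urllib.parse` from the standard library. The helper names `naive_host` and `parsed_host` are made up for this example, and the `http://` scheme on the test URL is my assumption.

```python
import re
from urllib.parse import urlsplit

# The naive domain regex from above, without the JavaScript-style delimiters.
NAIVE_HOST_RE = re.compile(r"^https?://([a-z.]+)")


def naive_host(url):
    """Host according to the naive regex, or None if it does not match."""
    match = NAIVE_HOST_RE.match(url)
    return match.group(1) if match else None


def parsed_host(url):
    """Host according to the standard library URL parser."""
    return urlsplit(url).hostname


# Assumption: the example URL is given with an explicit http:// scheme.
url = "http://test.com@supertest.com/something"

print(naive_host(url))   # test.com (this is actually the user info)
print(parsed_host(url))  # supertest.com
```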

It's also much trickier to determine the runtime cost of a regex statement, since different inputs can lead to drastically different running times. This can also lead to security issues (regular expression denial of service, a.k.a. ReDoS, etc.).
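As a rough illustration of how input-dependent the cost can be, here is a small Python sketch using a textbook backtracking-prone pattern, `(a+)+$`. The pattern and the function name `time_failed_match` are my own choices for the demo, not something from the post, and the exact timings will vary by machine.

```python
import re
import time

# Nested quantifiers: a classic shape that backtracks heavily on near-misses.
EVIL = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Match n 'a' characters followed by a 'b'; return (match, seconds)."""
    text = "a" * n + "b"  # the trailing 'b' forces the match to fail
    start = time.perf_counter()
    result = EVIL.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Each extra 'a' roughly doubles the work in a backtracking engine.
    for n in (10, 14, 18):
        _, seconds = time_failed_match(n)
        print(f"n={n:2d}  {seconds:.4f}s")
```

Keep `n` small if you try this: in a backtracking engine such as Python's `re`, a few more characters are enough to make the match effectively hang.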

Personally, I don't see regex as any different than many other aspects of programming. It's just another area littered with simple examples that don't prepare people for real-world cases. And because of its flexibility, it's a perfect candidate for golden hammer bias.

But it's no different than other common areas of programming that are handled incorrectly (e.g. service request calls without circuit breakers).

There are also fewer resources available for alternatives. One book that I personally found incredibly useful for better understanding string tokenization, etc. was Language Implementation Patterns (pragprog.com/titles/tpdsl/). It's geared towards language parsing, but I found the techniques it outlines very useful for better understanding string parsing in general. I've come to use the strategies it outlines more often than I would have expected.

So to summarize, I don't think regex should be avoided. It's a useful tool and the right tool in a lot of cases. But I agree you should actively question whether it's the right tool and whether the problem would be best broken down into smaller parts that can then be regexed.
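As one possible reading of "smaller parts that can then be regexed", here is a hedged Python sketch: a real URL parser does the structural work, and a small regex only has to handle one well-defined piece. The `/users/<id>` route and the name `user_id_from_url` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical route: the regex only needs to understand the path.
USER_PATH_RE = re.compile(r"/users/([0-9]+)")


def user_id_from_url(url):
    """Return the numeric user id from an http(s) URL, or None."""
    parts = urlsplit(url)  # the parser handles scheme, host, query, etc.
    if parts.scheme not in ("http", "https"):
        return None
    match = USER_PATH_RE.fullmatch(parts.path)
    return int(match.group(1)) if match else None


print(user_id_from_url("https://example.com/users/42?tab=posts"))  # 42
print(user_id_from_url("https://example.com/teams/42"))            # None
```

The regex stays short enough that it's obvious which cases it covers, which is really the point.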

And as a side note, the language used can be a factor too. Languages like Elixir have awesome pattern matching abilities that can go a long way towards solving problems that could also be solved with regex. For example:

defmodule MyMod do

    def thing("this" <> second), do: thing_valid(second) 
    def thing("that" <> second), do: thing_valid(second) 
    def thing(_)               , do: IO.puts "did not have this or that."

    defp thing_valid(rest) do
        IO.puts "yay! #{rest}"
    end

end

MyMod.thing("this thing") # yay!  thing
MyMod.thing("ducks") # did not have this or that.

Anyways, thanks for sharing your thoughts! Definitely a question worth pondering.

Kelly Stannard • Aug 13 '20

I really appreciate your distinction of use cases. That was very clarifying for me.

The reason I say it should be a last resort in prod is because you want to do the safe thing in prod. Because strings can be infinitely complex, it is nearly impossible to consider all possible inputs to your regexp, and the inputs you failed to consider are bugs in waiting.

mattother • Aug 24 '20

Yeah, that's very true, I do agree.

My only issue is that I don't think avoidance really solves the problem. Personally I feel the development process (e.g. things like code review and testing) should protect developers from making these mistakes, since eventually you are going to be in a situation where they occur.

But it's definitely situationally dependent, so I can absolutely understand where you're coming from.

And, like you said, if there's another option available that's safer, why not just do that instead?