Discussion on: Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

View post

Replies for: But that's because html isn't regular... I don't want to be "that guy", but regular expressions is definitely the lockpick, and this trie-stuff is ...

Whenever you can use re, you should, because it's the absolutely fastest and simplest you can do.

I'll leave this here:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Jamie Zawinski, 1997

Now, I think this is extremized, and regular expressions do have a place in development (and I also like them a lot in general), but there are very critical aspects to consider:

they have never been standardized, which means regexes in Perl are different from regexes in JavaScript, in C# and so on;
moreover, regex engines could be starkly different, as they could be either regex- or text-directed;
they're hard to read, so hard that instead of fixing a complex regular expression one could save time and headaches by rewriting them from scratch;
they're hard to use too, because there are quirks and gotchas that, if not treated correctly, could lead to disastrous performances;
they're also immensely harder to debug, because you can't run step-by-step their execution: they're basically atomic statements.

There a lot of cases where you can use regular expressions, but a lot less where you should use them. I'll also leave this:

Programs must be written for people to read, and only incidentally for machines to execute
Harold Abelson, 1984

(Now that's another hyperbole, but the point is that we have to maintain the code we write, so it's better to make out life easier.)

It's just that many re libraries out there are pretty slow and unintuitive to use...

And that's indeed another aspect to consider: many regex engines are slow to boot but we have to deal with that, because we simply just have no alternatives. And also have awkward APIs, too.

So, should we use them anyway, you say?

Daniel Beecham • Dec 12 '17

Sure, that quote is fun, but it doesn't carry a very strong point.

they have never been standardized, which means regexes in Perl are different from regexes in JavaScript, in C# and so on;

The language and what you can do with it is standardized - unions, differences, kleene stars and so on. The rest is just syntax.

moreover, regex engines could be starkly different, as they could be either regex- or text-directed;

A well designed regex engine is a finite automata. That's it.

they're hard to read, so hard that instead of fixing a complex regular expression one could save time and headaches by rewriting them from scratch;

I don't think this is too hard to read.

they're hard to use too, because there are quirks and gotchas that, if not treated correctly, could lead to disastrous performances;

Maybe.

they're also immensely harder to debug, because you can't run step-by-step their execution: they're basically atomic statements.

Of course you can step-by-step a RE. It's just a finite automata; just step though it.

And that's indeed another aspect to consider: many regex engines are slow to boot but we have to deal with that, because we simply just have no alternatives. And also have awkward APIs, too.

But we do have alternatives. Ragel, as mentioned above, is a really good one. re2 for python is supposedly good. The rust regex is good. Alex is also pretty good.

Massimo Artizzu • Dec 12 '17

The language and what you can do with it is standardized - unions, differences, kleene stars and so on. The rest is just syntax.

Not so easy. The point is that you can't transpose a regular expression from a different language without any second thinking. It means that regular expressions are another layer of programming language that you have to take into consideration.

A well designed regex engine is a finite automata. That's it.

A computer doesn't care what a regex is. You shouldn't either, as it doesn't make any difference. The implementation can be very different and something that should be cared about.

I don't think this is too hard to read.

That's not a regular expression: it's a list of definitions used by a regex engine.

Of course you can step-by-step a RE. It's just a finite automata; just step though it.

Only if you have a library that replicate a regex engine. In many languages, you just use the regular expression engine that's natively implemented, because any non-native solution is usually several orders of magnitude slower, it's basically a reinvented wheel, and it just adds another dependency to the project.

If it's slower and takes additional configuration, the "fastest and simplest you can do" part simply disappears.

But we do have alternatives.

In some languages maybe they're worth considering. But nobody uses a custom regex engine in JavaScript, or in PHP, or in Python, except for very limited cases.