Interestingly, CloudFlare went down for a while yesterday. The post-mortem explains that the whole situation is due to a regex deployed to the WAF globally which caused a CPU overload.
How comes? Well, we don't know because the regexp isn't publicly listed, however the guess is pretty simple to make if you know that regexps have a huge drawback.
Note — This is just a guess!
One typical hard case for regex engines is
a?a?a?aaa. For 3
a like here it's easy but when you reach 20 it's already very complicated. See the following Python code for an example.
The output I get is
re.compile('a?a') = 8.5528998170048e-05 re.compile('a?a?a?a?a?aaaaa') = 0.0009152740021818317 re.compile('a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa') = 0.009123567000642652 re.compile('a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaa') = 0.4090354809995915
The graph (stolen from the article I'm going to link later on in this post) looks like this:
As you can see, it's absolutely not proportional. Actually it becomes slow beyond a dozen question marks and completely crazy around twenty.
So what's happening here? Since the regex engine is backtracking, it will create a different branch every time you have a question mark. It means that it will explore all the different ways that the string could be matching. The complexity of this is factorial!
The good news is that you don't have to make slow regular expressions. There's nothing wrong in the way you write them, it's rather a question of the algorithm behind. You might have heard of Ken Thompson.
He invented a way to do regular expressions now dubbed Thompson NFA. Long story short, they turn the seconds seen before into microseconds!
It's simply a very smart optimization that slashes the complexity into something linear.
It is very long to explain and there is a very good article on the matter that already exists so I'm not going to plagiarize it.
Because some features of modern regular expression engines forbid it. But most of regular expressions don't use those features so modern engines could have two engines including one Thompson-NFA-based when possible to avoid explosive complexity. Again, see the article.
- Do you do something fancy?
- Is it performance-sensitive?
- How much control do you have over the expressions that are inserted?
If you're CloudFlare and that your regexps will be run on zillions of HTTP queries, that they look for super-weird security patterns and that are created by dozens of engineers then you should probably use it.
If you're validating an email in a form, you should probably not care.
Regular expression engines are flawed because their complexity can completely explode above the roof. As CloudFlare showed it, the repercussions can be dramatic.
The good news is that there is alternative engines that will not suffer this kind of issues. On the other hand they are not mainstream so it's a bit complicated to integrate because it will increase the maintenance surface.
It means that you have to know the flaw and every time you use a regexp you need to choose the proper engine for it. Then you might avoid a lot of drama!