How to perform regex search queries over billions of strings

#devto #community #announcement #contributorswanted

But why would you do this??

Let's say, you develop one of the popular cloud services (something like AWS Lambda). Your service is being used by a million clients which created tens of millions of instances of it. Like any other cloud service your service has some structured configs. Configs typically look like a bunch of key value strings. Something like:

...
output_path=s3://my_production_bucket
output_path=s3://my_test_bucket
output_path=s3://blackholepath
...

In total you have billions of such key value strings over all configs. This results in almost a terabyte of key value strings.

An an owner of such a popular service it's your job to maintain configs as well as a service: introduce new parameters, deprecate old ones, monitor how often certain features are used etc. Doing this job would be way easier if you had a tool that allowed you to quickly perform a query: "Show all key value pairs matching ${regular_expression}".

Here I would like to emphasize the importance of quickly show. For manual maintenance tasks, like checking an invariant for all key value pairs, the latency might not be crucial. Mainly because it's developers who are faced with this latency. It becomes way more important when this latency is exposed to clients. It isn't that hard to imagine such situation. For example it could prove useful to validate some global invariants (expressed as number of regular expression matches) on configs that need to be checked before a config is committed to the service by client.

Ok, so what solutions are available out of the box?

There's surely plenty of databases that allow regex match search: [1], [2], etc.

Most of them though don't do it quickly and require the whole scan of database. There some solutions to speed this up like [3] and [4]. But ultimately in the most unconstrained form regular expression search requires a scan of the whole database.

There's a trick though!!

If most of your regex queries are generated by humans they likely have long sub-strings without special characters. Even more so, likely those sub-strings must be a part of the matched string. For example:

.*_path=s3.*
.*id=[0-9]{1,4}order

Every string that matches the first regex must have _path=s3 as a sub-string. Every string that matches the second expression must have both id= and order as sub-strings.

This isn't a bad assumption. If your developers / clients are writing the queries to perform routine / automated search tasks it will likely be true. At the same time it won't work for all of the queries. Hopefully you can design UX around it in a way that the expectation is clear: if the query doesn't conform to optimization requirements the latency won't be good.

In any case if the assumption is true this is great as sub-string indexes are very well supported by databases. At the same time sub-string indices based on n-grams (those are widely supported) can be very naturally distributed (basically you separate sets of n-grams into separate nodes).

Let's look into how much this will reduce your search space approximately. Let's assume you have only a single three letter sub-string that must appear in your query result and that 3-grams are uniformly distributed (this of course isn't true in practice but is ok for a back of an envelope calculation). In this case it'll reduce the search by roughly 36 (letters of alphabet plus digits) to the power of 3. So potentially reducing the search space by 46656 times. This is 4.5 orders of magnitude!

Implementation tips

I won't try and explain in detail on how to implement this here. Instead I leave the links to all the relevant pieces of information.

To get all the sub-strings that must occur in the matched string you need to parse the regular expression into some syntax tree and then analyze it. This thread [5] gives an idea on how to do it.

Building an n-gram index is supported in many databases. If you already have a database you'll likely be able to just configure it there. Here's an example of one [6]. If you don't have one and do not want to depend on yet another database it isn't too hard to implement one based just on SStables [7].

Final thoughts

The general idea here is that looking for patterns in the syntax trees of your regular expression queries can help quite a bit with optimizations of their speed.

Even if some your queries don't have a string that must appear in all of the results maybe there are some other common patterns. For example, queries like (one)|(two) don't have a common sub-string but at the same time they are easily optimizable by the same sub-string n-gram index.

Thanks for reading and let me know what you think about it in the comments!