Discussion on: The algorithm behind Ctrl + F.

View post

Hey Akhil! Amazing article! I had always wondered how is chrome able to find matching strings so fast.
One question, is this type of pattern matching slower or faster than using Regular Expressions (say in python)?

ZaneHannanAU • Jul 8 '20

Regexes are generally functions, with arbitrary rules as to their composition. For example, "all 10 or 11 digit phone numbers starting with +91" might be expressed in regex as "(?:\+91|\(\+91\))[ -]?\d{3}[ -]?\d{3}[ -]?\d{3}", but a compiled function might use many other tricks to find its way through the document, be it a trie (moderately memory intensive) or whatever.

Regexes are just a simplified means of expressing functions, with their own grammatical structure.

Simple byte lookups (especially in an ASCII or ASCII compatible document) are dozens if not hundreds or thousands of times faster than composition of a function like that.

Akhil • Jul 8 '20

Yea, it depends on various factors like how many time regex is being executed etc.

Eg : If your string is 'QABC' and pattern is 'ABC' then the naive algorithm will perform better.

I read somewhere about the progress being made in fast string matching with regex using pattern matching algorithms with them.

ZaneHannanAU • Jul 8 '20

That first one is true in js, but in most languages it's false.

Using regex to find string matches is still quite slow, but does work fairly well. In a compiled language, like rust, c, or go, it will be quite consistent, and have a constant time (unless gc interrupts it).

The short of it is: avoid regexes where possible. There are many premade solutions available.

Akhil • Jul 7 '20

I am not sure about its speed in python, maybe StackOverflow might help with that but overall regex are faster for dynamic situations like finding all phones numbers and algorithms might be faster for finding a particular phone number in a record of million phone numbers.

Sam Wight • Jul 8 '20

It depends on the complexity of the regex you're searching with. Regex engines usually build some sort of state machine from your regex, sort of like a compiler. Then they use that to check against the string. A state-machine based implementation will just be slower because of all of the extra variable changes it's having to do, but with simple enough regexes the compiled state machine might be fairly similar to this.

Regexes also aren't guaranteed to be much faster than a hand-written implementation for things like finding phone numbers, etc. They're pretty darn fast, don't get me wrong, but there might possibly be optimizations that could be made for specific use cases. Regex is just a convenient abstraction. Much like Javascript or Python, it's "just fast enough" for the vast majority of use cases, but for some like these, you need a better implementation for it to be performant.