DEV Community

Akhil


The algorithm behind Ctrl + F.

Ctrl + F in Chrome opens a search box for finding text on a web page, PDF, etc. It's one of the fastest searches I have seen, so I decided to dig deeper into what's going on.

So let's go on a journey of implementing a fast string matching algorithm.

Note: The algorithm we will implement might be similar to the one used in Chrome, but since it's Google we're talking about, they have probably made optimizations.

You might be wondering why we need an algorithm when regular expressions can do the same thing.

Yes, we have regular expressions at our disposal, but they tend to be slow when we task them with finding patterns in large data. Regular expressions are awesome for finding a "dynamic pattern", like all 10-digit phone numbers starting with +91, but in this case we want to find one particular string.

If you want to know more, read here.
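To make the difference concrete, here is a small sketch of the two kinds of search. The helper names `findPhoneNumbers` and `findFixed` are made up for illustration, and the regex assumes "10-digit phone number starting with +91" means "+91" followed by exactly ten digits.

```javascript
// Dynamic pattern: "+91" followed by exactly 10 digits (assumed format).
const phonePattern = /\+91\d{10}(?!\d)/g;

// Returns the start index of every phone-number-like match.
function findPhoneNumbers(text) {
  return Array.from(text.matchAll(phonePattern), (m) => m.index);
}

// Fixed string: returns the start index of every occurrence of `needle`.
function findFixed(text, needle) {
  const positions = [];
  if (needle === "") return positions;   // avoid looping forever on ""
  let from = text.indexOf(needle);
  while (from !== -1) {
    positions.push(from);
    from = text.indexOf(needle, from + 1);
  }
  return positions;
}

console.log(findPhoneNumbers("call +919876543210 or +911234567890"));
console.log(findFixed("ATAATTACCAACATC", "ATC"));
```

The regex describes a whole family of strings, while `findFixed` only ever looks for one exact string. That second case is what the rest of this post is about.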

This leaves us with the option of implementing a pattern matcher ourselves. Let's start with the most basic approach we can think of. We're given a document containing millions of words and we want to find one word. How shall we approach this? It's like finding a needle in a haystack.

Naive Approach

The first idea that comes to mind is comparing the pattern and the string character by character.


Implementation:

let string = "ATAATTACCAACATC";
let pattern = "ATC";
let position = [];                   // start index of every match
let found = true;
for(let i=0;i<string.length;i++){    // try every starting position
  found = true;
  for(let j=0;j<pattern.length;j++){
    if(string[i+j] !== pattern[j]){  // first mismatch ends this attempt
      found = false;
      break;
    }
  }
  if(found){
    position.push(i);
  }
}

console.log(position);               // [12]

But this runs in O(nm) time, where n is the length of the string and m is the length of the pattern, which is very slow on large documents.
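To get a feel for that O(nm) bound, here is a rough sketch that counts character comparisons. The function name `naiveSearch` and the `comparisons` counter are my own additions for illustration, and this version only tries positions where the whole pattern still fits, so its counts will not exactly match the figures in this post.

```javascript
// Naive search that also reports how many character comparisons it made.
function naiveSearch(text, pattern) {
  const positions = [];
  let comparisons = 0;
  for (let i = 0; i + pattern.length <= text.length; i++) {
    let j = 0;
    while (j < pattern.length) {
      comparisons++;
      if (text[i + j] !== pattern[j]) break;  // mismatch: give up on this i
      j++;
    }
    if (j === pattern.length) positions.push(i);
  }
  return { positions, comparisons };
}

// A bad case: every window matches until the very last character.
console.log(naiveSearch("A".repeat(10), "AAB"));
```

In that last call each of the 8 windows needs all 3 comparisons before failing, so the work grows with the product of the two lengths.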

How to optimize it?

At each mismatch, we move forward by only one character. How about skipping the whole word instead?


In this case, instead of starting all over again one character later, we skip past the whole window when it mismatches.

In the previous approach, we compared characters nearly 45 times; here we compared only 15 times, which is a huge leap.

We can optimize further: instead of comparing from the front, how about comparing from the end?


In this case, we compared characters just 9 times, which is nearly half of the previous case.

But as you might've guessed, this has a huge flaw: what if the end characters match but the starting characters mismatch?

So we need a concrete algorithm that skips characters in a way that decreases the overall number of comparisons without missing any match.
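To see why blindly skipping a whole pattern length is not safe, here is a minimal sketch of that idea. The function name `skipWholePattern` is made up for illustration, and it assumes we jump ahead by the full pattern length after every window, matched or not.

```javascript
// Checks only windows that start at multiples of the pattern length.
function skipWholePattern(text, pattern) {
  const positions = [];
  if (pattern.length === 0) return positions;
  for (let i = 0; i + pattern.length <= text.length; i += pattern.length) {
    if (text.startsWith(pattern, i)) positions.push(i);
  }
  return positions;
}

// Works on the example string, because the match happens to start at 12,
// a multiple of the pattern length 3.
console.log(skipWholePattern("ATAATTACCAACATC", "ATC"));

// Misses the match at index 1, because the jump from 0 goes straight to 3.
console.log(skipWholePattern("AATC", "ATC"));
```

So the skip has to depend on what we actually saw in the string, which is where the next idea comes in.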

What other options do we have?

One thing we could do is, instead of moving the entire pattern, move it only part of the way.

When a window mismatches, we check whether the mismatched character from the string also occurs somewhere in the pattern. If it does, we shift the pattern only far enough to line that character up, instead of skipping the whole pattern.


In this case, we performed 12 comparisons, and this works whether we compare the string and pattern from the front or from the end.

This algorithm is called the Boyer-Moore pattern matching algorithm. More precisely, what we implement below is its "bad character" rule.

Implementation of Boyer Moore Pattern Matching algorithm

This is a modified version of the original algorithm: the original found only the first occurrence of the pattern, while here we find all occurrences.

Step 1> Create an array of size 256 (one slot for each 8-bit character code, i.e. extended ASCII). In the next step every slot is set to -1.

let string = "ATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATC";
let pattern = "ATC";

let M = pattern.length;
let N = string.length;
let skip;                            //to determine substring skip
let res = [];                        //to store result

let map = new Array(256);            //array of 256 length


Step 2> Map each character to its last index in the pattern.

for(let c = 0;c<256;c++){
  map[c] = -1;                       //initialize to -1
}

for(let j=0;j<M;j++){
  map[pattern.charCodeAt(j)] = j;    //last index of this character in the pattern
}
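One caveat with a fixed 256-slot array: a character whose code is above 255 has no slot, so the lookup gives `undefined` and the skip arithmetic turns into `NaN`. A possible workaround, sketched here under the assumption that we are fine trading the array for a `Map` (the names `buildLastOccurrence` and `lookup` are hypothetical), is to store only the characters that actually appear in the pattern.

```javascript
// Builds a table: character -> last index of that character in the pattern.
function buildLastOccurrence(pattern) {
  const last = new Map();
  for (let j = 0; j < pattern.length; j++) {
    last.set(pattern[j], j);          // later indices overwrite earlier ones
  }
  return last;
}

// Returns the last index of `ch` in the pattern, or -1 if it never occurs.
function lookup(last, ch) {
  return last.has(ch) ? last.get(ch) : -1;
}

const table = buildLastOccurrence("ATC");
console.log(lookup(table, "T"));      // 1
console.log(lookup(table, "€"));      // -1, even though its code is above 255
```

The table now grows with the pattern rather than with the alphabet, at the cost of a hash lookup instead of an array index.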

Step 3> Loop over the string. Notice that in the for loop, instead of "i++", we use "i += skip", i.e. we skip that part of the string.

for(let i=0;i<=N-M;i+=skip)

Step 4> Set skip to 0 at the start of each iteration. This is important, because a skip that is still 0 after the inner loop is how we detect a full match.

for(let i=0;i<=N-M;i+=skip){
  skip=0;
}

Step 5> Match pattern with string.

for(let i=0;i<=N-M;i+=skip){
  skip=0;
  for(let j = M-1;j>=0;j--){

    if(pattern[j] != string[i+j]){
      skip = Math.max(1,j-map[string[i+j].charCodeAt(0)]);
      break;
    }
  }
}

Step 6> If there's a mismatch, find the length that must be skipped. Here we perform

   skip = Math.max(1,j-map[string[i+j].charCodeAt(0)]);

In some cases, e.g. "ACC" in the string against the pattern "ATC", the last characters match but the rest do not.
Logically we would line up the first "C" of the string with the "C" of the pattern, but that means moving the pattern backwards, which we shouldn't do or else we could get stuck in an infinite loop going back and forth.
To make sure we keep moving forward, whenever the computed skip is zero or negative, we set skip to 1.
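Here is the skip calculation on its own, as a small sketch. The helper name `badCharacterShift` is made up, and it uses `lastIndexOf` on the pattern instead of the precomputed table just to keep the example short.

```javascript
// j        : index in the pattern where the mismatch happened
// textChar : the character of the string sitting at that position
function badCharacterShift(pattern, j, textChar) {
  const last = pattern.lastIndexOf(textChar);  // -1 if not in the pattern
  return Math.max(1, j - last);                // never move backwards
}

// "ACC" vs "ATC": mismatch at j = 1, the string has "C" there.
// "C" is last seen at index 2 in the pattern, so 1 - 2 = -1, clamped to 1.
console.log(badCharacterShift("ATC", 1, "C"));  // 1

// Mismatch at j = 2 on "A": "A" is at index 0, so we shift by 2.
console.log(badCharacterShift("ATC", 2, "A"));  // 2

// Mismatch at j = 2 on "G", which is not in the pattern: 2 - (-1) = 3,
// so we jump past the whole window.
console.log(badCharacterShift("ATC", 2, "G"));  // 3
```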

Step 7> If skip is still 0, i.e. there was no mismatch, add "i" to the result list.

  if(skip == 0){
    res.push(i);
    skip++;                            //move on by one to look for the next match
  }

Combining them all :

let string = "ATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATAATTACCAACATCATC";
let pattern = "ATC";

let M = pattern.length;
let N = string.length;
let skip;
let res = [];

let map = new Array(256);

for(let c = 0;c<256;c++){
  map[c] = -1;
}

for(let j=0;j<M;j++){
  map[pattern.charCodeAt(j)] = j;
}

for(let i=0;i<=N-M;i+=skip){
  skip=0;
  for(let j = M-1;j>=0;j--){

    if(pattern[j] != string[i+j]){
      skip = Math.max(1,j-map[string[i+j].charCodeAt(0)]);
      break;
    }
  }
  if(skip == 0){
    res.push(i);
    skip++;
  }
}

console.log(res);

That's it! That's how Boyer Moore's pattern matching works.
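If you want to reuse this, one way to package the steps above is as a function. This is only a sketch: the name `boyerMooreSearch` is mine, it uses a `Map` instead of the 256-slot array so characters outside that range don't break the skip arithmetic, and it still implements just the bad-character rule.

```javascript
// Returns the start index of every occurrence of `pattern` in `text`.
function boyerMooreSearch(text, pattern) {
  const m = pattern.length;
  const n = text.length;
  const res = [];
  if (m === 0 || m > n) return res;

  // Last index of each pattern character.
  const last = new Map();
  for (let j = 0; j < m; j++) last.set(pattern[j], j);

  let i = 0;
  while (i <= n - m) {
    let skip = 0;
    // Compare the window against the pattern from right to left.
    for (let j = m - 1; j >= 0; j--) {
      const ch = text[i + j];
      if (pattern[j] !== ch) {
        const lastIdx = last.has(ch) ? last.get(ch) : -1;
        skip = Math.max(1, j - lastIdx);   // always move forward
        break;
      }
    }
    if (skip === 0) {                      // no mismatch: full match at i
      res.push(i);
      skip = 1;
    }
    i += skip;
  }
  return res;
}

console.log(boyerMooreSearch("ATAATTACCAACATC", "ATC"));  // [12]
console.log(boyerMooreSearch("TTTT", "TT"));              // [0, 1, 2]
```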

There are many other pattern matching algorithms, like Knuth-Morris-Pratt and Rabin-Karp, and each has its own use cases.

I found a comparison on StackOverflow (you can read it here), but in a nutshell, with n as the length of the text and m as the length of the pattern:

Boyer-Moore: Takes O(m) space, O(mn) worst case, best case Ω(n/m). Performs 25% better on dictionary words and long words. Practical use cases include the implementation of grep in GNU for string matching; Chrome probably uses it for string search.

Knuth-Morris-Pratt: Takes O(m) space, O(m+n) worst case, works better on DNA sequences.

Rabin-Karp: Uses O(1) auxiliary space; it performs better when searching for long words in a document containing many long words (see the StackOverflow link for more).
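For comparison, here is a minimal sketch of Knuth-Morris-Pratt, which is where the O(m+n) worst case comes from: an O(m) table built from the pattern, then a single left-to-right pass over the text that never moves backwards. The function names `buildFailureTable` and `kmpSearch` are my own, and this is a textbook-style version, not taken from any particular library.

```javascript
// fail[i] = length of the longest proper prefix of pattern[0..i]
// that is also a suffix of it.
function buildFailureTable(pattern) {
  const fail = new Array(pattern.length).fill(0);
  let k = 0;
  for (let i = 1; i < pattern.length; i++) {
    while (k > 0 && pattern[i] !== pattern[k]) k = fail[k - 1];
    if (pattern[i] === pattern[k]) k++;
    fail[i] = k;
  }
  return fail;
}

// Returns the start index of every occurrence of `pattern` in `text`.
function kmpSearch(text, pattern) {
  const res = [];
  if (pattern.length === 0) return res;
  const fail = buildFailureTable(pattern);
  let k = 0;                                 // characters matched so far
  for (let i = 0; i < text.length; i++) {
    while (k > 0 && text[i] !== pattern[k]) k = fail[k - 1];
    if (text[i] === pattern[k]) k++;
    if (k === pattern.length) {
      res.push(i - k + 1);
      k = fail[k - 1];                       // keep going for overlaps
    }
  }
  return res;
}

console.log(kmpSearch("ATAATTACCAACATC", "ATC"));  // [12]
```

Unlike the Boyer-Moore sketch, this one reads every character of the text at least once, so it cannot get the "skip most of the text" best case, but it also cannot degrade to O(mn).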

I hope you liked my explanation. I usually write about how to solve interview questions and real-life applications of algorithms.

If I messed up somewhere or explained something wrongly, please comment down below.

Thanks for reading! :)

GitHub: https://github.com/AKHILP96/Data-Structures-and-Algorithms/blob/master/Algorithm/boyermoore.js

PS: I am looking for a job. If you want someone who knows how to design UI/UX while keeping development in mind, hit me up :) Thanks!


Latest comments (49)

Kaznarah1

thanks

Nageshwar Reddy Pandem

Good Explanation

Eddie Eddie

Great article, now it makes me think further. How does "Search in File.." in IntelliJ work? :D

ArcticSpaceFox

Nice profile picture of Gilfoyle 😂 Cool post, very interesting for a newbie

Akhil

Thanks for reading :)

Dule Martins

At first, while reading I was confused about the time it takes to match each pattern.

Nice read, let me try implementing it.

Akhil

Thanks for reading :)

Alejandro Brugarolas

Amazing post!!!!

I'll leave here an implementation of BM and KMP I did in C for a university task, just in case someone wants to dig deeper into this.

github.com/Brugui7/Algorithmics/tr...

Akhil

That's awesome! Thanks for reading :)

Akhil

Thanks for reading :)

Daniël

Really interesting!

Aditya Mandke

Hey Akhil! Amazing article! I had always wondered how is chrome able to find matching strings so fast.
One question, is this type of pattern matching slower or faster than using Regular Expressions (say in python)?

Sam Wight

It depends on the complexity of the regex you're searching with. Regex engines usually build some sort of state machine from your regex, sort of like a compiler. Then they use that to check against the string. A state-machine based implementation will just be slower because of all of the extra variable changes it's having to do, but with simple enough regexes the compiled state machine might be fairly similar to this.

Regexes also aren't guaranteed to be much faster than a hand-written implementation for things like finding phone numbers, etc. They're pretty darn fast, don't get me wrong, but there might possibly be optimizations that could be made for specific use cases. Regex is just a convenient abstraction. Much like Javascript or Python, it's "just fast enough" for the vast majority of use cases, but for some like these, you need a better implementation for it to be performant.

ZaneHannanAU

Regexes are generally functions, with arbitrary rules as to their composition. For example, "all 10 or 11 digit phone numbers starting with +91" might be expressed in regex as "(?:\+91|\(\+91\))[ -]?\d{3}[ -]?\d{3}[ -]?\d{3}", but a compiled function might use many other tricks to find its way through the document, be it a trie (moderately memory intensive) or whatever.

Regexes are just a simplified means of expressing functions, with their own grammatical structure.

Simple byte lookups (especially in an ASCII or ASCII compatible document) are dozens if not hundreds or thousands of times faster than composition of a function like that.

Akhil

Yea, it depends on various factors like how many time regex is being executed etc.

Eg : If your string is 'QABC' and pattern is 'ABC' then the naive algorithm will perform better.

I read somewhere about the progress being made in fast string matching with regex using pattern matching algorithms with them.

ZaneHannanAU

That first one is true in js, but in most languages it's false.

Using regex to find string matches is still quite slow, but does work fairly well. In a compiled language, like rust, c, or go, it will be quite consistent, and have a constant time (unless gc interrupts it).

The short of it is: avoid regexes where possible. There are many premade solutions available.

Akhil

I am not sure about its speed in python, maybe StackOverflow might help with that but overall regex are faster for dynamic situations like finding all phones numbers and algorithms might be faster for finding a particular phone number in a record of million phone numbers.

SeongHoon Kim • Edited

That code does not work for a string which is consisting of only the same character. (e.g. pattern = 'TT')

so, after replacing 'skip = Math.max(1,j-map[string[i+j]]);' with 'skip = Math.max(1,j-map[string[i+j].charCodeAt(0)]);', working correctly.

Akhil

Thanks for pointing out :) Code updated!

SeongHoon Kim

well, the combined code is still the same as before.

Yves Lucet

Boyer-Moore and KMP are both O(m+n) in the worst case. Please fix the typo (or check your references).

Akhil • Edited

Worst case is still O(mn).
Read this : cs.cornell.edu/courses/cs312/2002s...

maowtm

KMP is definitely O(m+n) even in worst case, because after the table construction (O(m)) it's just a linear scan on the string (O(n)).

Akhil

Thanks for sharing! Updated!

Yves Lucet

Agreed, but your article shows O(mn).

Akhil

This algorithm works well if the alphabet is reasonably big, but not too big. If the last character usually fails to match, the current shift s is increased by m each time around the loop. The total number of character comparisons is typically about n/m, which compares well with the roughly n comparisons that would be performed in the naive algorithm for similar problems. In fact, the longer the pattern string, the faster the search! However, the worst-case run time is still O(nm). The algorithm as presented doesn't work very well if the alphabet is small, because even if the strings are randomly generated, the last occurrence of any given character is near the end of the string. This can be improved by using another heuristic for increasing the shift. Consider this example:

T = ...LIVID_MEMOIRS...
P = EDITED_MEMOIRS

Hem

This is interesting !

Akhil

Thanks for reading :)

John Pham • Edited

Great write up on the different string matching algorithms!

The algorithm which we will implement might be similar to the one used in Chrome, but since its Google we're talking about, they might've made optimizations

The source code for the find in page is actually open source! You can see how it's implemented here: source.chromium.org/chromium/chrom...

Actual implementation: https://source.chromium.org/chromium/chromium/src/+/master:v8/src/strings/string-search.h;l=281;drc=e3355a4a33909a48ebb8614048d90cffc67d287e?q=string%20search&ss=chromium&originalUrl=https:%2F%2Fcs.chromium.org%2F

Looks like they swap in different algorithms based on the context.

Akhil

OMG! Thanks for sharing :). I read on StackOverflow that somewhere around 2008-2010, when Chrome was picking up, they shared how they had implemented a version of the Boyer-Moore pattern matching algorithm. But I couldn't find it on YouTube.

Ben Halpern

Very cool!

Akhil

OMG !! Thanks for reading 🙏🙏🙏

Ben Halpern

🤓

Fiqri Syah Redha

This is interesting. Thanks for sharing.

Akhil

Thanks a lot for reading :)