The idea behind this algorithm is actually quite simple. We just do the naive search more efficiently, we make sure that we don't repeat the work that we have already done.
Let me start with an example, we are given a string
abababc and we want to check if the pattern
ababc exists in the given string. When we start matching from the beginning:
Now the character
a at index 4 in the string does not match the character
c at index 4 of the pattern. In the naive algorithm we would start the whole character matching process again from index 1 of the string (we would be matching a
ababc), but if we look at the string and the pattern do we really have to do that? We can resume the matching from the following positions:
ababc (matching resumes at index 4)
ababc (matching resumes at index 2)
We can do this because we successfully matched
abab which means there are two
ab’s in both the pattern and the string. Now we know that the last
ab in the string matches with the first
ab of the pattern and resume the matching after that.
So basically what we do is create a table that tells us where to resume our matching from in the pattern if there was a mismatch while matching the pattern and the string.
From a different perspective all we are doing is finding suffixes that are prefixes in the pattern so that if there is a mismatch after the suffix we can resume the search from the end of prefix. In the example above we there was a mismatch after we matched
ab is suffix and prefix so we can move back to index 2 in pattern which is just after the prefix
ab and resume our matching process
Read more on Wikipedia, they have a pretty good explanation and example.
Now let's code the table generation
def kmp_table(W): n = len(W) T = [0 for _ in range(n)] T = -1 pos = 0 for i in range(1, n): if W[i] == W[pos]: T[i] = T[pos] else: T[i] = pos pos = T[pos] while pos >= 0 and W[i] != W[pos]: pos = T[pos] pos += 1 return T
For building the table the idea is to have two different pointers starting at index 0 and 1, we can think of as one pointer for the prefix and another for the suffix. Basically if the characters at prefix and suffix pointers match we increment the pointers otherwise we keep changing the prefix pointer till it matches the suffix pointer or till it reaches the beginning of the pattern with the help on the table we have been building.
Now that we have our table we can start our check and use this table to get the index to resume the check
def KMP(string, substring): m = len(substring) n = len(string) T = kmp_table(substring) pos = 0 i = 0 while i < n: if string[i] != substring[pos]: pos = T[pos] if pos < 0: pos += 1 i += 1 else: i += 1 pos += 1 if pos == m: return True return False