Introduction
Hi everyone! Today we're going to discuss a very interesting algorithm that is very handy in the string matching problems.
Consider the following problem: Given the string text
= "cvabcg", and the string pattern
= "abc". Find whether the pattern
exists inside of the text
, and return the index of the first matching character.
In the previous example, the answer should be
, as we can find the pattern
"abc" starting from index 2 inside the text
string.
The Naïve Way
The simplest way to solve this is to loop through each character of the text
string, then check if the pattern
starts from that character index or not.
int patternMatching(string text, string pattern) {
int n = text.size(), m = pattern.size();
for (int i = 0; i < n; i++) {
bool flag = true;
int idx = i;
for (int j = 0; j < m; j++) {
if (pattern[j] != text[idx]) {
flag = false;
break;
} else {
idx++;
}
}
if (flag) {
return i;
}
}
return -1;
}
Unfortunately, the time complexity of this approach is
, where
is the length of the text
string, and
is the length of the pattern
string.
The Efficient Way (Using KMP Algorithm)
The KMP algorithm helps solving this problem in just . It consists of two stages:
- Constructing the LPS (AKA: PI) table.
- Pattern matching.
1. Constructing the LPS/PI table.
LPS stands for longest proper prefix that is also a suffix.
This is the preprocessing step in our algorithm.
= the longest proper prefix of the substring which is also a suffix of the substring .
Note: The prefix and suffix we mentioned can overlap.
- Proper prefix of a string means that the prefix is not equal to the string itself. Which indicates that always (substring of length 1 can't have a proper prefix).
Examples
1. Consider string "ababcd"
a | b | a | b | c | d |
---|---|---|---|---|---|
0 | 0 | 1 | 2 | 0 | 0 |
Explanation: Examine the following substrings:
a > Not a proper prefix, hence lps[i] = 0
ab > prefix != suffix, hence lps[i] = 0
aba > prefix 'a' == suffix 'a', hence lps[i] = 1
abab > prefix 'ab' == suffix 'ab', hence lps[i] = 2
ababc > prefix != suffix, hence lps[i] = 0
ababcd > prefix != suffix, hence lps[i] = 0
2. Consider string "abcabcabc"
a | b | c | a | b | c | a | b | c |
---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
Explanation: Examine the following substrings:
a > Not a proper prefix, hence lps[i] = 0
ab > prefix != suffix, hence lps[i] = 0
abc > prefix != suffix, hence lps[i] = 0
abca > prefix 'a' == suffix 'a', hence lps[i] = 1
abcab > prefix 'ab' == suffix 'ab', hence lps[i] = 2
abcabc > prefix 'abc' == suffix 'abc', hence lps[i] = 3
abcabca > prefix 'abca' == suffix 'abca', hence lps[i] = 4 > Overlapping
abcabcab > prefix 'abcab' == suffix 'abcab', hence lps[i] = 5 > Overlapping
abcabcabc > prefix 'abcabc' == suffix 'abcabc', hence lps[i] = 6 > Overlapping
Algorithm
Let's define prefixMatchedEndIdx
which marks the end index of the current prefix that matches the current suffix.
If the new string character (new suffix character) matches the character at
prefixMatchedEndIdx
, then we extend the current prefix length by one.If the new suffix character doesn't match the character at
prefixMatchedEndIdx
, and theprefixMatchedEndIdx
is actually the first character of the string, then there is no way this suffix matches any prefix.If the new suffix character doesn't match the character at
prefixMatchedEndIdx
, butprefixMatchedEndIdx
is NOT the first character of the string, then there is a chance we can find a smaller prefix that can match the current suffix.
How much reduction we need to the prefix length?
- We reduce it to be the lps (longest prefix that is also a suffix) of the previous prefix length.
- In other words, we reduce it to the previous prefix length that matched some suffix, so that there is a chance of the new smaller prefix matching some suffix.
Consider the follwoing string: "aabaaac"
i = 4: "aabaa | ac" > lps[i] = 2, prefixMatchedEndIdx = 2
i = 5: "aabaaa | c" > pattern[prefixMatchedEndIdx] != pattern[i]
> prefixMatchedEndIdx = lps[prefixMatchedEndIdx - 1] = 1
vector<int> computeLps(string pattern){
int m = pattern.size();
vector<int> lps (pattern.size(), 0);
int prefixMatchedEndIdx = 0, i = 1;
while(i < m) {
if (pattern[i] == pattern[prefixMatchedEndIdx]) {
prefixMatchedEndIdx ++;
lps[i] = prefixMatchedEndIdx;
i++;
} else if (prefixMatchedEndIdx == 0) {
lps[i] = prefixMatchedEndIdx;
i++;
} else {
prefixMatchedEndIdx = lps[prefixMatchedEndIdx - 1];
}
}
return lps;
}
- Time Complexity: , where is the length of the string we are preprocessing.
2. Pattern matching
After knowing how to construct the lps
array, we want to check whether a pattern exists inside of text
.
Concatenate both the pattern and the text (starting with the pattern), and use a separator between them that is not a character of the two strings (such as
$
).Compute the
lps
array for the new string.-
If we find an element with an lps value == length of the pattern, it means that there is a suffix (substring from
text
) that is equal to the prefix (pattern
), and since the lps value is equal to the length of the pattern, it means that we found the entire pattern inside the text.- In this case we want to return the index of the first matching character of the pattern inside the text; which in this case:
i - (m + 1) - (m - 1)
.- : current index in the concatenated string.
-
: length of the
pattern
string. -
: length of the
pattern
string + the separator. - : to get to the first character of the matched string.
- In this case we want to return the index of the first matching character of the pattern inside the text; which in this case:
int patternMatching(string text, string pattern) {
int n = text.size(), m = pattern.size();
string concat = pattern + "$" + text;
vector<int> lps = computeLps(concat);
for (int i = m + 1; i <= n + m; i++) { /* m + 1 is where the text starts */
if (lps[i] == m) {
// m + 1, to skip the concatenated pattern with the separator $
// m - 1, to get the index at the start of the string
return i - (m + 1) - (m - 1);
}
}
return -1;
}
-
Time Complexity:
, where
is the length of the
text
string.
That's it, now you just added a new interesting algorithm to your toolbox. I hope this was helpful, and let me know your thoughts in the comments.
Top comments (1)
That should be declared:
otherwise you make pointless copies of the arguments. In general, you should use
size_t
instead ofint
— otherwise you can get truncation warnings.You also didn't mention that another nice thing about KMP is that it never has to backtrack in the input which means it works for streaming applications.