Abdelrahman AbouDeghedy

Posted on Aug 21, 2023

Pattern Matching Using Knuth–Morris–Pratt (KMP) Algorithm

#algorithms #cpp #datastructures #tutorial

Introduction

Hi everyone! Today we're going to discuss a very interesting algorithm that is very handy in the string matching problems.

Consider the following problem: Given the string text = "cvabcg", and the string pattern = "abc". Find whether the pattern exists inside of the text, and return the index of the first matching character.

In the previous example, the answer should be $2$ , as we can find the pattern "abc" starting from index 2 inside the text string.

The Naïve Way

The simplest way to solve this is to loop through each character of the text string, then check if the pattern starts from that character index or not.

int patternMatching(string text, string pattern) {
    int n = text.size(), m = pattern.size();

    for (int i = 0; i < n; i++) {
        bool flag = true;
        int idx = i;
        for (int j = 0; j < m; j++) {
            if (pattern[j] != text[idx]) {
                flag = false;
                break;
            } else {
                idx++;
            }
        }

        if (flag) {
            return i;
        }
    }

    return -1;
}

Unfortunately, the time complexity of this approach is $O (n * m)$ , where $n$ is the length of the text string, and $m$ is the length of the pattern string.

The Efficient Way (Using KMP Algorithm)

The KMP algorithm helps solving this problem in just $O (n + m)$ . It consists of two stages:

Constructing the LPS (AKA: PI) table.
Pattern matching.

1. Constructing the LPS/PI table.

LPS stands for longest proper prefix that is also a suffix.
This is the preprocessing step in our algorithm.
$\text{lps}[i]$ = the longest proper prefix of the substring $\text{pattern}[0,\dots i]$ which is also a suffix of the substring $\text{pattern}[0,\dots i]$ .

Note: The prefix and suffix we mentioned can overlap.

Proper prefix of a string means that the prefix is not equal to the string itself. Which indicates that $\text{lps}[0] = 0$ always (substring of length 1 can't have a proper prefix).

Examples

1. Consider string "ababcd"

a	b	a	b	c	d
0	0	1	2	0	0

Explanation: Examine the following substrings:
a      > Not a proper prefix, hence lps[i] = 0
ab     > prefix != suffix, hence lps[i] = 0
aba    > prefix 'a' == suffix 'a', hence lps[i] = 1
abab   > prefix 'ab' == suffix 'ab', hence lps[i] = 2
ababc  > prefix != suffix, hence lps[i] = 0
ababcd > prefix != suffix, hence lps[i] = 0

2. Consider string "abcabcabc"

a	b	c	a	b	c	a	b	c
0	0	0	1	2	3	4	5	6

Explanation: Examine the following substrings:
a         > Not a proper prefix, hence lps[i] = 0
ab        > prefix != suffix, hence lps[i] = 0
abc       > prefix != suffix, hence lps[i] = 0
abca      > prefix 'a' == suffix 'a', hence lps[i] = 1
abcab     > prefix 'ab' == suffix 'ab', hence lps[i] = 2
abcabc    > prefix 'abc' == suffix 'abc', hence lps[i] = 3
abcabca   > prefix 'abca' == suffix 'abca', hence lps[i] = 4 > Overlapping
abcabcab  > prefix 'abcab' == suffix 'abcab', hence lps[i] = 5 > Overlapping
abcabcabc > prefix 'abcabc' == suffix 'abcabc', hence lps[i] = 6 > Overlapping

Algorithm

Let's define prefixMatchedEndIdx which marks the end index of the current prefix that matches the current suffix.

If the new string character (new suffix character) matches the character at prefixMatchedEndIdx, then we extend the current prefix length by one.
If the new suffix character doesn't match the character at prefixMatchedEndIdx, and the prefixMatchedEndIdx is actually the first character of the string, then there is no way this suffix matches any prefix.
If the new suffix character doesn't match the character at prefixMatchedEndIdx, but prefixMatchedEndIdx is NOT the first character of the string, then there is a chance we can find a smaller prefix that can match the current suffix.

How much reduction we need to the prefix length?

We reduce it to be the lps (longest prefix that is also a suffix) of the previous prefix length.
- In other words, we reduce it to the previous prefix length that matched some suffix, so that there is a chance of the new smaller prefix matching some suffix.

Consider the follwoing string: "aabaaac"
i = 4: "aabaa  | ac"  > lps[i] = 2, prefixMatchedEndIdx = 2
i = 5: "aabaaa | c"   > pattern[prefixMatchedEndIdx] != pattern[i]
        > prefixMatchedEndIdx = lps[prefixMatchedEndIdx - 1] = 1

vector<int> computeLps(string pattern){
    int m = pattern.size();
    vector<int> lps (pattern.size(), 0);
    int prefixMatchedEndIdx = 0, i = 1;

    while(i < m) {
        if (pattern[i] == pattern[prefixMatchedEndIdx]) {
            prefixMatchedEndIdx ++;
            lps[i] = prefixMatchedEndIdx;
            i++;
        } else if (prefixMatchedEndIdx == 0) {
            lps[i] = prefixMatchedEndIdx;
            i++;
        } else {
            prefixMatchedEndIdx = lps[prefixMatchedEndIdx - 1];
        }
    }

    return lps;
}

Time Complexity: $O(m)$ , where $m$ is the length of the string we are preprocessing.

2. Pattern matching

After knowing how to construct the lps array, we want to check whether a pattern exists inside of text.

Concatenate both the pattern and the text (starting with the pattern), and use a separator between them that is not a character of the two strings (such as $).
Compute the lps array for the new string.
If we find an element with an lps value == length of the pattern, it means that there is a suffix (substring from text) that is equal to the prefix (pattern), and since the lps value is equal to the length of the pattern, it means that we found the entire pattern inside the text.
- In this case we want to return the index of the first matching character of the pattern inside the text; which in this case: i - (m + 1) - (m - 1).
  - $i$ : current index in the concatenated string.
  - $m$ : length of the pattern string.
  - $m + 1$ : length of the pattern string + the separator.
  - $m - 1$ : to get to the first character of the matched string.

int patternMatching(string text, string pattern) {
    int n = text.size(), m = pattern.size();

    string concat = pattern + "$" + text;
    vector<int> lps = computeLps(concat);

    for (int i = m + 1; i <= n + m; i++) { /* m + 1 is where the text starts */
        if (lps[i] == m) {
            // m + 1, to skip the concatenated pattern with the separator $
            // m - 1, to get the index at the start of the string
            return i - (m + 1) - (m - 1);
        }
    }

    return -1;
}

Time Complexity: $O(n)$ , where $n$ is the length of the text string.

That's it, now you just added a new interesting algorithm to your toolbox. I hope this was helpful, and let me know your thoughts in the comments.

Top comments (1)

Paul J. Lucas • Aug 23 '23

int patternMatching(string text, string pattern)

That should be declared:

size_t patternMatching(string const &text, string const &pattern)

otherwise you make pointless copies of the arguments. In general, you should use size_t instead of int — otherwise you can get truncation warnings.

You also didn't mention that another nice thing about KMP is that it never has to backtrack in the input which means it works for streaming applications.

DEV Community