loading...

String Searching Using Rabin-Karp

jjb profile image JB Updated on ・4 min read

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find a pattern (or patterns) in an input string.

It is not the quickest algorithm for finding a single pattern, however a small tweak can allow the algorithm to perform well for multiple patterns. Using Rabin-Karp to search for multiple patterns could come in handy for things like detecting common words, sentences, or characters between documents or even source code.

In this post I will detail resources for learning about Rabin-Karp, go over it's key features, & provide a complete + efficient implementation of the algorithm (with test cases).

Resources:

  1. Rabin-Karp Wikipedia
  2. MIT Video Explanation
  3. Rolling Hash Video Explanation
  4. Article Explanation
  5. Implementation

Takeaways:

  • Aside from a tricky rolling-hash, the algorithm is simple compared to other string-searching algorithms.
  • The algorithm is as follows:
    • Accept an input i and a pattern p
    • Define a variable n to be the length of i and a variable m to be the length of p.
    • Create a hash of pattern p (hashP) and a hash of length m of input i (hashI).
      • I.e for an input i = "abcde" and a pattern length m of 2, hashI would be a hash of the substring "ab".
    • If hashP == hashI then we have possibly found pattern p at the start of input i. Check each character of the substring representing hashI with our pattern p to be sure (call this function CheckEqual())
    • Otherwise iterate from pattern length m to input length n
    • For each iteration recalculate the hash hashI:
      • We are sliding a window across our input
      • Window is of size m (pattern length)
      • This means for each iteration we are removing a leading character from our hash and adding a trailing character
      • If i = "abcde", m = 2, and hashI represents "ab" - then our first iteration would remove "a" from hashI and add "c". Meaning hashI now represents "bc". This part of the operation is done in constant time.
    • After rehashing, check if hashP == hashI. If they are the same, run our character equality check CheckEqual() to be certain.
    • Explore all of i until we have found our pattern or no pattern exists in i.
  • A rolling hash is a hash function where the input is hashed in a sliding window. As the window moves through the input the hash is updated quicker than rehashing all the characters in the new window - as the previous window and the current window share characters.
    • E.g if our input is "abcde" and our window size is 3. We first hash "abc" then slide our window and next hash "bcd" - a rolling hash takes advantage of the fact that "bc" was already hashed. Instead of hashing "bc" again a rolling hash removes "a" and appends to the hash of "bc" the hash of "c". This means a hash for "bc" was only calculated once for two hashes.
    • Extrapolated over larger inputs, and more times, a rolling hash can be very efficient. It's the reason Rabin-Karp is reasonably fast at string-searching.
  • Rabin-Karp's rolling hash is calculated modulo a large prime number. This means hash values are also relatively prime and the distribution of the hash values are more uniform.
    • Choosing a large prime helps us avoid false positives - where two hashes are the same but the values they represent are not.
    • Choosing a random prime means we can guard against the worst case running time, as no specific inputs will affect the running time predictably.
  • The rolling hash is often computed multiplied by a relatively prime Base/Radix. This value should be at least as large as the character set. For my implementation I used 256 - as this is the number of ASCII characters.
  • A common rolling hash for Rabin-Karp is: S[i]*R^m-i % Q. Where S is our input string, i is the position (in a loop/iteration), R is the radix/base, m is the pattern length, and Q is the prime.
        private long Hash(string input, int patternLength)
        {
            long hash = 0;

            for (int i = 0; i < patternLength; i++)
            {
                int ascii = input[i];
                // Assume Base/Prime are already defined
                hash = (Base * hash + ascii) % Prime;
            }

            return hash;
        }
  • Time complexity of Rabin-Karp is linear (specifically, amortized O(n + m) where n is input length and m is pattern length).
  • Brute forced approaches, and worst case Rabin-Karp (more likely with smaller, non-random primes, and/or bad rolling hash), are O(n * m).
  • Space is O(1)
  • We can modify Rabin-Karp with a Bloom Filter or simply a hash set to search for multiple patterns.
  • In my implementations I return the index (aka offset) the pattern starts at in the input. You could easily modify either the single or multiple pattern implementations to do other things (like count occurrences of patterns).

Below are Rabin-Karp implementations for single & multiple patterns (of the same size). The implementations assume ASCII input and will work for inputs up to a reasonably large size. The rolling hash uses a relatively large random prime to guard against poor running times:

As always, if you found any errors in this post please let me know!

Discussion

pic
Editor guide