CS Level Up Series (30 Part Series)

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find a pattern (or patterns) in an input string.

It is not the quickest algorithm for finding a single pattern, however a small tweak can allow the algorithm to perform well for multiple patterns. Using Rabin-Karp to search for multiple patterns could come in handy for things like detecting common words, sentences, or characters between documents or even source code.

In this post I will detail resources for learning about Rabin-Karp, go over it's key features, & provide a complete + efficient implementation of the algorithm (with test cases).

## Resources:

- Rabin-Karp Wikipedia
- MIT Video Explanation
- Rolling Hash Video Explanation
- Article Explanation
- Implementation

#### Takeaways:

- Aside from a tricky
**rolling-hash**, the algorithm is simple compared to other string-searching algorithms. - The algorithm is as follows:
- Accept an input
`i`

and a pattern`p`

- Define a variable
`n`

to be the length of`i`

and a variable`m`

to be the length of`p`

. - Create a hash of pattern
`p`

(`hashP`

) and a hash of length`m`

of input`i`

(`hashI`

).- I.e for an input
`i = "abcde"`

and a pattern length`m`

of 2,`hashI`

would be a hash of the substring`"ab"`

.

- I.e for an input
- If
`hashP == hashI`

then we have possibly found pattern`p`

at the start of input`i`

. Check each character of the substring representing`hashI`

with our pattern`p`

to be sure (call this function`CheckEqual()`

) - Otherwise iterate from pattern length
`m`

to input length`n`

- For each iteration recalculate the hash
`hashI`

:- We are sliding a window across our input
- Window is of size
`m`

(pattern length) - This means for each iteration we are removing a leading character from our hash and adding a trailing character
- If
`i = "abcde"`

,`m = 2`

, and`hashI`

represents`"ab"`

- then our first iteration would remove`"a"`

from`hashI`

and add`"c"`

. Meaning`hashI`

now represents`"bc"`

. This part of the operation is done in constant time.

- After rehashing, check if
`hashP == hashI`

. If they are the same, run our character equality check`CheckEqual()`

to be certain. - Explore all of
`i`

until we have found our pattern or no pattern exists in`i`

.

- Accept an input
- A rolling hash is a hash function where the input is hashed in a
**sliding window**. As the window moves through the input the hash is updated quicker than rehashing all the characters in the new window - as the previous window and the current window share characters.- E.g if our input is
`"abcde"`

and our window size is 3. We first hash`"abc"`

then slide our window and next hash`"bcd"`

- a rolling hash takes advantage of the fact that`"bc"`

was already hashed. Instead of hashing`"bc"`

again a rolling hash removes`"a"`

and appends to the hash of`"bc"`

the hash of`"c"`

. This means a hash for`"bc`

" was only calculated once for two hashes. - Extrapolated over larger inputs, and more times, a rolling hash can be very efficient. It's the reason Rabin-Karp is reasonably fast at string-searching.

- E.g if our input is
- Rabin-Karp's rolling hash is calculated modulo a large prime number. This means hash values are also relatively prime and the distribution of the hash values are more uniform.
- Choosing a large prime helps us avoid false positives - where two hashes are the same but the values they represent are not.
- Choosing a random prime means we can guard against the worst case running time, as no specific inputs will affect the running time predictably.

- The rolling hash is often computed multiplied by a relatively prime Base/Radix. This value should be at least as large as the character set. For my implementation I used 256 - as this is the number of ASCII characters.
- A common rolling hash for Rabin-Karp is:
`S[i]*R^m-i % Q`

. Where`S`

is our input string,`i`

is the position (in a loop/iteration),`R`

is the radix/base,`m`

is the pattern length, and`Q`

is the prime.- Using Horner's Method this can be simplified in code to:

```
private long Hash(string input, int patternLength)
{
long hash = 0;
for (int i = 0; i < patternLength; i++)
{
int ascii = input[i];
// Assume Base/Prime are already defined
hash = (Base * hash + ascii) % Prime;
}
return hash;
}
```

- Time complexity of Rabin-Karp is linear (specifically, amortized
`O(n + m)`

where`n`

is input length and`m`

is pattern length). - Brute forced approaches, and worst case Rabin-Karp (more likely with smaller, non-random primes, and/or bad rolling hash), are
`O(n * m)`

. - Space is
`O(1)`

- We can modify Rabin-Karp with a Bloom Filter or simply a hash set to search for multiple patterns.
- In my implementations I return the index (aka offset) the pattern starts at in the input. You could easily modify either the single or multiple pattern implementations to do other things (like count occurrences of patterns).

Below are Rabin-Karp implementations for single & multiple patterns (of the same size). The implementations assume ASCII input and will work for inputs up to a reasonably large size. The rolling hash uses a relatively large random prime to guard against poor running times:

As always, if you found any errors in this post please let me know!

CS Level Up Series (30 Part Series)

## Discussion