Rabin-Karp is an algorithm used for pattern matching/searching in a string that uses a hash function unlike the naïve algorithms.
- We will iterate through the string with a window with the same length of the pattern.
We will calculate a hash value for the pattern and the window of the string and check if:
The hash value of pattern equals the hash value of window of string, check each character of the window with the pattern as two distinct strings may have the same hash value due to collisions
The hash value of pattern isn't equal to the hash value of window of string, then check hash value of the next window with the hash value of the pattern and so on.
The hash function here is used to map a string to an integer value.
To get a useful hash function it should:
- be easily calculated
- can perform precalculations.
- calculating the hash function of any string takes constant time O(1)
# q is a prime number -modulus- def rabin_karp(string, pattern, q): d = 256 n = len(string) m = len(pattern) s = 0 p = 0 h = 1 i = 0 j = 0 for _ in range(m - 1): h = (h*d) % q for i in range(m): p = (d*p + ord(pattern[i])) % q s = (d*s + ord(string[i])) % q for i in range(n - m + 1): if p == s: for j in range(m): if string[i+j] != pattern[j]: break j += 1 if j == m: print("pattern found at index:", i) if i < n-m: s = (d * (s - ord(string[i])*h) + ord(string[i + m])) % q if s < 0: s = s + q string = "ABCDAF" pattern = "DAF" q = 113 rabin_karp(string, pattern, q)
A spurious hit occurs when two distinct strings have the same hash value. It increases the time complexity of the algorithm.
To reduce the occurrence of spurious hits, we use modulus.
The average case and best case complexity of Rabin-Karp is
O(m + n)
The worst case complexity is
O(nm), which happens when:
- all characters of pattern and text are same as the hash values of all windows of string will match with hash value of pattern.
- spurious hit occurs for every window.
However, if we use a good hashing function, the expected complexity would be
O(n + k . t), where t is the number of matches.
As a good hashing function would rarely cause collisions. Thus, we would rarely need to compare two substrings when there is no match.
If we only want to check the existence of the pattern, the complexity would be
O(n+k), as we can break the loop after the first occurrence.