loading...

String Searching Using Rabin-Karp

jjb profile image JB Updated on ・4 min read

CS Level Up Series (30 Part Series)

1) CS Level Up Series Introduction 2) Dynamic Arrays 3 ... 28 3) Linked Lists 4) Stacks 5) Queues 6) Hash Tables 7) Binary Search Trees 8) Binary Heaps 9) Priority Queues 10) Graphs 11) Tries 12) Binary & Bit Manipulation 13) Common Sorting Algorithms 14) Searching Algorithms 15) Permutations, Combinations, & Subsets 16) NP-Complete & Fibonacci Heap 17) Detecting Graph Cycles With Depth-First Search 18) Finding Shortest Paths In Graphs (using Dijkstra's & BFS) 19) Topological Sorting of Directed Acyclic Graphs (DAGs) 20) Finding Articulation Points & Bridges in Undirected Graphs 21) Finding Strongly Connected Components in Directed Graphs using Tarjan's Algorithm 22) Checking If An Undirected Graph Is Bipartite 23) Extending An Iterator 24) Union-find (Disjoint-set) 25) Minimum Spanning Tree (Kruskal's Algorithm) 26) Sliding Window Technique 27) String Searching Using Rabin-Karp 28) Fenwick Tree (Binary Indexed Tree) 29) Dynamic Programming 30) Traveling Salesman Problem

The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find a pattern (or patterns) in an input string.

It is not the quickest algorithm for finding a single pattern, however a small tweak can allow the algorithm to perform well for multiple patterns. Using Rabin-Karp to search for multiple patterns could come in handy for things like detecting common words, sentences, or characters between documents or even source code.

In this post I will detail resources for learning about Rabin-Karp, go over it's key features, & provide a complete + efficient implementation of the algorithm (with test cases).

Resources:

  1. Rabin-Karp Wikipedia
  2. MIT Video Explanation
  3. Rolling Hash Video Explanation
  4. Article Explanation
  5. Implementation

Takeaways:

  • Aside from a tricky rolling-hash, the algorithm is simple compared to other string-searching algorithms.
  • The algorithm is as follows:
    • Accept an input i and a pattern p
    • Define a variable n to be the length of i and a variable m to be the length of p.
    • Create a hash of pattern p (hashP) and a hash of length m of input i (hashI).
      • I.e for an input i = "abcde" and a pattern length m of 2, hashI would be a hash of the substring "ab".
    • If hashP == hashI then we have possibly found pattern p at the start of input i. Check each character of the substring representing hashI with our pattern p to be sure (call this function CheckEqual())
    • Otherwise iterate from pattern length m to input length n
    • For each iteration recalculate the hash hashI:
      • We are sliding a window across our input
      • Window is of size m (pattern length)
      • This means for each iteration we are removing a leading character from our hash and adding a trailing character
      • If i = "abcde", m = 2, and hashI represents "ab" - then our first iteration would remove "a" from hashI and add "c". Meaning hashI now represents "bc". This part of the operation is done in constant time.
    • After rehashing, check if hashP == hashI. If they are the same, run our character equality check CheckEqual() to be certain.
    • Explore all of i until we have found our pattern or no pattern exists in i.
  • A rolling hash is a hash function where the input is hashed in a sliding window. As the window moves through the input the hash is updated quicker than rehashing all the characters in the new window - as the previous window and the current window share characters.
    • E.g if our input is "abcde" and our window size is 3. We first hash "abc" then slide our window and next hash "bcd" - a rolling hash takes advantage of the fact that "bc" was already hashed. Instead of hashing "bc" again a rolling hash removes "a" and appends to the hash of "bc" the hash of "c". This means a hash for "bc" was only calculated once for two hashes.
    • Extrapolated over larger inputs, and more times, a rolling hash can be very efficient. It's the reason Rabin-Karp is reasonably fast at string-searching.
  • Rabin-Karp's rolling hash is calculated modulo a large prime number. This means hash values are also relatively prime and the distribution of the hash values are more uniform.
    • Choosing a large prime helps us avoid false positives - where two hashes are the same but the values they represent are not.
    • Choosing a random prime means we can guard against the worst case running time, as no specific inputs will affect the running time predictably.
  • The rolling hash is often computed multiplied by a relatively prime Base/Radix. This value should be at least as large as the character set. For my implementation I used 256 - as this is the number of ASCII characters.
  • A common rolling hash for Rabin-Karp is: S[i]*R^m-i % Q. Where S is our input string, i is the position (in a loop/iteration), R is the radix/base, m is the pattern length, and Q is the prime.
        private long Hash(string input, int patternLength)
        {
            long hash = 0;

            for (int i = 0; i < patternLength; i++)
            {
                int ascii = input[i];
                // Assume Base/Prime are already defined
                hash = (Base * hash + ascii) % Prime;
            }

            return hash;
        }
  • Time complexity of Rabin-Karp is linear (specifically, amortized O(n + m) where n is input length and m is pattern length).
  • Brute forced approaches, and worst case Rabin-Karp (more likely with smaller, non-random primes, and/or bad rolling hash), are O(n * m).
  • Space is O(1)
  • We can modify Rabin-Karp with a Bloom Filter or simply a hash set to search for multiple patterns.
  • In my implementations I return the index (aka offset) the pattern starts at in the input. You could easily modify either the single or multiple pattern implementations to do other things (like count occurrences of patterns).

Below are Rabin-Karp implementations for single & multiple patterns (of the same size). The implementations assume ASCII input and will work for inputs up to a reasonably large size. The rolling hash uses a relatively large random prime to guard against poor running times:

As always, if you found any errors in this post please let me know!

CS Level Up Series (30 Part Series)

1) CS Level Up Series Introduction 2) Dynamic Arrays 3 ... 28 3) Linked Lists 4) Stacks 5) Queues 6) Hash Tables 7) Binary Search Trees 8) Binary Heaps 9) Priority Queues 10) Graphs 11) Tries 12) Binary & Bit Manipulation 13) Common Sorting Algorithms 14) Searching Algorithms 15) Permutations, Combinations, & Subsets 16) NP-Complete & Fibonacci Heap 17) Detecting Graph Cycles With Depth-First Search 18) Finding Shortest Paths In Graphs (using Dijkstra's & BFS) 19) Topological Sorting of Directed Acyclic Graphs (DAGs) 20) Finding Articulation Points & Bridges in Undirected Graphs 21) Finding Strongly Connected Components in Directed Graphs using Tarjan's Algorithm 22) Checking If An Undirected Graph Is Bipartite 23) Extending An Iterator 24) Union-find (Disjoint-set) 25) Minimum Spanning Tree (Kruskal's Algorithm) 26) Sliding Window Technique 27) String Searching Using Rabin-Karp 28) Fenwick Tree (Binary Indexed Tree) 29) Dynamic Programming 30) Traveling Salesman Problem

Posted on May 5 by:

Discussion

markdown guide