DEV Community

M Muiz Hassan
M Muiz Hassan

Posted on

String Hashing in C++

Hashing is an algorithm that, given any input, results in a fixed size output called hash. Today, we use hashing algorithm in data structures, cryptography, and searching etc. In this tutorial we will implement a string hashing algorithm in C++ and use it in a data structure called hash table.

Polynomial Rolling Algorithm

We shall use Polynomial Rolling algorithm to generate a hash code from a string input. It uses the following formula:

hash = x₁*A⁰ + x₂*Aⁱ + x₃*A² ... xₙ*Aⁿ⁻ⁱ
Enter fullscreen mode Exit fullscreen mode

We get x from the ASCII code of a character in the string. We then multiply this ASCII code with "A", a prime number. We use a prime number to evenly distribute the values. The prime number should be close to number of available characters. For English alphabets 31 is a good empirical choice. The exponent ensures a different hash code for anagrams.

Now let's implement this algorithm in the form of a function.

#include <cmath>
#include <iostream>
using namespace std;

const int PRIME_CONST = 31;

int hashFunction(string key) {
    int hashCode = 0;
    for (int i = 0; i < key.length(); i++) {
        hashCode += key[i] * pow(PRIME_CONST, i);
    }
    return hashCode;
}

int main() {
    cout << hashFunction("abc") << endl;

    return 0;
}
Enter fullscreen mode Exit fullscreen mode

In C++ we do not need to convert a character to ASCII code, as the conversion is implicit. This will print the hash of string "abc" on console.

Using the hash function in a hash table

A hash table is a data structure that uses a hashing algorithm to store key-value pairs. It is known as HashMap in Java and Dictionary in Python. In C++ it called unordered_map, which is part of unordered_map standard library.

Mapping Values

To map our values in the hash-table array we take modulus of the hash by the array's size:

index = hash code % SIZE_OF_ARRAY
Enter fullscreen mode Exit fullscreen mode

This number is always less than the size of array and we can use it as the index for the value. The hash function will always produce the same index for same string key, that is the beauty of the hash function.

We can use the hash-function implementation above, but there is a huge problem. The hash code can overflow for very large values. We need a way to use our hash code for direct addressing without losing to overflow.

Fortunately, we have this formula to save the day:

(a + b) % c = (a % c + b % c) % c
Enter fullscreen mode Exit fullscreen mode

and for n terms:

Σx % c = Σ(x % c) % c
Enter fullscreen mode Exit fullscreen mode

The modulus shrinks the value in each iteration, so we can avoid getting a huge number. That's how we can map very long strings.

Now we create a HashTable class to use our fancy hash function.

#include <cmath>
#include <iostream>
#include <algorithm>
using namespace std;

class HashTable {
    static const int PRIME_CONST = 31;
    static const int ARR_SIZE = 3001;
    int arr[ARR_SIZE];

    public:
        HashTable() {
            fill_n(arr, ARR_SIZE, 0);
        }

        void put(string key, int value) {
            int index = hashFunc(key);
            arr[index] = value;
        }

        static int hashFunc(string key) {
            size_t sum;
            for (int i = 0; i < key.length(); i++) {
                sum += (key[i] * (int)pow(PRIME_CONST, i)) % ARR_SIZE;
            }
            return sum % ARR_SIZE;
        }

        void printAll() {
            for (int i = 0; i < ARR_SIZE; i++) {
                if (arr[i] > 0) {
                    cout << arr[i] << ", ";
                }
            }
            cout << endl;
        }
};

int main() {
    HashTable hTable;
    hTable.put("abc", 1);
    hTable.put("longString", 200);
    hTable.put("veryLongString", 3000);
    hTable.printAll();

    return 0;
}
Enter fullscreen mode Exit fullscreen mode

The length of array should be slightly larger than the count of data to be inserted. We use a prime number as array size to reduce collisions. Above code prints the non-zero values, inserted in the hash table, on console.

A collision occurs when we get the same index for two nodes. We use techniques such as separate chaining and open addressing for collision handling. The implementation in this tutorial does not handle collisions.

And that concludes this tutorial. We learned how to implement a hash function and use it in a data structure.


This tutorial is my attempt to help beginners learn hashing by collecting all the fundamental concepts at one place. I hope you find it beneficial.

Discussion (0)