Sanjeev Sharma
DSA 101: Hashing & Hash Tables

Hey,

In this article, we'll be going over Hash Tables. This is a theoretical introduction to Hashing.

What is Hashing?
Hashing is a technique for storing large datasets as key-value pairs using a hash function. It allows us to do search, insert, and delete operations in O(1) time on average.

Hashing is a technique, whereas a Hash Table is a data structure that follows this technique. It is one of the most used data structures after arrays, in my opinion, which makes it all the more important to understand.
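If you've used JavaScript's built-in Map, you've already used a hash table. Here's a quick demo of those three operations, using only the standard Map API:

```javascript
// In practice, engines implement Map with a hash table under the
// hood, so these operations run in O(1) time on average.
const ages = new Map();

ages.set('alice', 30);          // insert
ages.set('bob', 25);
console.log(ages.get('alice')); // search -> 30
ages.delete('bob');             // delete
console.log(ages.has('bob'));   // -> false
```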

Some of the applications of Hashing/Hash Tables are:

  1. Dictionaries
  2. Database indexing
  3. Cryptography
  4. Caches
  5. Routers

[Image: keys are passed through a hash function, which outputs an index into the hash table]

This is an abstract overview of how hashing works in general. Keys/data are passed to the hash function, which returns an index into the hash table. We insert the data at that index.


Hashing Function

This is the function used to generate indices for a hash table.

Traits of a good hash function:

  1. It should always map the same data/key at the same index.
  2. It should generate an index from 0 to m - 1, where m is the size of the Hash table.
  3. It should be fast; O(1) for integers and O(len) for strings, where len is the length of the string.
  4. It should distribute keys uniformly across the table, even when the keys come from a large range.

Some examples of hash functions are:

  1. For integers: h(key) = key % m, where m is the size of the hash table.
  2. For strings: We could sum the ASCII values of all the characters. But this causes collisions for permutations of the same string, since "abc" and "cab" would get the same index. A better way is to use a weighted sum: (str[0] * x^0 + str[1] * x^1 + str[2] * x^2 + ...) % m, where x is some chosen number and m is the size of the hash table (see the sketch after this list).
  3. Universal Hashing: A random algorithm is chosen from a pool of hashing algorithms.
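As a minimal sketch, here's what the first two functions could look like in JavaScript. The table size m and the multiplier x are arbitrary choices made for illustration:

```javascript
const m = 11; // table size; a prime, chosen arbitrarily for this sketch

// Integer keys: simple modulo hashing.
function hashInt(key) {
  return key % m;
}

// String keys: weighted sum of character codes, so permutations
// of the same string land on different indices.
function hashString(str, x = 31) {
  let sum = 0;
  let power = 1; // x^i, reduced mod m each step to keep numbers small
  for (let i = 0; i < str.length; i++) {
    sum = (sum + str.charCodeAt(i) * power) % m;
    power = (power * x) % m;
  }
  return sum;
}

console.log(hashInt(21));                        // 21 % 11 = 10
console.log(hashString('ab'), hashString('ba')); // 0 3 — permutations differ
```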

From what we have understood so far, it would be foolish not to ask: "Doesn't it happen that two keys get the same index? What happens in such a case?" Well, no matter how good a hashing algorithm is, collisions are bound to happen. Next, we'll see how to tackle these collisions.


Collision Handling

Collisions are unavoidable, but if the keys are known to us well in advance, we can use the Perfect Hashing technique (not in the scope of this article).
For all other arbitrary datasets/keys, we use one of the following two options:

  1. Chaining
  2. Open Addressing

Chaining

The basic idea behind this algorithm is to store colliding values at the same index by chaining them together (mostly in a linked list). Let's see how it works:

[Image: a hash table of size 7 whose slots point to the heads of linked lists]

The hash table here contains the references to the heads of linked lists.

Consider a key 21 that needs to be inserted into the hash table. First, we find the index using the hash function:
21 % 7 = 0, so this key goes into the linked list at index 0. We do the same for every key.

When we encounter a collision, say for key 23 (23 % 7 = 2), we simply extend the linked list at index 2 and start chaining keys.
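Here's a minimal sketch of a chaining table in JavaScript. Plain arrays stand in for the linked-list chains, and key % 7 matches the hash function from the example:

```javascript
class ChainedHashTable {
  constructor(size = 7) {
    this.size = size;
    // Each slot holds a chain; arrays stand in for linked lists here.
    this.slots = Array.from({ length: size }, () => []);
  }

  hash(key) {
    return key % this.size;
  }

  insert(key) {
    const chain = this.slots[this.hash(key)];
    if (!chain.includes(key)) chain.push(key);
  }

  search(key) {
    return this.slots[this.hash(key)].includes(key);
  }

  remove(key) {
    const chain = this.slots[this.hash(key)];
    const i = chain.indexOf(key);
    if (i !== -1) chain.splice(i, 1);
  }
}

const table = new ChainedHashTable();
[21, 23, 16].forEach((k) => table.insert(k)); // 23 and 16 both land on index 2
console.log(table.search(23)); // true
```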

Load Factor

Before figuring out the performance of chaining, it's important to know about the load factor.

α = n / m, where α is the load factor, n is the number of keys to be inserted, and m is the number of slots in the hash table. The higher the load factor, the more collisions we'll have.

On average, the expected chain length is α. The expected time to search, insert, and delete is O(1 + α), where 1 accounts for computing the hash and α for traversing the chain.
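As a quick worked example (the numbers here are made up for illustration):

```javascript
// 14 keys spread across 7 slots:
const n = 14;
const m = 7;
const alpha = n / m;        // load factor α = 2, so chains average 2 nodes
const expected = 1 + alpha; // O(1 + α): about 3 steps per search on average
console.log(alpha, expected); // 2 3
```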

To store the chains, we are not limited to using Linked Lists only. Depending on the situation, arrays and self-balancing BSTs are good options too.

Some disadvantages of Chaining:

  1. Since we're using Linked Lists, it's not cache-friendly.
  2. We need extra space to store the links.

Open Addressing

This is an alternative to Chaining. It only uses arrays, so it's cache-friendly. There are three algorithms to look into:

  1. Linear Probing

The general idea behind this algorithm is to linearly search for the next available slot whenever we encounter a collision. Let's understand this with an example:

[Image: linear probing in a hash table of size 7 with keys 49, 50, 51, 16, and 56]

In this example, the keys 49, 50, and 51 are inserted directly into the hash table without any collisions. When we try to insert 16, we have a collision at index 2. So, we just jump to the next empty slot, which is index 3, and insert it there.

Same for key 56: we get index 0 from our hash function, but it's already occupied. Unlike key 16, the next index, 1, is occupied too. So we keep jumping to the next slot until we find an empty one. Finally, after 4 jumps, we find an empty slot at index 4.

The table should be traversed in a circular fashion, i.e. if we reach the end, we wrap around to index 0. If we come back to the index we started from, the table is full and there is no empty slot.

For searching too, we use the same technique:

  1. Find the index using the hash function.
  2. Check if the value matches. If it doesn't, jump to the next slot.
  3. Keep jumping until an empty slot is found. At that point, we're sure the key we're looking for doesn't exist, so we can safely return false.

❗ Note: when deleting keys from the hash table, don't make the slots empty again; this will break the search process. Instead, mark them as deleted. Generally, libraries store a reference to a special sentinel value to mark slots as deleted. (Try a dry run to see how searching breaks if you empty the slots: the search stops at the first empty slot, so a key that lies further along the probe sequence would be wrongly reported as missing.)
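Below is a sketch of linear probing with tombstone deletion. EMPTY and DELETED are sentinel values made up for this example, and key % size plays the role of the hash function:

```javascript
const EMPTY = undefined;
const DELETED = Symbol('deleted'); // tombstone marker

class LinearProbingTable {
  constructor(size = 7) {
    this.size = size;
    this.slots = new Array(size).fill(EMPTY);
  }

  hash(key) {
    return key % this.size;
  }

  insert(key) {
    const start = this.hash(key);
    for (let i = 0; i < this.size; i++) {
      const probe = (start + i) % this.size; // circular traversal
      if (this.slots[probe] === EMPTY || this.slots[probe] === DELETED) {
        this.slots[probe] = key;
        return true;
      }
    }
    return false; // back where we started: the table is full
  }

  search(key) {
    const start = this.hash(key);
    for (let i = 0; i < this.size; i++) {
      const probe = (start + i) % this.size;
      if (this.slots[probe] === EMPTY) return false; // a real hole ends the search
      if (this.slots[probe] === key) return true;    // tombstones are skipped over
    }
    return false;
  }

  remove(key) {
    const start = this.hash(key);
    for (let i = 0; i < this.size; i++) {
      const probe = (start + i) % this.size;
      if (this.slots[probe] === EMPTY) return false;
      if (this.slots[probe] === key) {
        this.slots[probe] = DELETED; // tombstone, never EMPTY
        return true;
      }
    }
    return false;
  }
}

const table = new LinearProbingTable();
[49, 50, 51, 16, 56].forEach((k) => table.insert(k)); // same keys as the example
console.log(table.search(16)); // true, found at index 3 after one jump
```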

The problem with linear probing is clustering. For large datasets, we observe clusters of keys forming in parts of the hash table (primary clustering). This hinders performance, as we now have to traverse more slots to insert or search for a key.

  2. Quadratic Probing

This is a slight improvement over linear probing. Instead of jumping one slot at a time, we jump in a quadratic fashion.

For example:
h(key, i) = ((key % m) + i^2) % m
If we have a collision, we first jump by 1 slot, then by 4 slots, then 9, and so on...
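As a sketch, the probe positions can be computed like this, reusing key 16 and table size 7 from the linear probing example:

```javascript
// i-th probe position for quadratic probing: jump by 0, 1, 4, 9, ... slots.
function quadraticProbe(key, i, m) {
  return ((key % m) + i ** 2) % m;
}

// Probe sequence for key 16 in a table of size 7:
for (let i = 0; i < 4; i++) {
  console.log(quadraticProbe(16, i, 7)); // 2, 3, 6, 4
}
```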

There are two problems with this approach:

  1. Instead of primary clusters we get secondary clusters. They're milder than linear probing's clusters, but still a problem.
  2. There's a high chance of missing slots: the probe sequence may never visit some empty slots, so an insert can fail even when the table isn't full.

  3. Double Hashing

This algorithm uses a second hash function to resolve collisions. It eliminates clustering and distributes the keys more uniformly.

hash(key, i) = (h1(key) + i * h2(key)) % m, where h2(key) should be relatively prime to m.

Let's understand this with an example:
[Image: double hashing in a hash table of size 11, with h1(key) = key % 11 and h2(key) = 8 - (key % 8)]

Our h1 is key % 11 and h2 is 8 - (key % 8). It's a good move to make the table size a prime number so that h2 and m are relatively prime. Since our m is 11, h2's value just needs to stay below 11; here we have chosen the constant 8, so h2 returns values from 1 to 8.

For the first two keys, 20 and 34, we find the index using h1 alone. But for key 45, h1 gives us index 1, which is already occupied by 34.

Therefore, we use h2 to get another index. Here the value of i should be 1, as this is the first jump. For successive jumps, we increase i by 1.

Using h2(45) = 8 - (45 % 8) = 3, we get the new index (1 + 1 * 3) % 11 = 4. We can fill in the rest of the keys in a similar manner. If even after the first jump we cannot find an empty slot, we run the function again with i = 2, and so on.

Done well, double hashing distributes keys far more uniformly than linear or quadratic probing.
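Here's a sketch of the probe computation with the same h1 and h2 as above, for a table of size m = 11:

```javascript
const m = 11; // prime table size, so h2's values are relatively prime to m

const h1 = (key) => key % m;
const h2 = (key) => 8 - (key % 8);

// i-th probe position for double hashing.
function doubleHash(key, i) {
  return (h1(key) + i * h2(key)) % m;
}

// Key 45 collides with 34 at index 1, since h1(45) = h1(34) = 1.
// The first jump uses h2(45) = 8 - 5 = 3: (1 + 1 * 3) % 11 = 4.
console.log(doubleHash(45, 0)); // 1 (occupied by 34)
console.log(doubleHash(45, 1)); // 4
```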

Disadvantages of Open Addressing:

  1. The table may reach its capacity, so resizing is required.
  2. Extra care is required to keep clustering in check.
  3. Extra space (a larger, emptier table) is required to match Chaining's performance.

That's all folks! ✌️

I hope you liked this article. I'll be writing more articles on data structures and algorithms (in JavaScript). Stay connected if you're interested.

🌏 https://thesanjeevsharma.now.sh
