Before diving into hash tables, let us quickly review two other data structures: arrays and linked lists.
Arrays
An array is used to store elements of the same type in contiguous memory locations. Since contiguous memory locations need to be reserved, you need to specify the size of the array upfront (when creating the array). Any element in the array can then be accessed in constant time since we know the exact memory location of each element in the array.
Since contiguous memory locations are used, we know exactly where to find the element at index 2 given the memory address at which the array starts.
Linked Lists
A linked list is also used to store a collection of elements. However, each element in the list is stored at a different location in memory. Each element in the list has a link pointing to the next element in the list.
Starting from the first element in the list (also known as the head of the list), we can follow this chain of links pointing to the next element in the list to sequentially access every item in the list. The last item in the list doesn't have a link pointing to the next element. That's how we know that we have reached the end of the list.
Since the items in a linked list can be stored anywhere in memory, we do not need to specify upfront what the size of the list should be. The linked list grows and shrinks dynamically as elements are added to and removed from it. However, we lose the ability to directly access elements at a given index in the list. Instead, we have to start at the head of the list and follow the links to the next item until we arrive at the index we want.
Hash Tables - the middle ground
Arrays give us the advantage of quick random access, but require us to define a size in advance. On the other hand, linked lists allow us to add and remove items without the need to specify a size. However, they take away our ability to access an item at a given index in constant time.
Hash tables give us the best of both worlds. They allow us to add, look up and remove elements in constant time1, and they also allow us to add and remove elements without having to specify a list size upfront.
The Hash Function
When storing an element in a hash table, it is first passed through a hash function. The output of the hash function is the location in the hash table where the element should be stored.
Since we need to look up the element from the table later, we need to ensure that when identical elements are passed through the hash function, it produces the same value.
However, it is possible that when different elements are passed through the hash function, it produces the same value. When this happens, we call it a collision.
A good hash function minimizes the chance of collision, but collisions cannot be eliminated entirely.
Handling Collisions
There are two main ways of dealing with collisions: linear probing and chaining.
Linear Probing
When using linear probing, when there is a collision, we just keep looking at the next available location in the table until we find a free slot (or realize that the table is full and we can't store any more elements2).
When using this approach, we need to adjust the way we look up values from the hash table as well. If we hash an element and don't find it at the index specified by the hash function, then there is still a possibility that it was stored at the next available free slot in the table. We therefore have to look other locations in the table, one at a time, starting from the index specified by the hash function to ensure that the element really isn't stored in the hash table. That sounds very similar to traversing all elements in a linked list!3
Chaining
The other way we can deal with collisions is by storing all elements that get mapped to the same index in a linked list within the hash table.
We are now not restricted by the size of the hash table when storing elements, since the elements themselves are stored in a linked list outside the hash table. The hash table just contains links to the head of the various linked lists.
However, if we have too many collisions and a lot of elements get mapped to the same index then the linked list for that index grows. We then have to traverse the entire linked list to find the element we are looking for. This is why it is important that we use a hash function that minimizes collisions.
Choosing a Good Hash Function
A good hash function minimizes the number of collisions by trying to uniformly distribute values over the entire hash table. It also uses all the information available in the element to calculate the hash value, and tries to map similar elements to different locations in a hash table. Finally, a hash function should be very fast to compute since we use it every time we insert into or look up a value in a hash table.
Complexity Analysis
Assuming there are no collisions, it takes constant time to insert into and look up values from a hash table. The actual time taken for these operations depends on how fast the hash function is. However, when there are collisions, the hash table performs slightly worse.
When using chaining, it still takes a constant time to add a new element to the hash table since adding a new element at the beginning of a linked list can be done in constant time. When using linear probing, however, we may take up to O(n)
time (where n
is the size of the hash table) in the worst case where the hash table is full and we probe the entire table looking for a free slot.
When looking up values in a hash table during a collision, we take O(n)
time (where n
is the size of the hash table) since in the worst case, all elements are mapped to the same index and therefore we have to traverse the entire list of all elements to find the one we are looking for. This applies to deleting elements from a linked list as well, since to delete an element we first need to find it.
We can use a balanced search tree instead of a linked list while chaining to reduce the worst case time complexity to O(log n)
.
Conclusion
Although hash tables provide a good balance between an array and a linked list, they are by no means a silver bullet.
Since their worst case performance is O(n)
, if you know how many elements you need to store upfront, you would be better off using an array which guarantees adding and updating elements in constant time.
Also, if you care about the order of items in the list and don't know how many items you need to store upfront, then you need to use a linked list since a hash table does not store this information.
Resources
- Google Tech Dev Guide
- Hash Tables - CS50 - YouTube
- Introduction to Data Structure - Hash Table - LeetCode
-
The operations take constant time on average, and perform worse in the worst case. See the complexity analysis section for more details. ↩
-
When this happens, we can tell the user that the hash table is full, but then we wouldn't be doing any better than an array. Instead, we can dynamically increase the size of the hash table once it is full and redistribute the existing items and make space for new items. ↩
-
The worst case time complexity using this approach is in fact O(n) - the same as that of finding a value in a linked list. ↩
Top comments (1)
Hi! nice article, just to complement it: hash tables have an average and best case of O(1), but worst case of O(n), so we want to avoid collisions like you wrote, another strategy to avoid collisions is double hashing, this reduces the clusters of elements that slow down our search function (because if we are searching and find a empty spot, the element does not exist, so "holes" are good)