Mateus

Posted on

# Data Structures - Hash Tables

Introduction

Data structures are a fundamental part of initial programming studies. Understanding different data structures and knowing how to use them is essential. While studying the course: "Learn What Data Structures and Algorithms Are," the topic of hash tables was covered more superficially than other data structures. I decided to seek more information online. As a good apprentice developer, I googled it and soon found several videos discussing the topic. However, these definitions were also superficial, leading to inevitable frustration. I then tried ChatGPT, but believe it or not, it was down, and I couldn't get any answers. So, I sought help from Copilot (thanks to IFPR and their great help from education packages =D) and managed to understand a bit more than I had from the course. I decided to write this article to learn more and get feedback from those who understand the subject better. So, let's get into the little I understood.

Definition

A hash table is a data structure that associates keys with values, allowing for fast and efficient searches, making it ideal for indexing and data access. Each key-value pair is called a "slot." Hash tables are widely used in user authentication systems to store passwords securely; we will discuss this in more detail later.

An example of using hash tables is:

Associative arrays: Mapping keys (e.g., names) to values (e.g., phone numbers).

```Index 0: "Paula" | "7198749874" <- slot 1 Index 1: "Antenor" | "71999999999" <- slot 2 Index 2: <- slot 3 Index 3: <- slot 4 Index 4: "Fernando" | "11999994444" <- slot 5```

In this example, we have a hash table that maps a person's name to their phone number.

Sets: Membership verification (e.g., checking if an element is in a set). The set elements are not necessarily ordered.

For example, the set of prime numbers (2, 3, 5, 7, 11). To verify membership, we use the in operation.

Caches: Temporary storage areas for frequently used data. For example, a cache of query results from a database prevents repeated queries to the database since the data will be used again. When a query is made, the cache is checked first. If the data is there, it avoids another query, reducing data flow and speeding up the application.

In summary, hash tables are a way to index data. They are designed to allow quick access to values based on associated keys, even when retrieving information from large data sets. The hash code (index) is calculated and mapped to an index in the array, and the values are stored in the corresponding indices.
Hash Function

A hash function maps the key to an index in the hash table. Let's see an example. Suppose we have a hash table with 10 indices numbered from 0 to 9. We want to store keywords in this table. We will create a basic hash function to store the words: "cat," "dog," "lion," "goat," "jaguar," "bear," "elephant," "bull," "snake," "alligator." To populate the hash table, we will use the following function: the remainder of the division between the sum of the ASCII values of each character in the words by the number of table positions.

1st word: "cat" => (103 + 97 + 116 + 111) % 10
-> 427 % 10 = 7, this is the index (hash) where we will store the word "cat."

2nd word: "dog" => (99 + 97 + 99 + 104 + 111 + 114 + 114 + 111) % 10
-> 849 % 10 = 9, this is the hash for the word "dog."

This process will be followed for all words. It is not uncommon, especially in this case where the number of table indices is limited, for a word to need to be stored at the same index as another word. This is called a "collision." Collisions are common in hash functions.

Collisions

Example of a Hash Function

Consider a simple hash function that computes the hash value by summing the ASCII values of the characters in the key and then taking the modulus with the size of the hashmap.

For example, suppose our hashmap has 10 buckets and our hash function is:

```def hash_function(key): return sum(ord(char) for char in key) % 10 ```

Let's compute the hash values for two different keys:

``````Key: "cat"
ASCII values: c (99), a (97), t (116)
Sum: 99 + 97 + 116 = 312
Hash value: 312 % 10 = 2

Key: "act"
ASCII values: a (97), c (99), t (116)
Sum: 97 + 99 + 116 = 312
Hash value: 312 % 10 = 2
``````

Both "cat" and "act" produce the same hash value of 2 because of their same ASCII values summed up, leading to a collision.

But then the question arises: how to store the words if different words cannot be stored in the same index? There are some ways to resolve collisions.

Resolving Collisions

The most common methods for resolving collisions are: Separate Chaining, Open Addressing, and Perfect Hashing (best-case scenario).

Separate Chaining: In this case, each slot in the table contains a linked list of elements with the same index. When a collision occurs (i.e., two keys are mapped to the same slot), the elements are simply added to the corresponding list. Suppose the words "dog" and "elephant" map to the same slot; then the hash table would look like this:

```Index 0: Index 1: Index 2: Index 3: Index 4: Index 5: Index 6: Index 7: 7 | "cat" Index 8: Index 9: 9 | "dog" -> "elephant" ```

Open Addressing: Another way to resolve collisions is through Open Addressing or linear probing, which involves trying to find another free slot in the table. Linear probing is a common technique where slots are checked sequentially until an empty one is found. In the example of the collision between "dog" and "elephant," the word "elephant" would be stored in the slot with index 0, as it is the next empty slot. If slot 0 were occupied, then slot 1 would be tried, and so on. In this case, a free slot will always be found since there are 10 words for 10 slots.

Perfect Hashing: Perfect hashing is the ideal situation but is not always possible. It means that each key has a unique index in the hash table without collisions. If we had a perfect hash function, each word would be perfectly indexed, and there would be no collisions.

There are strategies aimed at finding a good hash function that ensures maximum spreading across the table. Perfect hashing is when, given a table of n positions and n elements to be inserted, there is no collision. One strategy for choosing the hash function is the Cichelli method, which aims to minimize collisions by making some adjustments to the hash to differentiate it from the already calculated hash. This adjustment can be called a "salt," which is, for example, a constant added to the hash, creating different possible values for collisions. This salt must be stored in some field of the table, so when attempting to access the value, it is added to the hash, and the values match. This may be a bit confusing.

Remember the example where the words "dog" and "elephant" collided. The salt could be the hash divided by 2 or the number 1, for example. But the salt is added only to the second value that caused the collision, so it would only affect the hash of the word "elephant."

For example:

`Index 9: 9 | "dog"

Hash calculated for "elephant" is 9; adding the salt 1, the hash becomes 10 % 10 = 0, so the word "elephant" would be stored in the slot with index 0.

Index 0: 0 | "elephant"
Index 7: 7 | "cat"`

The hash table data structure has a complexity of Big O(1) notation because the search is immediate, making it ideal.

Application

A very common application for hash tables is user authentication systems. The user creates a password, a hash is applied to this password, and the hash of the password is stored in the database, never storing the actual password typed by the user. This ensures the security of such sensitive information. Several hashes are used for this: MD4, MD5, SHA-1, SHA-256, SHA-512.

But if the password is not stored, only the hash of the password, how do you know if the user entered the correct password? Simple, the password entered by the user is hashed again, and the newly generated hash is compared with the stored hash in the database. This is how authentication occurs.

Another use of hashing is file verification. How do hash functions ensure the integrity of files? When the hash of a file is calculated, a compact representation of the data is obtained. If there is any alteration in a file, the resulting hash will be different. Therefore, such a comparison allows you to verify if the file remained unchanged. This comparison can be done before and after downloading a file, for example.

Another application of hash tables is digital signatures. Digital signatures use hash functions to verify the authenticity of documents. The hash of the document is encrypted with the sender's private key, creating a digital signature. The recipient can verify the signature by decrypting the hash with the sender's public key.

An example of using hash tables for a small user authentication system. Here I will add Python code. The user creates a password, and the password is stored in a hash table. At login, the password entered by the user is hashed again and then checked if the hash values are identical. For this example, I will use MD5 hashing. I will use the Python hashlib module to work with hash functions.

``````import hashlib

# Create an MD5 object
md5_hash = hashlib.md5()
# Update the MD5 object with the password as a hexadecimal string
# Return the MD5 hash as a hexadecimal string
return md5_hash.hexdigest()

# Check if the username is in the hash table and if the hashes match

# Create an MD5 object
md5_hash = hashlib.md5()
# Update the MD5 object with the password as a hexadecimal string
# Return the MD5 hash as a hexadecimal string
return md5_hash.hexdigest()

# Check if the username is in the hash table and if the hashes match
else:

# Registering users
register_user('Paula', '1234')
register_user('Antenor', 'abcd')

# Registering users
register_user('Paula', '1234')
register_user('Antenor', 'abcd')

``````

In the example, there is no method to handle collisions. What if there is a collision between the passwords registered by different users? Let's try to handle collisions. In this example, we will use the chaining technique, where each entry in the hash table is a linked list (i.e., a list of passwords with the same hash value). If there is a collision, we add a new password to the existing list. To add collision handling, we will modify the method for adding a user, as this is where values are inserted into the hash table. The method will look like this:

``````def add_user(user, password):

# Check if the user already exists in the table