Day 002: Why are hashes cached in dictionaries?

#100daysofcs #python #computerscience #todayilearned

Taking a closer look at how key-value pairs are stored in Python dictionaries yesterday, I noticed that alongside storing the key+value, the hash of the key is also stored.

Why would that be?

It can't be to speed up the lookup, as we cannot get to the data (the tuple where the hash+key+value is stored) until we know it's address, and address can only be known by computing the hash first.

Why then?

Turns out that when resizing the addresses vector, the hash for all the keys in items needs to be recomputed. This is expensive to calculate, thus slowing down the resize operation. Caching the hash speeds things up:

def resize_to(n):
    new_addresses = [None] * n
    for index, entry in enumerate(entries):
         h = perturb = entry[0]  # No need to compute the hash.
         addr = h % n
         while new_addresses[i] is not None:
             addr = (5 * addr + 1 + perturb) % n
             perturb >>= 5
         new_addresses[addr] = index
    return new_addresses

Other than this, storing the hash has another benefit -- during lookup, there's an inequality test required, which ensures that we've arrived at the correct hash+key+value pair, and need not probe any further.

The lookup code from the earlier post:

def lookup(key):
    perturb = h = hash(key)
    address = addresses[h % n]

    while items[address][1] != key: # <- inequality test

        address = (5 * address + 1 + perturb) % n
        perturb >>= 5
    return items[address][2]

Comparing two objects for [in]equality can be an expensive operation. The solution here is to rely on the fact that "if two objects have unequal hashes, then the objects must be unequal as well", and use an equality checker like below:

def fast_is_equal(
    key, target_key, key_hash, target_key_hash
):
    if key is target_key: return True  # Identity implies equality
    if hash_key != hash_target_key: return False
    return key == hash_key

And then during the lookup:

# ...
    while not fast_is_equal(
        # reference: items[address] -> (hash, key, value)
        items[address][1], key, items[address][0], h
    ):
# ...

So, my initial guess was right in a way: it does indeed speeds up the lookup, just not in a way I thought it'd be.

DEV Community

Day 002: Why are hashes cached in dictionaries?

Top comments (0)

Read next

💡Never forget again: Build a scheduled reminder app in <50 lines of Python using REST and Postgres

Python for DevOps: A Comprehensive Guide from Beginner to Advanced

Python Decorators: Simplifying Code

The Role of AI in Modern Cybersecurity: Tackling Social Engineering Threats with Advanced Defenses