DEV Community

loading...
Cover image for Day 002: Why are hashes cached in dictionaries?

Day 002: Why are hashes cached in dictionaries?

rj722 profile image Rahul Jha ・2 min read

Taking a closer look at how key-value pairs are stored in Python dictionaries yesterday, I noticed that alongside storing the key+value, the hash of the key is also stored.

Why would that be?

It can't be to speed up the lookup, as we cannot get to the data (the tuple where the hash+key+value is stored) until we know it's address, and address can only be known by computing the hash first.

Why then?

Turns out that when resizing the addresses vector, the hash for all the keys in items needs to be recomputed. This is expensive to calculate, thus slowing down the resize operation. Caching the hash speeds things up:

def resize_to(n):
    new_addresses = [None] * n
    for index, entry in enumerate(entries):
         h = perturb = entry[0]  # No need to compute the hash.
         addr = h % n
         while new_addresses[i] is not None:
             addr = (5 * addr + 1 + perturb) % n
             perturb >>= 5
         new_addresses[addr] = index
    return new_addresses
Enter fullscreen mode Exit fullscreen mode

Other than this, storing the hash has another benefit -- during lookup, there's an inequality test required, which ensures that we've arrived at the correct hash+key+value pair, and need not probe any further.

The lookup code from the earlier post:

def lookup(key):
    perturb = h = hash(key)
    address = addresses[h % n]

    while items[address][1] != key: # <- inequality test

        address = (5 * address + 1 + perturb) % n
        perturb >>= 5
    return items[address][2]
Enter fullscreen mode Exit fullscreen mode

Comparing two objects for [in]equality can be an expensive operation. The solution here is to rely on the fact that "if two objects have unequal hashes, then the objects must be unequal as well", and use an equality checker like below:

def fast_is_equal(
    key, target_key, key_hash, target_key_hash
):
    if key is target_key: return True  # Identity implies equality
    if hash_key != hash_target_key: return False
    return key == hash_key
Enter fullscreen mode Exit fullscreen mode

And then during the lookup:

# ...
    while not fast_is_equal(
        # reference: items[address] -> (hash, key, value)
        items[address][1], key, items[address][0], h
    ):
# ...
Enter fullscreen mode Exit fullscreen mode

So, my initial guess was right in a way: it does indeed speeds up the lookup, just not in a way I thought it'd be.

Discussion (0)

pic
Editor guide