Hash Functions and Hash Codes

#java #programming #learning #beginners

A typical hash function first converts a search key to an integer value called a hash code, then compresses the hash code into an index to the hash table. Java’s root class Object has the hashCode method, which returns an integer hash code. By default, the method returns the memory address for the object. The general contract for the hashCode method is as follows:

You should override the hashCode method whenever the equals method is overridden to ensure that two equal objects return the same hash code.
During the execution of a program, invoking the hashCode method multiple times returns the same integer, provided that the object’s data are not changed.
Two unequal objects may have the same hash code, but you should implement the hashCode method to avoid too many such cases.

Hash Codes for Primitive Types

For search keys of the type byte, short, int, and char, simply cast them to int. Therefore, two different search keys of any one of these types will have different hash codes.

For a search key of the type float, use Float.floatToIntBits(key) as the hash code. Note that floatToIntBits(float f) returns an int value whose bit representation is the same as the bit representation for the floating number f. Thus, two different search keys of the float type will have different hash codes.

For a search key of the type long, simply casting it to int would not be a good choice, because all keys that differ in only the first 32 bits will have the same hash code. To take the first 32 bits into consideration, divide the 64 bits into two halves and perform the exclusive-or operation to combine the two halves. This process is called folding. The hash code for a long key is

int hashCode = (int)(key ^ (key >> 32));

Note that >> is the right-shift operator that shifts the bits 32 positions to the right. For example, 1010110 >> 2 yields 0010101. The ^ is the bitwise exclusive-or operator. It operates on two corresponding bits of the binary operands. For example, 1010110 ^ 0110111 yields 1100001.

For a search key of the type double, first convert it to a long value using the Double.doubleToLongBits method, and then perform a folding as follows:

long bits = Double.doubleToLongBits(key); int hashCode = (int)(bits ^ (bits >> 32));

Hash Codes for Strings

Search keys are often strings, so it is important to design a good hash function for strings. An intuitive approach is to sum the Unicode of all characters as the hash code for the string. This approach may work if two search keys in an application don’t contain the same letters, but it will produce a lot of collisions if the search keys contain the same letters, such as tod and dot.

A better approach is to generate a hash code that takes the position of characters into consideration. Specifically, let the hash code be

s0*b(n - 1) + s1*b(n - 2) + c + sn-1

where si is s.charAt(i). This expression is a polynomial for some positive b, so this is called a polynomial hash code. Using Horner’s rule for polynomial evaluation (see Case Study: Converting Hexadecimals to Decimals), the hash code can be calculated efficiently as follows:

(...((s0*b + s1)b + s2)b + ... + sn-2)b + sn-1

This computation can cause an overflow for long strings, but arithmetic overflow is ignored in Java. You should choose an appropriate value b to minimize collisions. Experiments show that good choices for b are 31, 33, 37, 39, and 41. In the String class, the hashCode is overridden using the polynomial hash code with b being 31.

Compressing Hash Codes

The hash code for a key can be a large integer that is out of the range for the hash-table index, so you need to scale it down to fit in the index’s range. Assume the index for a hash table is between 0 and N-1. The most common way to scale an integer to between 0 and N-1 is to use

h(hashCode) = hashCode % N

To ensure that the indices are spread evenly, choose N to be a prime number greater than 2.

Ideally, you should choose a prime number for N. However, it is time consuming to find a large prime number. In the Java API implementation for java.util.HashMap, N is set to a value of the power of 2. There is a good reason for this choice. When N is a value of the power of 2,

h(hashCode) = hashCode % N

is the same as

h(hashCode) = hashCode & (N – 1)

The ampersand, &, is a bitwise AND operator. The AND of two corresponding bits yields a 1 if both bits are 1. For example, assume N = 4 and hashCode = 11, 11 % 4 = 3, which is the same as 01011 & 00011 = 11. The & operator can be performed much faster than the % operator.

To ensure that the hashing is evenly distributed, a supplemental hash function is also used along with the primary hash function in the implementation of java.util.HashMap. This function is defined as:

private static int supplementalHash(int h) { h ^= (h >>> 20) ^ (h >>> 12); return h ^ (h >>> 7) ^ (h >>> 4); }

^ and >>> are bitwise exclusive-or and unsigned right-shift operations. The bitwise operations are much faster than the multiplication, division, and remainder operations. You should replace these operations with the bitwise operations whenever possible.

The complete hash function is defined as:

h(hashCode) = supplementalHash(hashCode) % N

This is the same as

h(hashCode) = supplementalHash(hashCode) & (N – 1)

since N is a value of the power of 2.