DEV Community

Phyllipe Bezerra
Phyllipe Bezerra

Posted on • Edited on

File Compression: Understanding the Huffman Coding Algorithm

Hello! First of all I just wanna to make clear that this article is a result of me just trying to understand something new. Feel free to suggest or point out necessary corrections!

What is Huffman Enconding?

Huffman encoding is a lossless data compression algorithm that assings variable-length encoding "words" to fixed-length input characters based on their frequencies of occurence. These words are chosen in a way that more frequent characters have shorter words and less frequent characters have longer words.

Why Variable-length encoding?

Before that, let's understand what's the problem with fixed-length encoding. It's quite simple:

Fixed-length enconding waste space

See for example, the string "test tested test", since "test" appear three times we would be wasting space encoding this string with a fixed-length approach.

The solution

And here comes the variable-length approach:

  • Encoding of different lengths for different characters
  • Assing shorter encodings to frequently occuring characters

For exemple, take the word "test". First let's check the characters frequencies of occurence:

Characters Frequency
t 2
e 1
s 1

Now create a linked list in ascending order based on the frequency, the list node element will have a character space, frequency space, a pointer to the next list node, and to make things simple, right and left pointers that we will use to create a binary tree soon. So it would be something like this:

A linear list containing 3 elements, each element is formed by 3 horizontal rectangles vertically stacked and at the bottom 2 side-by-side squares, each of this elements have an arrow pointing to the next, the order is S, E and T, with 1, 1 and 2 frequency values respectively.

Now we need to remove and "merge" the nodes with the two lowest frequencies, forming a new node that will be the parent node. The parent node will have the character value as an asterisk (*) and the frequency value will be the sum of its children. Also the left and right pointers will point to the lowest frequency child and the second lowest frequency child respectively. So for this step with our list we would get something like this:

A list element at the top containing an asterisk as character and with frequency value as 2, right above at the left there is a child element with values S and 1, to character and frequency respectively, and at the right another child element with values R and 1. Forming a binary tree with 2 leaves.

And then we can put this tree at the tail of our list, and it will be like this:

The same previously binary tree with the next pointer pointing to the last list element with values T and 2.

While the head element of the list is not a parent node, i.e a node with character as asterisk, we need to repeat the previously step, so our list will be like this at the end of the second iteration:

A binary tree with root element with values * and 4, the left child is the tree mentioned before and the right child is an element with values T and 2.

At this point, we already can get the encoding for our characters. To get the enconding we need to go from the root node to the leaves tracking the direction, if we go left we can add the value 0 otherwise add 1 to the tracking value, for example:

The binary tree previously mentioned but each left arrow to left child have value 0 and each right arrow to right child have value 1.

To get the encoding for our character 'S' we need to go to the left and then left again, that way our enconding will be 00. For E character we need to go to the left and then right, so our encoding will be 01, and the T character we need to only go to the right and the encoding will be 1, at the end we would get the encodings as:

Characters Frequency Encoding
t 2 1
e 1 01
s 1 00

As you can see here we already have achieved the two rules previously mentioned:

  • Encoding of different lengths for different characters
  • Assing shorter encodings to frequently occuring characters

Now let's write down the encoded value of our string "test" and the binary value:

  • Binary Value: 01110100 01100101 01110011 01110100 - Length: 32
  • Encoded Value: 1 01 00 1 - Length: 6

As you can see here, we got a length reduction of 81%!

And that's my basic comprehension of the huffman encoding algorithm, next I will be probably coding a file compressor using the huffman encoding approach and I will leave the github repo link once done.

While that feel free to suggest or point out necessary corrections!

Top comments (0)