Less basic compression: Huffman Coding

#python3 #losslesscompression #huffmancoding

In response to my last post I was given some good advice:

Nice article, look into Huffman encoding it kind of does the same thing for binary files and the particular algorithm was invented when he was a student.

So I ran off to Wikipedia and had a read.

It took me a while, certainly longer than the afternoon my simplistic effort took, to get my head around the compression and decompression algorithms. I wrote code as I went, and in the end I came up with a working proof of concept. (It's in the same repo as last time.):

Mellen / file_compression

A python 3 programme for compressing files

File Compression

A exploratory look at how to compress files.

tc.py uses my basic understanding of how losses compression should work (see https://dev.to/mellen/the-basic-basics-of-compression-lap to see how I came up with it.)

bc.py is an implementation of Huffman coding, that works for any type of file. It is a lot better at compression compared to my basic implementation.

(Requires python 3)

tc.py is a text file compression programme

 usage: tc.py [-h] [-d] file name

 positional arguments:  
    file name   The name of the file to be processed.

 optional arguments:  
    -h, --help  show this help message and exit  
    -d          Decompress the file (defaults to compressing).

bc.py is a compression programme for any type of file

 usage: bc.py [-h] [-d] file name

 positional arguments:  
    file name   The name of the file to be processed.

 optional arguments:  
    -h, --help  show this help message and exit  
    -d          Decompress the file (defaults to compressing).

View on GitHub

If you just want to jump into the code, the files compressor.py, decompressor.py and binarytree.py have the important code, specifically BinaryCompressor, BinaryDecompressor, Leaf, Node, and buildTree.

The first thing every article about Huffman coding will talk about is building a binary tree. I don't think it's in the scope of this post to explain binary trees (for no better reason than I'm not an expert in them), so Wikipedia will have to suffice. As a short summary, the tree structure in Huffman coding looks a lot like a genealogy tree, but is built backwards, from the children to the ancestors.

Huffman coding is built upon the idea that the more frequent a symbol is in a text, the less space it should take up. So the first step to take when compressing a file is to find out the frequency of each byte.

def getFrequencies(self, all_bytes):
    byte_set = set(all_bytes)

    bytes_frequency_dict = {b:0 for b in byte_set}

    for b in all_bytes:
        bytes_frequency_dict[b] = bytes_frequency_dict[b] + 1

    return sorted([item for item in bytes_frequency_dict.items()], key=lambda item:item[1])

This is part of the BinaryCompressor class. It takes a list of bytes, from the file, creates a set from them and turns that into a list of byte values and their respective frequencies.

It is, in fact, possible to skip this step if you use a predefined mapping of byte to frequency. You might choose a predefined set of frequencies if you have a limited domain, for (a silly) example, if you are compressing an English scrabble board your frequencies are ('E',12), ('A',9), ('I',9), ('O',8), ('N',6), ('R',6), ('T',6), ('L',4), ('S',4), ('U',4), ('D',4), ('G',3), ('B',2), ('C',2), ('M',2), ('P',2),('F',2), ('H',2), ('V',2), ('W',2), ('Y',2),('K',1),('J',1), ('X',1), ('Q',1), ('Z',1). The benefit to a predefined mapping is that you don't have to store the mapping in the compressed file, it becomes hardcoded into the algorithm. However, I wanted the algorithm to apply to a broad range of files, so it generates frequencies on a per file basis.

Now we have the frequencies, what do we do with them? If you read the last post, or dug around in the git log of the repository, you might know that I had a dalliance with frequencies before, with text compression, however I couldn't think of how to encode the data in a decompressible form. This is what the tree will be for.

def buildTree(bytes_frequency):
    tree = [Leaf(bf, bf[1]) for bf in bytes_frequency]

    leaves = []

    while len(tree) > 1:
        left, right = tree[:2]
        if type(left) is Leaf:
            leaves.append(left)
        if type(right) is Leaf:
            leaves.append(right)
        tree = tree[2:]
        node = Node(left, right, left.value + right.value)
        tree.append(node)
        tree = sorted(tree, key=lambda node: node.value)

    return leaves, tree[0]

This is basically the algorithm as described in the Wikipedia article. The tree starts as a collection of leaves sorted by (ascending) frequency. The first pair of leaves are removed from the list and joined in a Node, whose value becomes the sum of the values of the leaves. The new node is then added into the list and the list is sorted. This is repeated until there is only one node left in the list. This is the root node. As well as the root node the function returns a list of the leaves. This is to make it easier to create the byte → code word map.

The Node and Leaf classes have a code member, this stores the code word for the Leaf or the progress of the code word in the Node. The code word for a byte is built backwards, by updating the code of each Node in the branch of the tree with the specific Leaf at the end when the Node is made. Left nodes update with '1' and right ones update with '0'.

class Leaf():
    def __init__(self, ...):
        ...
        self.code = ''


    def update_code(self, update):
        self.code = update + self.code


class Node()
    def __init__(self, left, right, value):
        self.value = value
        self.left = left
        self.right = right
        self.code = ''

        self.left.update_code(LEFT)
        self.right.update_code(RIGHT)

    def update_code(self, update):
        self.code = update + code
        self.left.update_code(update)
        self.right.update_code(update)

So now the tree is built and the code words are established. Encoding (i.e. compression) is a case of replacing the bytes with their codewords.

symbol_map = {leaf.data[0]:leaf.code for leaf in leaves}

output_bits = '1'
for b in all_bytes:
    output_bits = output_bits + symbol_map[b]

The bits are initialised with '1' in case the calculated bits start with a 0, so that no information is lost when it's converted to an integer. From here the bits are turned into an integer.

byte_count = (len(output_bits)+7) // 8

output_int = int(output_bits, 2)

output_bytes = output_int.to_bytes(byte_count, sys.byteorder)

Before writing these bytes to a file, a header is created that stores the set of bytes and their frequencies. I'm allowing the maximum size of the frequency to be variable, in case I'm compressing a very big file. So the first 2 bytes are the number of unique bytes, then the next 8 are the size of the frequency, the rest are each byte/frequency pair.

max_count_bytes = ceil(leaves[-1].data[1].bit_length()/8)

header_bytes = len(leaves).to_bytes(2, sys.byteorder)
header_bytes += max_count_bytes.to_bytes(8, sys.byteorder)

for leaf in leaves:
    header_bytes += (leaf.data[0].to_bytes(1, sys.byteorder))
    header_bytes += leaf.data[1].to_bytes(max_count_bytes, sys.byteorder)

with open(self.outputname, 'wb') as out_file:
    out_file.write(header_bytes)
    out_file.write(output_bytes)

That's the compression done.

Decompression starts similarly, by building the tree. But this time the frequencies are read from the file rather than calculated.

byte_frequencies = []
input_bytes = None
with open(self.filename, 'rb') as input_file:
    leaves_count = int.from_bytes(input_file.read(2), sys.byteorder)
    max_count_bytes = int.from_bytes(input_file.read(8), sys.byteorder)
    while leaves_count > 0:
        b = input_file.read(1)
        c = int.from_bytes(input_file.read(max_count_bytes), sys.byteorder)
        byte_frequencies.append((b,c))
        leaves_count -= 1
    input_bytes = input_file.read()
_, tree = buildTree(byte_frequencies)

Next the input bytes are turned back into a large integer, remembering to cut off the first bit we added for safety.

input_bits = bin(int.from_bytes(input_bytes, sys.byteorder))[3:]

Finally the original file is recreated by repeatedly traversing the tree and accumulating the bytes as they are found. The tree is traversed by starting at the root and if the current bit is '1', taking the left branch, or if the bit is '0' taking the right. When a Leaf is found, grab the byte value from that and add it to the list, then reset the search to the root.

Once all the bits are read, write out the file.

output_bytes = b''
current_node = tree
for bit in input_bits:
    if bit == LEFT:
        current_node = current_node.left
    else:
        current_node = current_node.right

    if type(current_node) is Leaf:
        output_bytes += current_node.data[0]
        current_node = tree

with open(self.outputname, 'wb') as output_file:
    output_file.write(output_bytes)

Compared to the algorithm I came up with, this algorithm works noticably better. For example, compressing compressor.py:

Compression type	File size (bytes)
No compression	3454
My algorithm	2854
Huffman coding	2095

While I was frustrated at the beginning, trying to understand the algorithm, in the end I had a good feeling of satisfaction when it worked. There are things I'd change in the code, to clean it up, some consistency in naming stuff, and deleting some unused things. I'd also like to change the code to work by streaming the files in, to be able to work with arbitrarily sized files, not limited to the amount of RAM available.

Thanks for reading!