How to implement a simple lossless compression in C++

#cpp #compression

Compression algorithms are one of the most important computer science discoveries. It enables us to save data using less space and transfer it faster. Moreover, compression techniques are so enhanced that even lossy compressions give us an unnoticeable loss of the data being managed. Nevertheless, we are not going to talk about lossy compression algorithms, but loss-less algorithms, in particular, a very famous one called Huffman Encoding. You may know it because it is used in the JPEG image compression. In this post we will discover the magic behind this compression algorithm, we will go step by step until we end up designing a very simple implementation in C++.

Prefix property

Huffman encoding is a code system based on the prefix property. To encode a text we can move each distinct character of the text into a set, thus we will have a set of characters. To compress each symbol we need a function that is able to convert a character into code (e.g. a binary string). Given a set of symbols Σ we can define a function ϕ: Σ → {0,1}+ that maps each symbol into a code. The symbols in Σ contain the set of distinct characters in the text that needs to be compressed. The most simple prefix encoding would be to assign each letter a binary number, which is a simple ASCII to binary integer conversion. This is a very simple encoding since it is a function that maps a character to itself, but it surely does not compress at all. Prefix codes are very easy to decode, they only need to be read (left-to-right), this ensures us a decompression runtime complexity of O(n). A common way to represent this type of encoding is in a binary tree called the prefix tree.
For example, let's suppose we have the following set and encoding scheme.

Symbols: Σ = {A, B, C D}
Encoding: ϕ(A) = 1, ϕ(B)=01, ϕ(C)=000, ϕ(D)=001 then we can represent it using this

As we can see in the tree, to decode/encode a text (e.g. 00010010….) we must traverse the tree until we find a leaf (where the character is found). If the current prefix is a 0 we must go left and if it is a 1 we must go right. That simple!

After creating the tree it is easier to save the equivalencies (code — character) in a simple table.

A prefix tree has the following properties:

One leaf per symbol
Left edge labeled 0 and right edge labeled 1
Labels on the path from the root to a leaf specify the code for that leaf.

Encoding

Okay, so what do these strange prefix trees have to do we Huffman trees? Well, it turns out Huffman trees are prefix trees, but not just simple prefix trees, they represent the optimal prefix trees. Given a text, an optimal prefix code is a prefix code that minimizes the total number of bits needed to encode that text, in other words, it is the encoding that makes the text smaller (fewer bits = more compression). Note that if you are using a Huffman tree to compress data you should also save the tree in which it was encoded.

Now, how do we find this optimal tree? Well, we need to follow the following steps.

Find the frequencies of each character and save them in a table
For each character, we create a prefix tree consisting of only the leaf node. This node should contain the value of the character and its frequency in the text.
We should have a list of trees now, one per character. Next, we are going to select the two smallest trees, we consider a tree to be smaller to another one if its frequency is lower (in case of a tie we select the one with fewer nodes), and we are going to merge them into one; that is one of the two should become the left subtree and one the right subtree, afterward, a new parent node is created.

Well, that's it, after joining every tree you should be left with only one. If you were paying attention you must have noticed that I didn’t specify how to select the smaller tree from the list of all the trees. That is because it depends on the implementation. The fast way to do it is saving the trees in a MinHeap (priority queue in C++), each insertion and deletion in the heap has an O(log n) complexity but the lookup is always constant. Thus the total complexity of the encoding algorithm is O(n log n) because we must insert a new tree n times.

Implementation

The Huffman compression algorithm is a greedy algorithm, that is it always tries to make the optimal choice in a local space, to implement we can create a class called HuffmanTree.

class HuffmanTree{
public:

    HuffmanTree(char v, int w);

    HuffmanTree(HuffmanTree const& tree);

    HuffmanTree(HuffmanTree const& h1, HuffmanTree const& h2);

    bool operator<(HuffmanTree const& other) const;

private:

    // represents a value that will never be read;
    static const int NULL_VALUE = -1;

    // left subtree
    std::shared_ptr<HuffmanTree> left;

    // right subtree
    std::shared_ptr<HuffmanTree> right;

    char value; // character, null if !isLeaf 
    int weight; // aka. frequency 
    int size; // aka. number of nodes 
    bool isLeaf;
};

A HuffmanTree will contain, as we said before, the value (character), its weight (frequency), and the size (number of nodes). Finally, it also has a pointer to the left subtree and the right subtree, we used a shared pointer to promote modern C++ smart pointers and avoid worrying about memory leaks.

You may be wondering why would we want to implement three different constructors? Well, the first one creates a new tree with a given value and weight.

HuffmanTree::HuffmanTree(char v, int w){
    value = v;
    left = nullptr;
    right = nullptr;
    weight = w;
    size = 1;
    isLeaf = true;
}

The second constructor is just a copy constructor, that creates a new one based on the old one.

HuffmanTree::HuffmanTree(HuffmanTree const& tree){
    value = tree.value;
    left = tree.left;
    right = tree.right;
    weight = tree.weight;
    size = tree.size;
    isLeaf = tree.isLeaf;
}

Finally, we need a constructor that merges two different trees.

HuffmanTree::HuffmanTree(HuffmanTree const& h1, HuffmanTree const& h2) {
    left = std::make_shared<HuffmanTree>(h1);
    right =  std::make_shared<HuffmanTree>(h2);
    size = left->size  + right->size;
    weight = left->weight + right->weight;
    isLeaf = false;
    value = NULL_VALUE;
}

The HuffmanTree class has overloaded a comparison operator, but if you were paying attention, it should be self-explanatory.

bool HuffmanTree::operator<(HuffmanTree const& other) const{
    if(weight != other.weight)return weight < other.weight;
    else return size < other.size;
}

Finally, we need to make the core of the algorithm, as you can see we first create a HuffmanTree per character, then we merge trees until we are only left with one.

    ...

    std::priority_queue<HuffmanTree> minHeap;

    for(auto const& letter : table){
        minHeap.push(HuffmanTree(letter.first, letter.second)); // first == char, second == frequency 
    }

    // join trees
    while(minHeap.size() > 1){
        HuffmanTree min1 = minHeap.top();
        minHeap.pop();
        HuffmanTree min2 = minHeap.top();
        minHeap.pop();
        minHeap.push(HuffmanTree(min1, min2));
    }