DEV Community

loading...

did you ever ask yourself how file compression works ? the maths behind greedy algos

khalil snoussi
・4 min read

A little bit of background:

Yes, the foundation pillar of computer file compression is -spoiler alert- Maths, for images and videos which are basically just sequence of frames, it is slightly different because they support Lossy Compression techniques.
there are 2 kinds of compression techniques, Lossy and Lossless.

Lossy techniques as the name itself explains it, accept losing some of the original data after compression because we can still see and understand what happens in a video or a picture, these techniques cannot be applied to text because the meaning of the message will be lost, the receiver of your text won't understand the joke you wanted to tell him in the message for example.

Alt Text

Losseless techniques don't lose any valuable data after compression they are best suited for data that don't support losing some of its original information before compression like texts.

Lossless techniques for texts came to life because computers in the early 80's like Micro BBC (8 bits) had only a storage of max 800kb .. so can we please don't use a tremendous amount of bits ? as each character in the ascii table is coded in 8 bits, (1 bit = 0 or 1) so a text of 10 000 letters would weight 80 000 bits = 80 kb which is 10% of our storage disk. Our storage disk can only store a max size of 10 tiny text files ... do you feel the pain ?

However, a Mathematician called David Huffman found a way of compressing files which is called -spoiler alert- Huffman Coding, huffman coding is an algorithm of file compression which instead of representing each character (symbole in his research paper, 1995) in 8 bits, finds a way for representing each symbole in 1 bit 2 bits ,3 bits or even at max 8 bits or even 9 bits but those higher bits are for the least frequent symbols in the text so they don't really take that much space. the most frequent symbols in texts which are basically either 'e' or empty space ' ' are represented in 1 bit only.
Also, there is another problem in the normal 8 bits representation of each symbol in the ascii table, let's suppose 'a' representation is 0100001 and 'b' is 01000010
so 'ab' is 010000101000010 and this is unreadable the computer wouldn't know who is who in the binary sequence.


so much trouble ... jesus.

Alt Text

Algorithm:

let's first suppose we have an imaginary text right? and that we have a program that calculates the frequency of each character in the text and returns a dictionnary (Map) of each character and its frequency. -you might want to write this program for learning purpose-. and this program returns this.
inputs:
inputs = (a1,a2, ...,ap) p characters
weights= (w1,w2,...,wp) their p respective weights in the text
let's suppose we have inputs = ('a','b','c','d') and w =(10,15,30,16)

inputs = {'d':16,'c':30,'a':10,'b':15}

huffman algorithm outputs p respective codes, these are the new light representations of our characters.

output = (c1,c2,...,cp).

the algorithm finds output = (c1,c2,...,cp). by sorting the inputs based on their weights and by generating a sorted binary tree like the following:

Alt Text

and finally we generate a code table by traveling through the tree to find each character, if we go left you add 0, right you add 1 to your code until you find the character.

code = ""
if (root.left):
code = code + "0"
else:
code = code + "1"

char frequency code size
a 10 110 3 bits
b 15 111 3 bits
c 30 0 1 bit
d 16 10 2 bits

As i told you earlier the most frequent characters are encoded in 1 bit in this example 'c' is 0 which makes it a genius idea, and the least frequent characters are encoded in bigger number of bits ('a': 110). the compression effect here won't be very efficient because the file itself is very tiny it contains only 10+15+30+16 = 71 characters so before the compression the file would weight 71*8 = 568 bits, arround 0.55 kilobits, after the compression the file size is :
10*3+15*3+30*1+16*2 = 137 bits which means the file lost
568 - 137 = 431 bits arround 76% of its original size without even losing data worth it don't you think ?

here ends my first ever article on the internet, follow me on twitter for more, next time we will do the same to pictures and videos which is a little bit more complicated but fun ! that's why we are passionate about Coding, Also i will share my python implementation of this algorithm here, and on
Github repo

aslo, let's get connected and share more knowledge on:

Twitter
or:
Linkedin

def sort_dict(inputs):
    frequencies_list = list(inputs.keys())
    frequencies_list.sort()
    temp = []
    for  item in frequencies_list:
        temp.append(inputs[item])
    new_dict = dict(zip(frequencies_list,temp))
    return new_dict



def huffmanTree(inputs):
    temp = sort_dict(inputs)#temporary sorted dict
    condition_arg =  list(temp.keys())
    condition = len(condition_arg) > 1
    while condition:
        temp = sort_dict(temp)
        char = list(temp.values())
        freq = list(temp.keys())
        if len(char) > 1 and len(freq) > 1:
            freq0 = freq.pop(0)
            freq1 = freq.pop(0)
            value0 = char.pop(0)
            value1 = char.pop(0)
        else:
            break
        temp = {freq0+freq1:(value0,value1)}
        remain_temp = dict(zip(freq,char)) 
        temp.update(remain_temp)
    return temp


def huffman_encode(char,inputs):
    temp = huffmanTree(inputs)
    temp = list(temp.values())
    temp = temp[0]
    print(temp)
    search(temp,char,"")

def search(root,char,code):
    code = code
    if(len(root)>1):
        search(root[0],char,code+"0")
        search(root[1],char,code+"1")
    else:
        if (root == char):
            print(code)





inputs1 = {12:"c",16:"e",9:"b",13:"d",5:"a",16:"e",45:"f"}
input2 = {10:'a',15:'b',30:'c',16:'d',29:'e'}

huffman_encode('a',inputs1) #should print 010

Discussion (0)