Matt Ellen
The basic basics of compression

For a long time I've been interested in learning about information theory. I even have books on the subject that I've tried to read, but as soon as they step out of the analogies and into the maths I get lost.

Nevertheless, I have managed to understand one small part of it: standard ways of storing certain types of information carry overhead (e.g. every ASCII character takes up 1 byte), and if your document doesn't need all of that overhead (e.g. you only use the 'a' to 'z' ASCII characters), you can remap the information to have a smaller footprint (e.g. each character could take up as little as ¼ of a byte).
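To make that concrete, here is a small illustration (my own example, not code from the article): with only the 26 lowercase letters in use, each character can be identified with 5 bits instead of the 8 bits ASCII spends on it.

```python
# Hypothetical illustration: 26 distinct symbols need only 5 bits each,
# whereas ASCII spends 8 bits on every character.
alphabet = 'abcdefghijklmnopqrstuvwxyz'

bits_per_char = len(alphabet).bit_length()  # 26 -> 5 bits
print(bits_per_char)                        # 5

text = 'hello'
print(len(text) * 8)              # 40 bits stored as ASCII
print(len(text) * bits_per_char)  # 25 bits when remapped
```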

In order to prove to myself that I understand this idea, I have written a little script to compress text files in Python 3:

GitHub: Mellen / file_compression

A python 3 programme for compressing files

File Compression

An exploratory look at how to compress files.

tc.py uses my basic understanding of how lossless compression should work (see https://dev.to/mellen/the-basic-basics-of-compression-lap for how I came up with it).

bc.py is an implementation of Huffman coding that works for any type of file. It compresses a lot better than my basic implementation.

(Requires Python 3)

tc.py is a text file compression programme

 usage: tc.py [-h] [-d] file name

 positional arguments:  
    file name   The name of the file to be processed.

 optional arguments:  
    -h, --help  show this help message and exit  
    -d          Decompress the file (defaults to compressing).

bc.py is a compression programme for any type of file

 usage: bc.py [-h] [-d] file name

 positional arguments:  
    file name   The name of the file to be processed.

 optional arguments:  
    -h, --help  show this help message and exit  
    -d          Decompress the file (defaults to compressing).



It certainly doesn't achieve ¼-of-a-byte compression. In fact the saving is barely noticeable, but it is there. Let's take a look at how it works (compressor.py):

First we have to figure out the unique characters in the document:

with open(self.filename, 'r') as textfile:
    text = textfile.read()

chars = ''.join(set(text))

The inbuilt set data type makes this very easy. We join the set back into a string because a set has no fixed order; the string pins the characters down, giving each one a stable index.

From this we can figure out how many bits we need (the bit length) to represent each character:

bl = len(chars).bit_length()

After this I fairly arbitrarily decided to compress the characters into a list of ~64 bit numbers. In hindsight, Python would have let me fit all the characters into a single integer of arbitrary size, so I didn't need to make a list, but that's what I've done:

max_inserts = 64//bl
current_insert = 0
value = ''
values = []
for c in text:
    n = chars.index(c) + 1
    b = (bin(n)[2:]).zfill(bl)
    value += b
    current_insert += 1
    if current_insert == max_inserts:
        values.append(int(value, 2))
        current_insert = 0
        value = ''

if current_insert < max_inserts and current_insert > 0:
    values.append(int(value, 2))

That's a dense chunk of code, so let's break it down. First we set some initial values: max_inserts tells us how many characters we can fit in a 64 bit integer; current_insert is how many characters we've inserted into the integer so far; value is the integer, represented as a string of 1s and 0s; values is the list of integers.

Next we go through the text of the file one character at a time. First we get the index of the character in the list of unique characters. Then we turn that number into its binary representation (zero filling to the bit length).

The binary representations all need to be the same length so that we can extract the data from the file later on. If the values were different lengths, there'd need to be a way to separate them, which would mean padding them anyway, or some more complex way of putting them together, and I couldn't think of one. In fact, if you look through the git log of the repo, you'll see I tried different things until I got here.

Once the index is in binary, we append it to the current integer. Then we check whether the current integer is full. If it is, we add it to the list of integers and reset for the next one. Right at the end we add the last integer we were working on, in case the loop finished before it was full.
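To make the loop concrete, here is a tiny worked trace (my own example values, not from the repo), assuming the document contains only the characters 'abc':

```python
# Worked trace of the packing loop, assuming chars = 'abc' so bl = 2.
chars = 'abc'
bl = len(chars).bit_length()   # 3 unique characters -> 2 bits each

value = ''
for c in 'cab':
    n = chars.index(c) + 1          # c -> 3, a -> 1, b -> 2
    value += bin(n)[2:].zfill(bl)   # '11', '01', '10'

print(value)          # '110110'
print(int(value, 2))  # 54 -- three characters packed into one integer
```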

That's it. Decompression is basically the same in reverse.
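As a sketch of what "the same in reverse" looks like, here is my own reconstruction (not the repo's actual code; the function name and signature are made up):

```python
# Sketch of decompression: values is the list of packed integers,
# chars is the unique-character string, bl is the bits per character.
def unpack(values, chars, bl):
    text = ''
    for value in values:
        bits = bin(value)[2:]
        # int() drops leading zero bits; pad back to a multiple of bl.
        # This is safe because indices are stored +1, so no character's
        # chunk is ever all zeros.
        bits = bits.zfill(-(-len(bits) // bl) * bl)
        for i in range(0, len(bits), bl):
            n = int(bits[i:i + bl], 2)
            text += chars[n - 1]    # undo the +1 from compression
    return text

print(unpack([54], 'abc', 2))  # 'cab'
```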

I know there is a lot more to compression than this, I just don't understand it (yet?). I do want to improve this to work with binary files. I don't think that should be too difficult. After that I think the next step should be to figure out how to compress files based on how often a pattern appears. The idea being that the character or group of bytes that appears most often should be assigned the smallest value, and so on, but I haven't figured out how to store that information in a compressed way.
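Counting how often each symbol appears is the easy half of that idea; the hard part is storing the variable-length codes compactly. A minimal frequency count might look like this (illustrative only, not part of the repo):

```python
# Illustrative sketch: count how often each character appears, so the
# most frequent ones could be given the shortest codes (the core idea
# behind frequency-based schemes such as Huffman coding).
from collections import Counter

text = 'abracadabra'
freqs = Counter(text)
print(freqs.most_common(1))  # [('a', 5)] -- 'a' is the most common
```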

Discussion (3)

raja sudhan

Nice article! Look into Huffman encoding; it does a similar thing for binary files, and the algorithm was invented while Huffman was a student.

Matt Ellen (Author)

Thanks! I will.

Matheus Felipe

In the past few days the same interest in file compression came to me. It's really awesome.

Thanks for sharing this Matt!