Binoy Vijayan

Data compression (Part 1): Lossless Compression

Data compression is the process of reducing the size of files or data streams to save storage space or decrease transmission time over a network. There are two main types of data compression: lossless compression and lossy compression.

Lossless Compression:

Lossless compression reduces the size of data without losing any information, ensuring that the original data can be perfectly reconstructed from the compressed version.
Examples of Lossless Compression Algorithms:

Run-Length Encoding (RLE)

Replaces sequences of the same data value with a single value and a count. For example, the string ‘AAAABBBCC’ can be stored as ‘4A3B2C’.

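A minimal RLE sketch in Swift; the (character, count) pair representation is chosen here for readability rather than compactness:

```swift
// Minimal run-length encoding sketch: runs are kept as (character, count)
// pairs rather than packed bytes, purely for readability.

func rleEncode(_ input: String) -> [(Character, Int)] {
    var runs: [(Character, Int)] = []
    for ch in input {
        if let last = runs.last, last.0 == ch {
            runs[runs.count - 1].1 += 1   // extend the current run
        } else {
            runs.append((ch, 1))          // start a new run
        }
    }
    return runs
}

func rleDecode(_ runs: [(Character, Int)]) -> String {
    runs.map { String(repeating: String($0.0), count: $0.1) }.joined()
}

let runs = rleEncode("AAAABBBCC")
print(runs)             // [("A", 4), ("B", 3), ("C", 2)]
print(rleDecode(runs))  // AAAABBBCC
```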

Huffman Coding

Assigns variable-length codes to input characters, with shorter codes for more frequent characters.

Here is a high-level explanation of the algorithm:

  • Calculate the frequency of each character in the string.
  • Sort the characters by increasing frequency and store them in a priority queue.
  • Create the Huffman tree (a binary tree with minimum external path weight).
  • Mark each left edge in the tree with ‘0’ and each right edge with ‘1’.
  • Extract the codes from the tree.
  • Encode the data with the new codes.
  • Include the tree in the compressed file or stream, since it is needed for decoding.

[Figure: Huffman tree for the string ‘BCCABBDDAECCBBAEDDCC’]

Below is a minimal Swift sketch of Huffman encoding and decoding. It is illustrative rather than a reference implementation: a repeatedly sorted array stands in for a real priority queue, and bits are kept as a string of ‘0’/‘1’ characters instead of packed bits.
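
```swift
// Minimal Huffman coding sketch. A repeatedly sorted array stands in for a
// real priority queue, and the output is a string of "0"/"1" characters.

final class Node {
    let freq: Int
    let char: Character?            // nil for internal nodes
    let left: Node?
    let right: Node?
    init(freq: Int, char: Character? = nil, left: Node? = nil, right: Node? = nil) {
        self.freq = freq; self.char = char; self.left = left; self.right = right
    }
}

func buildTree(_ text: String) -> Node {
    let freqs = text.reduce(into: [Character: Int]()) { $0[$1, default: 0] += 1 }
    var queue = freqs.map { Node(freq: $0.value, char: $0.key) }
    while queue.count > 1 {
        queue.sort { $0.freq < $1.freq }          // smallest frequencies first
        let a = queue.removeFirst()
        let b = queue.removeFirst()
        queue.append(Node(freq: a.freq + b.freq, left: a, right: b))
    }
    return queue[0]
}

func fillCodeTable(_ node: Node, prefix: String, into table: inout [Character: String]) {
    if let c = node.char {
        table[c] = prefix.isEmpty ? "0" : prefix  // handle single-symbol input
        return
    }
    if let l = node.left  { fillCodeTable(l, prefix: prefix + "0", into: &table) }
    if let r = node.right { fillCodeTable(r, prefix: prefix + "1", into: &table) }
}

let text = "BCCABBDDAECCBBAEDDCC"
let root = buildTree(text)
var table: [Character: String] = [:]
fillCodeTable(root, prefix: "", into: &table)
let encoded = text.map { table[$0]! }.joined()

// Decoding walks the tree bit by bit, which is why the tree (or an
// equivalent table) must travel with the compressed data.
var decoded = ""
var node = root
for bit in encoded {
    node = (bit == "0" ? node.left : node.right)!
    if let c = node.char { decoded.append(c); node = root }
}
assert(decoded == text)
print(table)
print("\(encoded.count) bits vs \(text.count * 8) bits uncompressed")  // 45 vs 160
```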

Lempel-Ziv Compression (LZ77, LZ78 and LZW)

Lempel-Ziv (LZ) compression is a family of lossless data compression algorithms that includes LZ77, LZ78 and LZW among others. LZ compression algorithms work by replacing repeated occurrences of data with references to a single copy. This family of algorithms is widely used and serves as the foundation for many popular compression formats and utilities.

LZ77

It was developed by Abraham Lempel and Jacob Ziv in 1977. The main idea behind LZ77 is to replace repeated occurrences of data with references to a single copy, thereby reducing the overall amount of data that needs to be stored or transmitted.
Here is a high-level explanation of the algorithm.

LZ77 compression encodes the data as a sequence of triplets.

Each triplet consists of:

Offset: This is an integer representing the distance back in the uncompressed data where a matching substring begins. It indicates how far back in the input stream the repeated sequence starts.

Length: This is an integer representing the length of the matching substring.

Next symbol: This is the next symbol (or character) in the input data that follows the matching substring.

The encoder examines the input sequence through a sliding window and generates the triplets.

The window consists of two parts:

A search buffer that contains a portion of the recently encoded sequence, and

A look-ahead buffer that contains the next portion of the sequence to be encoded.

[Figure: sliding window with an eight-symbol search buffer (blue) and a nine-symbol look-ahead buffer (red)]

In practice, the sizes of both buffers are significantly larger.
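
A minimal LZ77 encoder sketch in Swift; the brute-force match search and the tiny default window size are simplifications for readability:

```swift
// Minimal LZ77 encoder sketch producing (offset, length, next) triplets.

struct Triplet {
    let offset: Int       // distance back to the start of the match (0 = none)
    let length: Int       // length of the match
    let next: Character?  // literal symbol following the match (nil at end)
}

func lz77Encode(_ input: String, windowSize: Int = 8) -> [Triplet] {
    let chars = Array(input)
    var out: [Triplet] = []
    var pos = 0
    while pos < chars.count {
        var best = (offset: 0, length: 0)
        // Search the window for the longest match with the look-ahead buffer.
        for start in max(0, pos - windowSize)..<pos {
            var len = 0
            while pos + len < chars.count, chars[start + len] == chars[pos + len] {
                len += 1   // matches may run past `pos` (overlapping copies)
            }
            if len > best.length { best = (pos - start, len) }
        }
        let nextIndex = pos + best.length
        let next: Character? = nextIndex < chars.count ? chars[nextIndex] : nil
        out.append(Triplet(offset: best.offset, length: best.length, next: next))
        pos = nextIndex + 1
    }
    return out
}

for t in lz77Encode("ABABABA") {
    print("(\(t.offset), \(t.length), \(t.next.map { String($0) } ?? "–"))")
}
// (0, 0, A)  (0, 0, B)  (2, 5, –)
```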

LZ78

LZ78 is another data compression algorithm developed by Abraham Lempel and Jacob Ziv, following their earlier work on LZ77. LZ78 was introduced in 1978 as an improvement over LZ77.

The LZ78 compression process involves parsing the input data and building a dictionary of phrases encountered. When a new phrase is encountered, it is added to the dictionary, and its code is output. The process continues until the entire input data is processed.
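
As a sketch of that process in Swift (output pairs are (dictionary index, next symbol), with index 0 standing for the empty prefix):

```swift
// Minimal LZ78 encoder sketch. Output is (index, symbol) pairs, where
// `index` points at a previously seen phrase (0 = empty prefix) and
// `symbol` is the character that extends it (nil for a trailing phrase).

func lz78Encode(_ input: String) -> [(index: Int, symbol: Character?)] {
    var dict: [String: Int] = [:]   // phrase -> 1-based dictionary index
    var out: [(index: Int, symbol: Character?)] = []
    var phrase = ""
    for ch in input {
        let candidate = phrase + String(ch)
        if dict[candidate] != nil {
            phrase = candidate                    // keep extending a known phrase
        } else {
            out.append((dict[phrase] ?? 0, ch))   // emit (prefix index, new symbol)
            dict[candidate] = dict.count + 1      // register the new phrase
            phrase = ""
        }
    }
    if !phrase.isEmpty {
        out.append((dict[phrase]!, nil))          // flush a trailing known phrase
    }
    return out
}

print(lz78Encode("ABABABA"))
// [(index: 0, symbol: "A"), (index: 0, symbol: "B"),
//  (index: 1, symbol: "B"), (index: 3, symbol: "A")]
```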

LZ78 is known for its simplicity and its ability to achieve good compression ratios on certain types of data. However, it may not always perform as well in practice as other compression algorithms, such as LZ77 or derivatives like DEFLATE. Despite this, LZ78 has influenced the development of subsequent compression algorithms and remains an important part of the history of data compression techniques.


LZW (Lempel-Ziv-Welch)

LZW (Lempel-Ziv-Welch) is a widely used compression algorithm that was developed by Abraham Lempel, Jacob Ziv, and Terry Welch. It is a dictionary-based compression algorithm that works by replacing repeated sequences of characters with shorter codes. LZW is commonly used in various file compression formats, such as GIF and TIFF.

LZW is known for its simplicity and effectiveness in compressing data with repetitive patterns. It is especially efficient for compressing text files and images with repeated patterns.

The basic idea behind LZW compression is to replace strings of characters with single code symbols. The algorithm does not do any analysis of the incoming text. It adds every new string of characters it recognises to a table of strings. Compression is performed when a single code is transmitted instead of a string of characters. It is not necessary to preserve the encoder’s dictionary in order to decode the compressed data stream.

LZW compresses text according to the following flow chart:

[Flow chart: LZW compression]
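
As a sketch of the same flow in Swift (restricted to single-byte characters so that the dictionary can be seeded with the 256 single-character strings):

```swift
// Minimal LZW compression sketch. The dictionary starts with the 256
// single-byte strings; every new phrase gets the next free code.

func lzwCompress(_ input: String) -> [Int] {
    var dict: [String: Int] = [:]
    for i in 0..<256 { dict[String(UnicodeScalar(UInt8(i)))] = i }
    var out: [Int] = []
    var current = ""
    for ch in input {
        let candidate = current + String(ch)
        if dict[candidate] != nil {
            current = candidate           // keep extending the current phrase
        } else {
            out.append(dict[current]!)    // emit the code of the known prefix
            dict[candidate] = dict.count  // register the new phrase
            current = String(ch)
        }
    }
    if !current.isEmpty { out.append(dict[current]!) }
    return out
}

let codes = lzwCompress("TOBEORNOTTOBEORTOBEORNOT")
print(codes)
// [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263]
```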

The decompression algorithm builds its own dictionary directly from the input symbol stream, so there is no need to transmit any translation table along with the encoded data. Like the compression algorithm, it adds a new string to its dictionary whenever it reads a new code, translates each incoming code into a string of characters, and sends that string to the output. Note that the first 256 codes are reserved for single-character strings.

LZW decompresses text as specified in the following flow chart:

[Flow chart: LZW decompression]
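
And a matching decompressor sketch, paired with lzwCompress above; the only subtlety is a code that refers to the very phrase being defined:

```swift
// Minimal LZW decompression sketch. The dictionary is rebuilt on the fly,
// so nothing but the code stream itself is needed.

func lzwDecompress(_ codes: [Int]) -> String {
    guard let first = codes.first else { return "" }
    var dict: [Int: String] = [:]
    for i in 0..<256 { dict[i] = String(UnicodeScalar(UInt8(i))) }
    var previous = dict[first]!
    var out = previous
    for code in codes.dropFirst() {
        let entry: String
        if let known = dict[code] {
            entry = known
        } else {
            // The code being defined right now: previous phrase plus its
            // own first character (the classic LZW corner case).
            entry = previous + String(previous.first!)
        }
        out += entry
        dict[dict.count] = previous + String(entry.first!)  // next free code
        previous = entry
    }
    return out
}

print(lzwDecompress(lzwCompress("TOBEORNOTTOBEORTOBEORNOT")))
// TOBEORNOTTOBEORTOBEORNOT
```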

Burrows-Wheeler Transform (BWT)

Burrows-Wheeler Transform (BWT) alone does not directly compress data; instead, it transforms the data to make it more amenable to compression by other techniques. One common approach is to use BWT in conjunction with other compression algorithms, such as Move-To-Front (MTF), Run-Length Encoding (RLE), and Huffman coding. The combination of these techniques can lead to effective compression.

Here's a high-level overview of how BWT compression is often performed:

Burrows-Wheeler Transform (BWT) - Apply the BWT to the input data, transforming it into a more compressible form by grouping similar characters together. The transform itself involves several steps:

Block Division - Divide the input data into blocks, often of a fixed size. The blocks are processed independently.

Cyclic Permutations - For each block, generate all possible cyclic permutations of the characters in that block. A cyclic permutation involves shifting the characters to the right and wrapping around.

For example, given the string "abc," the cyclic permutations would be "abc," "cab," and "bca."

Sort Permutations - Sort the cyclic permutations lexicographically (in dictionary order). This results in a matrix where each row corresponds to a different cyclic permutation, and the matrix is sorted based on these permutations.

Select Last Column - Extract the last column of the sorted matrix. This column represents the characters at the last position of each sorted cyclic permutation.

Burrows-Wheeler Transform - The Burrows-Wheeler Transform is the concatenation of the characters in the last column. This transformation tends to group similar characters together, making the data more compressible.
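
The steps above condense into a short sketch (this one materialises and sorts all rotations of the whole string; real implementations work block-wise and use suffix arrays, and the ‘$’ end-of-block sentinel is an assumption added here to keep the transform invertible):

```swift
// Minimal BWT sketch: build every cyclic rotation of the block, sort them,
// and keep the last column. "$" marks the end of the block so the original
// string can be recovered later.

func bwt(_ block: String) -> String {
    let s = Array(block + "$")
    let rotations = (0..<s.count).map { i in String(s[i...] + s[..<i]) }
    return String(rotations.sorted().map { $0.last! })
}

print(bwt("BANANA"))   // ANNB$AA — the Ns and the trailing As end up grouped
```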

Move-To-Front (MTF) - After the BWT, the transformed data is often subjected to the Move-To-Front algorithm. This algorithm maintains a list of symbols in a certain order; when a symbol is encountered, its current position in the list is output and the symbol is moved to the front of the list.

The MTF step takes advantage of the fact that in the BWT output, similar characters are often grouped together. This step further enhances the compressibility.
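
A minimal MTF sketch over the byte alphabet; applied to the BWT output from the previous sketch, the grouped characters collapse into small indices and zeros:

```swift
// Minimal Move-To-Front sketch: each symbol is replaced by its current
// position in the alphabet list, then moved to the front, so recently
// seen symbols get small indices.

func mtfEncode(_ input: String) -> [Int] {
    var alphabet = (0..<256).map { Character(UnicodeScalar(UInt8($0))) }
    return input.map { (ch: Character) -> Int in
        let index = alphabet.firstIndex(of: ch)!
        alphabet.remove(at: index)
        alphabet.insert(ch, at: 0)   // move the symbol to the front
        return index
    }
}

print(mtfEncode("ANNB$AA"))   // [65, 78, 0, 67, 39, 3, 0]
```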

Run-Length Encoding (RLE) - Run-Length Encoding is a simple compression technique that replaces sequences of repeated characters with a single occurrence of the character followed by the count of repetitions.

BWT-transformed data, especially after MTF, often contains long runs of identical characters, making it suitable for Run-Length Encoding.

Huffman Coding (or other entropy coding) - Apply a variable-length entropy coding technique like Huffman coding to further compress the data. Huffman coding assigns shorter codes to more frequently occurring symbols, reducing the overall length of the encoded data.
