Roughly a month ago I found myself puzzled at work as to why 7-ZIP in Windows could not recognize GZIP files that we compressed in python. We used zlib library, which claims to be "Compression compatible with GZIP." It seemed there was more than meets the eye.
After a few google searches that left me more confused about what was going on than I initially was I decided to just open a subprocess and use the GNU implementation of GZIP. The final code wasn't too long and compressed in a way that 7-ZIP as well as Snowflake were able to detect automatically. The implementation was quite succinct, so I didn't give it much thought.
def gnu_zip(data: bytes) -> bytes: p = subprocess.run(['gzip'], input=data, stdout=subprocess.PIPE) if p.stderr: logging.error(p.stderr) raise Exception('Error decompressing data') return p.stdout
Still, something bugged me. How did zlib and GZIP relate? Why did zlib claim to be compatible with GZIP when it was clearly not the same format? We also noticed that even though Snowflake was unable to detect the zlib compressed file, if we told it that it's a GZIPped file it was able to load it without issues. Clearly the claim of compatibility wasn't completely outlandish. After a few weeks I decided to dive in and investigate in depth. The following is a summary of what I learned.
I was surprised to find out that GZIP, zlib or even ZIP are not compression algorithms, they are actually file formats that can permit different compression algorithms. Even more surprising, virtually every implementation of those three actually use the same lossless data compression algorithm. This algorithm is called DEFLATE.
So are GZIP, zlib and ZIP actually the same? Not quite, and the final size of the compressed file can vary significantly between ZIP and the other two for reasons we will see below, but under the hood the actual compression is done in exactly the same way.
How does DEFLATE work? In short, it takes some input data as a stream consisting of a series of blocks of data, then uses a combination of LZ77 algorithm and Huffman coding on each block.
LZ77 identifies duplicate strings and replaces them with a back reference, which is a pointer to the place where it previously appeared, followed by the length of the string. This is done on the raw data blocks. For a more detailed explanation and example see Bonus section 1.
Huffman coding is known as bit reduction, and identifies the commonly used symbols and replaces them with symbols with shorter bit sequences. Infrequently used symbols will be represented with longer bit sequences. This is done on the LZ77 compressed blocks. For a more detailed explanation and example see Bonus section 2.
See the following link for a more in depth explanation of deflate: https://zlib.net/feldspar.html
Released in 1989 and written by Phil Katz, ZIP is the oldest of the compression formats discussed in this article. It is also unique in that it's an archive file format, meaning it can compress multiple files and entire directory structures.
ZIP applies DEFLATE compression separately to each file it stores and then keeps a central directory structure at the end. This means that it can provide random access to each file which can be read separately, but the final size is larger since the compression does not take advantage of redundancy across files.
Finally, it also includes a CRC-32 checksum for data integrity.
After some patent disputes with the unix compress utility, the GZIP format was developed in 1992 by Jean-loup Gailly and Mark Adler using a new implementation of DEFLATE that did not infringe on any patents.
Unlike ZIP, it is not an archive format. This means it can not compress several files or directories, it just compresses a single file or stream of data. That's why it's frequently combined with the tar utility which can create an archive of files, directories and their attributes in a single file which is then compressed with GZIP. This popular format is called tarball and its files end in .tar.gz. Tarballs do not provide access to the files contained, instead the whole file needs to be read and decompressed in memory before the directory structure can be shown.
It has a GZIP wrapper on the compressed data with the filename and other system information, and a CRC-32 checksum at the end to check the integrity of the data. On the other hand, the final size is usually smaller than zip since it does take advantage of redundancy across files.
import gzip from io import BytesIO def gzip_data(data: bytes) -> bytes: out = BytesIO() with gzip.GzipFile(fileobj=out, mode="wb") as f: f.write(data) return out.getvalue()
The authors of GZIP later extracted its DEFLATE implementation into a library named zlib so it could be reused by other formats, most notably PNG images. PNG images replaced the GIF format that was plagued with the same patent issues as unix compress. It is the most popular DEFLATE implementation and is used by many existing programs. Most HTTP servers use zlib to compress their data.
But that's not all, zlib has the option to use a GZIP wrapper on the compressed data or a lighter zlib wrapper. This means that apart from being a library, zlib can also be considered a compression format that has separate headers from other formats. This is the reason why our files were compatible with GZIP encoding but Snowflake couldn't auto detect them as GZIP, since they had zlib headers. This format is a light wrapper over raw deflate and does not contain a CRC checksum.
import zlib def zlib_data(data: bytes) -> bytes: return zlib.compress(data)
|Data||Stream / single file||Stream / single file||Archive files and directories|
You can check the file type with the following code:
HEADERS = ( ('zlib-no-compression', '7801'), ('zlib-default-compression', '789c'), ('zlib-best-compression', '78da'), ('gzip', '1f8b'), # 1f8b08 if it's using deflate (almost always) ('zip', '504b0304'), ) def compression_type(data: bytes) -> str: for compression, header in HEADERS: if data.hex().startswith(header): return compression
LZ77 is a lossless compression algorithm that replaces a sequence of symbols which had already appeared previously with a pointer to the place it last appeared and a number indicating the length of the sequence. The notation is where B is the pointer that indicates how many symbols ago the sequence appeared, and L is the sequence length.
LZ77 does not keep a dictionary of sequences (in contrast to LZ78), but instead uses a sliding window to search for them. This means that it only looks back inside the data up to a fixed distance (window). For a more detailed explanation click on the following link.
Huffman coding is also a lossless compression algorithm. The main idea is that symbols that appear more often should be encoded with less bits than symbols that appear little, resulting in a shorter file overall.
Starting with the less used symbols, a leaf node is created for each of them and a tree is created by by joining them together with a parent node whose value is the sum of their frequencies. This process is repeated and we keep joining the less frequent symbols or subtrees until we get a final tree of which the leaves are the symbols. To know how to encode a symbol we need to traverse the subtree where each left node represents a 0 and each right node represents a 1 until we reach the leaf node representing the symbol.
Some great stack overflow answers by Mark Adler, co-author of GZIP and Zlib.