# A Primer on Data Compression

### Andrew ・11 min read

## Likely States

Let's talk about statistical mechanics for a second. Suppose we have a system of 4 particles, labeled `a` through `d`, each of which can be in one of two states. We can call them the `0` state and the `1` state. A particle has one "unit" of energy when it's in the `1` state, and no energy when it's in the `0` state.

```
1-energy:   x   x
0-energy: x   x
particle: a b c d
```

If we only measure the overall energy of the system, we could find anywhere from `0` units of energy (when all particles are in the `0` state) to `4` units of energy (when all particles are in the `1` state).

```
1-energy:                    1-energy: x x x x
0-energy: x x x x     =>     0-energy:
particle: a b c d            particle: a b c d
--> total energy: 0          --> total energy: 4
```

Suppose we measure the energy of the system and we find `2` units of energy. This tells us that exactly two particles are in the `1` state, but it doesn't tell us *which* two particles are in that state. The overall energy of the system is known as a *macrostate*, while the energy of each individual particle (`a=0`, `b=1`, `c=0`...) is known as the *microstate*. Multiple microstates can map to the same macrostate; for instance...

```
abcd = 0101
1-energy:   x   x
0-energy: x   x
particle: a b c d
--> total energy: 2
```

and

```
abcd = 1100
1-energy: x x
0-energy:     x x
particle: a b c d
--> total energy: 2
```

...both have energy `2` (the same macrostate), but which particle is in the `0` or `1` state differs between the two microstates. In a system like this, with only two energy states, particle `a` can be in the `0` state or the `1` state (`2` possible states), and (independently) particle `b` can be in the `0` state or the `1` state (`2` possible states), and so on. These `2`s all multiply together and give us `2^N` possible microstates for a `2`-state, `N`-particle system, where the particles are distinguishable (as they are in our case -- they're labeled).
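Since the construction is just "every particle independently picks one of two states," the `2^N` count is easy to verify by brute force. A quick sketch (Python here, purely for illustration) that enumerates the microstates of our 4-particle system:

```python
from itertools import product

# Each of the 4 distinguishable particles is independently in the 0 or 1
# state, so the microstates are exactly the 4-tuples over {0, 1}.
microstates = list(product([0, 1], repeat=4))

print(len(microstates))  # 2^4 = 16
```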

If each particle has a 50/50 chance of being in the `0` state or being in the `1` state, then a `2`-energy system is the most probable macrostate. With a 50/50 `0`/`1` probability, each *microstate* is equally likely, and the most-probable, `2`-energy macrostate simply has the most microstates which map to it:

```
abcd -> total energy
0000 -> 0
0001 -> 1
0010 -> 1            x
0011 -> 2            x
0100 -> 1         x  x  x
0101 -> 2         x  x  x
0110 -> 2         x  x  x
0111 -> 3      x  x  x  x  x
1000 -> 1  =>  0  1  2  3  4
1001 -> 2
1010 -> 2         energy
1011 -> 3        histogram
1100 -> 2
1101 -> 3     2-energy state
1110 -> 3    occurs 6/16 times
1111 -> 4
```
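The enumeration above can be reproduced mechanically: map each microstate to its macrostate (the total energy, i.e. the number of `1`s) and tally. A small sketch, again in Python for illustration:

```python
from collections import Counter
from itertools import product

# Group the 16 microstates by their macrostate (total energy).
histogram = Counter(sum(state) for state in product([0, 1], repeat=4))

print(dict(histogram))  # {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
# The 2-energy macrostate covers 6 of the 16 microstates.
```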

While the `3`-energy macrostate means that a given particle has a 75% chance of being in the `1` state, the `2`-energy macrostate means that, on average, any given particle in the system is equally likely to be in the `0` state or the `1` state.

## So What?

Statistical mechanics lays the foundation for thermodynamics, which concerns itself with concepts like heat, pressure, and entropy -- the latter of which is directly related to theories of *information*. It seems strange that the laws of physics could have anything to do with information, which seems like a uniquely human concept, but entropy ties these two fields together.

In our four-particle example above, we can calculate the *entropy* of the system by simply applying the formula:

```
S = k * ln(H)
```

...where `H` is the number of microstates, `k` is the Boltzmann constant, and `ln()` is the natural (base-`e`) logarithm. We assume that the environment has sufficient energy to give to our particles such that any microstate is possible, and equally likely. The entropy of the four-particle, two-state system is then:

```
S = k * ln(16)
=~ 2.77*k
```
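As a quick numerical check (using the SI-exact value of the Boltzmann constant):

```python
import math

k = 1.380649e-23  # Boltzmann constant, J/K (exact by SI definition)

# Entropy of a system with 16 equally likely microstates: S = k * ln(16)
S = k * math.log(16)

print(S / k)  # ln(16) =~ 2.77
```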

This doesn't mean much until we compare it to the entropy of other systems. Entropy is, in a sense, related to information content. In physics, entropy is also related to how disordered a system is -- more disorder means more entropy. Let's look at another example. Suppose we cool the system down, removing all of the heat. In that case, there's not enough energy to bring any of the particles into the `1` state, and all are stuck at `0`. There is only one allowed microstate when the energy of the system is zero, and the entropy becomes:

```
S = k * ln(1)
= 0
```

The entropy is zero. The system is perfectly ordered; all particles are in the `0` state, and there is no chance that that will change.

## Bits of Entropy, Bits of Information

Let's move to bigger collections of particles. Let's say we have the entire alphabet except (and you'll see why later) `y` and `z` -- 24 particles labeled `a` through `x` -- and each of them can have `0` energy or `1` energy, again with a 50/50 probability of being in either state. Intuitively, which state seems less likely to occur by chance...

```
case (A):
100101011010100101011010
abcdefghijklmnopqrstuvwx
```

or

```
case (B):
111111111111000000000000
abcdefghijklmnopqrstuvwx
```

...? If you think that `case (B)` looks really unlikely, you're not alone. Most people intuit that the second case is somehow "less likely" than the first case -- even though we said each particle has a 50/50 chance of being in either state! From a purely probabilistic point of view, each of these states is *equally likely*. (It's also why you shouldn't play the lottery if you think the numbers 1-2-3-4-5 are unlikely to win. They're just as probable as any other five-number combination.)

This mis-intuition occurs because we've *ordered* the particles. If I'd given them to you as...

```
case (C):
101010101010100110101011
itjukvlwmxanbopcdqerfsgh
```

...does `case (C)` look more or less likely than `case (B)`? (They're actually exactly equivalent.)

But if we insist on *distinguishable* particles, with some ordering, then we can "process" the particles by that ordering. And then **there is a serious difference between case (A) and case (B)**. It has to do with how "surprised" we are when we see a value different from ones we've seen in the past.

Let's process `case (A)` from left-to-right (that is, alphabetically). We can begin by assuming a 50/50 probability for the `0` or `1` states. Then, as we get new information, we'll adjust our expectations. The formula for the Shannon entropy of a two-state system is:

```
H = -(p * log2(p)) - (q * log2(q))
```

...where `p` is the probability of the `0` state and `q` is the probability of the `1` state (or vice versa). If we begin by assuming that each state is equally likely, then the first bit of information (particle `a`) gives us:

```
H = -(0.5 * log2(0.5)) - (0.5 * log2(0.5))
= -(0.5 * -1) - (0.5 * -1)
= -(-0.5) - (-0.5)
= 1
```

Exactly 1 bit of information! That's not too surprising, is it? The next bit is a bit trickier, though. We now have one data point (particle `a`), and we're trying to extrapolate from that. Simple frequentist probability now tells us that since 100% of the (one) bits we've seen so far have been `1` (`a=1`), we expect all future particles to be in the `1` state:

```
H = -(p * log2(p)) - (q * log2(q))
= -(0 * (-infinity)) - 0
= problem here
```

Really what we should do is use a Bayesian estimate of the probability. Very few things in life are 100% guaranteed to happen, so the formula above falls apart for all-or-nothing probabilities. Instead of 0% / 100% probabilities, let's let those probabilities get arbitrarily close to 0 and 1, but not actually reach them. Then what happens?

```
limit of H as e -> 0
H = -(e * log2(e)) - ((1-e) * log2(1-e))
= 0 - (1 * log2(1))
= 0 - 0 = 0
```

If all the particles / bits we've seen in the past have had the same value, then we expect the next one will also have that same value. We don't need to look at it to "know" its value. The next particle gives us `0` bits of new information.
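Putting the formula and that limiting case together, the entropy of a two-state source can be sketched as a small helper (Python, for illustration; the function name is my own):

```python
import math

def binary_entropy(p: float) -> float:
    """Shannon entropy (bits) of a two-state source with probabilities p and 1 - p."""
    # The limit worked out above: p * log2(p) -> 0 as p -> 0, so
    # all-or-nothing probabilities contribute exactly zero bits.
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p)) - ((1 - p) * math.log2(1 - p))

print(binary_entropy(0.5))  # 1.0 -- a 50/50 particle carries a full bit
print(binary_entropy(1.0))  # 0.0 -- a certain particle tells us nothing
```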

This changes, though, when we see a new value. After we've seen `a` and `b` of `case (A)`, we now (by the frequentist approach) think that there's a 50/50 probability of either state, and `c` gives us `1` full bit of new information. Particle `d` is where things really get interesting, though. At this point, we've seen `a=1`, `b=0`, and `c=0`. So we assume that there's a 1/3 probability of seeing a `1` in `d`, and a 2/3 probability of a `0`:

```
H = -(1/3 * log2(1/3)) - (2/3 * log2(2/3))
= -(1/3 * -1.585) - (2/3 * -0.585)
= -(-0.528) - (-0.390)
= 0.918
```

Particle `d` gives us *slightly less than 1 bit of information*! This is because the "surprise" factor isn't so all-or-nothing, as it was when we assumed a 100% probability of seeing a `1` after `a`, but it also won't be a *total* surprise, as it was for `c`, when we had a 50/50 shot of getting either a `0` or a `1`. We've seen both `0`s and `1`s before, but we've seen slightly more `0`s. So we expect a `0`, but there's a possibility that it could be a `1` as well.

We can work our way down each bit of `case (A)` and calculate the amount (in bits) of "entropic" information contained within each particle / at each position:

```
case (A):
a:1 -> 1.000 bits (1/2 assumed probability of "1" energy state)
b:0 -> 0.000 bits (1/1 prior probability)
c:0 -> 1.000 bits (1/2 prior)
d:1 -> 0.918 bits (1/3)
e:0 -> 1.000 bits (2/4)
f:1 -> 0.970 bits (2/5)
g:0 -> 1.000 bits (3/6)
h:1 -> 0.985 bits (3/7)
i:1 -> 1.000 bits (4/8)
j:0 -> 0.991 bits (5/9)
k:1 -> 1.000 bits (5/10)
l:0 -> 0.994 bits (6/11)
m:1 -> 1.000 bits (6/12)
n:0 -> 0.996 bits (7/13)
o:0 -> 1.000 bits (7/14)
p:1 -> 0.997 bits (7/15)
q:0 -> 1.000 bits (8/16)
r:1 -> 0.997 bits (8/17)
s:0 -> 1.000 bits (9/18)
t:1 -> 0.998 bits (9/19)
u:1 -> 1.000 bits (10/20)
v:0 -> 0.998 bits (11/21)
w:1 -> 1.000 bits (11/22)
x:0 -> 0.999 bits (12/23)
```
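The table above can be generated mechanically: before reading each particle, estimate the probability of a `1` as the fraction of `1`s seen so far (starting from the assumed 1/2), then evaluate the Shannon formula at that estimate. A sketch in Python (the helper name is my own):

```python
import math

def h(p):
    """Shannon entropy (bits) of a two-state source with probabilities p, 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p)) - ((1 - p) * math.log2(1 - p))

bits = [int(c) for c in "100101011010100101011010"]  # case (A)

info = []
ones = 0
for i, b in enumerate(bits):
    prior = 0.5 if i == 0 else ones / i  # frequentist estimate of P(1) so far
    info.append(h(prior))
    ones += b

print(round(sum(info) / len(info), 2))  # about 0.95 bits per particle
```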

Each particle gives, on average, about `0.95` bits of information. Close to -- but not exactly -- `1`. If we wanted to **compress** this information, there's not much we could do on a bit-by-bit basis. However, you might notice **patterns** in the data:

```
case (A):
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx
```

This stream of bits is *clearly* compressible in that, if we break it up sequentially into 2-bit blocks, we never see the patterns `00` or `11`. This means we could encode this stream as:

```
case (A):
B  A  A  A  B  B  B  A  A  A  B  B
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx
where A = 01, B = 10
```

We would, of course, have to know this encoding in order to *decode* the data, but it would save space in storing the information. Interestingly, this representation actually contains *less* entropic information per symbol than the uncompressed version contains per bit (an average of `0.89`):

```
case (A):
ab:B -> 1.000 bits (1/2 assumed)
cd:A -> 0.000 bits (1/1 prior)
ef:A -> 1.000 bits (1/2)
gh:A -> 0.918296 bits (1/3)
ij:B -> 0.811278 bits (1/4)
kl:B -> 0.970951 bits (2/5)
mn:B -> 1.000 bits (3/6)
op:A -> 0.985228 bits (4/7)
qr:A -> 1.000 bits (4/8)
st:A -> 0.991076 bits (4/9)
uv:B -> 0.970951 bits (4/10)
wx:B -> 0.994030 bits (5/11)
```

The reason for this is the long runs of `A`s and `B`s in the encoding. This series can actually be compressed even further, as we could next allow the substitutions `C=BAA` and `D=ABB` to get:

```
case (A):
C        D        C        D
B  A  A  A  B  B  B  A  A  A  B  B
10 01 01 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx
```

This is, in a nutshell, how information compression works for digital files. You try to find common patterns in the data, and replace those patterns with shorter representations, which can later be decoded to restore the original file.
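The two-level substitution above can be sketched as a toy fixed-width dictionary coder (Python, for illustration; real compressors such as the LZ family build their dictionaries adaptively rather than from a hand-picked table):

```python
def substitute(stream: str, table: dict[str, str], width: int) -> str:
    """Replace each fixed-width block of the stream with its dictionary symbol."""
    blocks = [stream[i:i + width] for i in range(0, len(stream), width)]
    return "".join(table[block] for block in blocks)

bits = "100101011010100101011010"  # case (A)

level1 = substitute(bits, {"01": "A", "10": "B"}, 2)
print(level1)  # BAAABBBAAABB

level2 = substitute(level1, {"BAA": "C", "ABB": "D"}, 3)
print(level2)  # CDCD
```

To decode, each table is applied in reverse, which is why the tables themselves must be stored alongside (or agreed upon ahead of) the compressed data.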

This works well for the stream of bits above because, at each step, there are only two (or fewer) possible "super-patterns". These could each be encoded into a single bit. At each step, the data would be compressed by 50%. If we had a single `00` or `11` string in the original bit pattern, though, the compression wouldn't work:

```
case (D):
??       D        C        D
B  A  ?  A  B  B  B  A  A  A  B  B
10 01 11 01 10 10 10 01 01 01 10 10
ab cd ef gh ij kl mn op qr st uv wx
```

This single `11` pattern means we need a new pattern at the first level of compression, which would require an extra bit, defeating the purpose of compression in the first place. This means we can't compress the data in 2-bit packages *at all*. (Though it's possible that some other coding mechanism could do the trick.)

I said earlier that there was an obvious difference between `case (A)` and `case (B)`, and we can see it clearly if we calculate the amount of entropy contained in each particle for `case (B)`:

```
case (B):
a:1 -> 1.000 bits (1/2 assumed probability of "1" energy state)
b:1 -> 0.000 bits (1/1 prior probability)
c:1 -> 0.000 bits (2/2 prior)
d:1 -> 0.000 bits (3/3)
e:1 -> 0.000 bits (4/4)
f:1 -> 0.000 bits (5/5)
g:1 -> 0.000 bits (6/6)
h:1 -> 0.000 bits (7/7)
i:1 -> 0.000 bits (8/8)
j:1 -> 0.000 bits (9/9)
k:1 -> 0.000 bits (10/10)
l:1 -> 0.000 bits (11/11)
m:0 -> 0.000 bits (12/12)
n:0 -> 0.391244 bits (12/13)
o:0 -> 0.591673 bits (12/14)
p:0 -> 0.721928 bits (12/15)
q:0 -> 0.811278 bits (12/16)
r:0 -> 0.873981 bits (12/17)
s:0 -> 0.918296 bits (12/18)
t:0 -> 0.949452 bits (12/19)
u:0 -> 0.970951 bits (12/20)
v:0 -> 0.985228 bits (12/21)
w:0 -> 0.994030 bits (12/22)
x:0 -> 0.998636 bits (12/23)
```

...only `0.43` bits of information, on average, per particle. Long streams of `0`s and `1`s are also ripe for compression.
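That `0.43` figure comes from the same running frequentist estimate used for `case (A)`; a self-contained check in Python:

```python
import math

def h(p):
    """Shannon entropy (bits) of a two-state source with probabilities p, 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p)) - ((1 - p) * math.log2(1 - p))

bits = [int(c) for c in "111111111111000000000000"]  # case (B)

total, ones = 0.0, 0
for i, b in enumerate(bits):
    total += h(0.5 if i == 0 else ones / i)  # entropy at the prior estimate
    ones += b

print(round(total / len(bits), 2))  # 0.43
```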

## Summary

So what have we learned? Information content, at a physical level, is related to the disorder inherent in a system. The more disordered a system, the more information it contains. A file consisting of a million zeroes is perfectly ordered, but it contains little information.

Data compression relies on the fact that, within files, there are sometimes repeated strings of bits or characters which can be encoded so as to minimize the size of the encoded file. "Information density" (in the physical / entropic sense) contained within the bits of a digital file is maximized when there is no discernible pattern or regular, repeated run of bits / characters. Although two files may contain the same *number* of `1`s and `0`s, requiring the same amount of *storage space* on disk, the order in which those `1`s and `0`s appear in the file is crucial. Maximum information density presents itself as a file full of completely random `1`s and `0`s.