Huffman coding is a pretty straight-forward lossless compression algorithm first described in 1992 by David Huffman. It utilizes a binary tree as its base and it's quite an easy to grasp algorithm. In this post we walk through implementing a Huffman coder and decoder from scratch using Elixir ⚗️

## How does Huffman work?

In this example we assume the data we are compressing is a piece of text, as text lends itself very well for compression due to repetition of characters.

The algorithm in its core works by building a binary tree based on the frequency of the individual characters. Placing characters with a higher frequency closer to the root of the tree than characters with a lower one.

The Huffman code can be derived by walking the tree until we find the character. Every time we pick the left-child, we write down a `0`

and for every right-child a `1`

. Repeat the process for each character in the text, and voila!

Decoding the text is as simple as starting at the root of the tree and traverse down the left-child for every `0`

you encounter and pick the right-child for every `1`

. Once you hit a character, write it down and start over with the remainder of the bits.

For an excellent explanation on the Huffman Coding algorithm, give this explainer video by Tom Scott a watch!

## Let's get to work: Setting up the project

Ok! Let's begin by generating a new Mix project using `mix new huffman`

. `cd huffman`

into the project root and open up the `lib/huffman.ex`

file. This file contains our `Huffman`

module. Replace the contents of the file with a single function declaration `encode/1`

like so:

```
defmodule Huffman do
def encode(text \\ "cheesecake") do
end
end
```

This function accepts the text we like to compress and defaults to "cheesecake" because, why not 🤷

## Frequency analysis

The first step in Huffman coding is a simple frequency analysis. For each character in the given text, count how many times it is used in the text. This is a rather crucial part of the encoding algorithm as this determines where the characters are placed in the binary tree.

For example; when encoding the word "cheesecake" we can already see that certain characters appear more than others. `e`

is used four times whereas the `c`

appears only twice.

Let's write some code that does the frequency analysis for us. Update our `encode/1`

function to match the following:

```
def encode(text \\ "cheesecake") do
frequencies =
text
|> String.graphemes()
|> Enum.reduce(%{}, fn char, map ->
Map.update(map, char, 1, fn val -> val + 1 end)
end)
end
```

This code above splits the input string into graphemes (individual characters) and stores the count per character in a map using Map.update/4.

Popping into an `iex`

shell and running `Huffman.encode`

now gives us the following output:

```
%{
"a" => 1,
"c" => 2,
"e" => 4,
"h" => 1,
"k" => 1,
"s" => 1
}
```

## Sorting the characters

Since order is important when building our tree, we need to sort the list of characters by their frequency. Luckily Elixir's Enum module comes to the rescue. Add the following code to our `Huffman.encode/1`

function:

```
queue =
frequencies
|> Enum.sort_by(fn {_char, frequency} ->
frequency
end)
```

By passing the map with frequencies to `Enum.sort_by/2`

, we sort the map using its values (the frequencies). This outputs:

```
[
{"a", 1},
{"h", 1},
{"k", 1},
{"s", 1},
{"c", 2},
{"e", 4}
]
```

## Building the tree

Before we can neatly build a tree, we need to determine which kinds of nodes we have. We have a `Leaf`

node which holds a single character and a `Node`

which contains a left- and a right child. Each child on its own can be another `Node`

or a `Leaf`

.

Let's define two structs to model these types of nodes at the top or our module:

```
defmodule Node do
defstruct [:left, :right]
end
defmodule Leaf do
defstruct [:value]
end
```

*Note:* Make sure you define the two structs within our current `Huffman`

module to avoid naming clashes with the built-in `Node`

.

The next step is to iterate over our list of sorted characters and convert them to `Leaf`

nodes. Extend our `Huffman.encode/1`

function like this:

```
queue =
frequencies
|> Enum.sort_by(fn {_node, frequency} -> frequency end)
+ |> Enum.map(fn {value, frequency} ->
+ {
+ %Leaf{value: value},
+ frequency
+ }
+ end)
```

Now that we have an ordered set of `Leaf`

nodes, we can build up the rest of the binary tree. Define a new function to do exactly this:

```
defp build([{root, _freq}]), do: root
defp build(queue) do
[{node_a, freq_a} | queue] = queue
[{node_b, freq_b} | queue] = queue
new_node = %Node{
left: node_a,
right: node_b
}
total = freq_a + freq_b
queue = [{new_node, total}] ++ queue
queue
|> Enum.sort_by(fn {_node, frequency} -> frequency end)
|> build()
end
```

Since this is a recursive function we need an exit condition, otherwise we are looping forever. In our case we are done building the tree when there is only a single root-node in the queue. The second clause is a bit more beefy, let's break it down:

First thing we do is pop two nodes of the queue by pattern matching the tuple

`{node, frequency}`

as the head of the list, and matching the*rest*of the list again as`queue`

. Doing this twice gives us two nodes and their frequencies we can work with.The next step is combining these two nodes in a parent

`Node`

and tallying up the frequencies so we can prepend it to the queue.The next step is sorting the queue again to make sure the items with the lowest frequency are at the head of the queue.

The last step is to call

`build/1`

again so this process starts all over again.

Calling our `build/1`

function at the end of our `Huffman.encode/1`

function passing the queue of items results in:

```
%Huffman.Node{
left: %Huffman.Leaf{value: "e"},
right: %Huffman.Node{
left: %Huffman.Leaf{value: "c"},
right: %Huffman.Node{
left: %Huffman.Node{
left: %Huffman.Leaf{value: "k"},
right: %Huffman.Leaf{value: "s"}
},
right: %Huffman.Node{
left: %Huffman.Leaf{value: "a"},
right: %Huffman.Leaf{value: "h"}
}
}
}
}
```

The output above shows a tree structure where the character `e`

is placed more towards the root of the tree and the character `a`

is placed more towards the bottom. Each node in our tree is either a `Leaf`

with the actual character or another `Node`

. Drawn out this looks like:

## Encoding the text

Now that we have the Huffman tree all setup and ready to go, let's focus on encoding the text using Huffman coding. The process of doing so is quite simple:

Look up each character of the text in our binary tree and keep track of each step while traversing down the tree. Each time we traverse down the left-child, write down a `0`

and each time we traverse down the right-child write down a `1`

. Once we find the character in the tree, the path we wrote down is our Huffman code.

### An example

So for example, in our tree above, the letter "c" has the code `10`

since at the first node we take the right branch and then the left branch and voila. Similarly for the character "s" the code is `1101`

since for the first two nodes, we take the right branch, then the left and lastly the right one again.

### Code

First things first, we should be able to look up a character in the binary tree and keep track of its path while doing so.

```
defp find(tree, character, path \\ <<>>)
defp find(%Leaf{value: value}, character, path) do
case value do
^character -> path
_ -> nil
end
end
defp find(%Node{left: left, right: right}, character, path) do
find(left, character, <<path::bitstring, 0::size(1)>>) ||
find(right, character, <<path::bitstring, 1::size(1)>>)
end
```

In the code example above we defined a function with two different clauses and a function header. The function header is just so that we can define a default value for the path which we set to an empty binary.

The first clause executes when passed in a `Leaf`

node. The only thing we need to do is compare its value to the character we are looking for. When they match, return the path, else return `nil`

.

The second clause matches on `Node`

s and calls `find/3`

with each child of the node, passing in an updated path. So when recursing on the left-child, we update the `path`

with a `0`

. The `||`

(or) operator makes sure that either of the two paths is returned.

Now that we can look up a single character in our binary tree, let's iterate over all the characters in the text and replace them with their Huffman code. Add the following function to our `Huffman`

module:

```
defp convert(text, tree) do
text
|> String.graphemes()
|> Enum.reduce(<<>>, fn character, binary ->
code = find(tree, character)
<<binary::bitstring, code::bitstring>>
end)
end
```

This function simply breaks the text into graphemes and iterates over each grapheme by calling `Enum.reduce/3`

. Starting off with an empty binary, we can simply append to it for each character we find.

Now wire it all up and call our `convert/2`

function at the bottom of our `Huffman.encode/1`

function passing the original text and the tree.

The `encode/1`

function now looks like:

```
def encode(text \\ "cheesecake") do
frequencies =
text
|> String.graphemes()
|> Enum.reduce(%{}, fn char, map ->
Map.update(map, char, 1, fn val -> val + 1 end)
end)
queue =
frequencies
|> Enum.sort_by(fn {_node, frequency} -> frequency end)
|> Enum.map(fn {value, frequency} -> {%Leaf{value: value}, frequency} end)
tree = build(queue)
{tree, convert(text, tree)}
end
```

As you can see it returns a tuple containing the tree and the encoded text. It returns the tree because we need this in the next step to decode the text to its original form again.

Calling `Huffman.encode`

now results in the following:

```
{
%Huffman.Node{
left: %Huffman.Leaf{value: "e"},
right: %Huffman.Node{
left: %Huffman.Leaf{value: "c"},
right: %Huffman.Node{
left: %Huffman.Node{
left: %Huffman.Leaf{value: "k"},
right: %Huffman.Leaf{value: "s"}
},
right: %Huffman.Node{
left: %Huffman.Leaf{value: "a"},
right: %Huffman.Leaf{value: "h"}
}
}
}
}, <<188, 213, 216>>}
```

Where the second element of the tuple (`<<188, 213, 216>>`

) is our compressed data.

### Quick size comparison

```
iex(1)> <<188, 213, 216>> |> bit_size()
24 # variable bit length per char, but max 4 bits in this scenario
iex(2)> "cheesecake" |> bit_size()
80 # 10 chars * 8 bits per char = 80 bits
```

Using our Huffman module, encoding the string "cheesecake" results in a binary blob of *only* 24 bits. Each character in our binary tree is reachable in max. four steps, so each character is encoded in 4 bits or less. Using ASCII encoding each character takes 8 bits.

However, in order to decompress the text, we need access to the same tree we used to compress the text. Meaning that if we want to send the compressed data over the wire or store it to disk, we need need to account for additional space for the tree too.

## Decoding

Now that we have coded ourselves the ability to encode text using Huffman coding, we also should have a way to decode the data back to its original form.

Decoding the data is quite straight-forward: all we need is our original Huffman tree and the encoded binary blob from the previous steps.

We use the binary blob as some sort of turn-by-turn navigation through our binary tree. From the top of the tree, traverse down the left-child when encountering a `0`

in the data and traverse down the right-child when finding a `1`

.

Repeat this process until we hit a `Leaf`

node. Write down the value of the `Leaf`

and repeat with the remainder of the bits.

### Code

First things first, we need a function that can walk the tree following the instructions in the compressed data, until we hit a `Leaf`

node.

```
defp walk(binary, %Leaf{value: value}), do: {binary, value}
defp walk(<<0::size(1), rest::bitstring>>, %Node{left: left}) do
walk(rest, left)
end
defp walk(<<1::size(1), rest::bitstring>>, %Node{right: right}) do
walk(rest, right)
end
```

In the code above we define a `walk/2`

function that takes the encoded binary and the tree and walks the tree. For every `Node`

it encounters it looks at the next *bit* of the data to determine whether it should traverse down the first-child (found a `0`

) or traverse down the right-child (found a `1`

).

It keeps doing that until it finds a `Leaf`

node, then it returns the remainder of the data and the character that belongs to that code.

Now that we have a function that can walk to the first available `Leaf`

node following the instructions of the data, we want to call this function recursively as long as there's more binary data available.

Let's define another function:

```
def decode(tree, data, result \\ [])
def decode(_tree, <<>>, result),
do: List.to_string(result)
def decode(tree, data, result) do
{rest, value} = walk(data, tree)
decode(tree, rest, result ++ [value])
end
```

The `decode/3`

function above has two clauses. The last one basically walks the tree until it finds the first `Leaf`

node, stores its character value in a list (`result`

) and recurses with the remainder of the binary data.

The first clause matches once we have no binary data left to decode. It passes the result to `List.to_string/1`

concatenating the items in the list together forming a string.

## The end result

That's all there is to it! We built ourselves a Huffman Coder and Decoder using ~90 lines of Elixir:

```
defmodule Huffman do
defmodule Node do
defstruct [:left, :right]
end
defmodule Leaf do
defstruct [:value]
end
def encode(text \\ "cheesecake") do
frequencies =
text
|> String.graphemes()
|> Enum.reduce(%{}, fn char, map ->
Map.update(map, char, 1, fn val -> val + 1 end)
end)
queue =
frequencies
|> Enum.sort_by(fn {_node, frequency} -> frequency end)
|> Enum.map(fn {value, frequency} -> {%Leaf{value: value}, frequency} end)
tree = build(queue)
{tree, convert(text, tree)}
end
def decode(tree, data, result \\ [])
def decode(_tree, <<>>, result),
do: List.to_string(result)
def decode(tree, data, result) do
{rest, value} = walk(data, tree)
decode(tree, rest, result ++ [value])
end
defp walk(binary, %Leaf{value: value}), do: {binary, value}
defp walk(<<0::size(1), rest::bitstring>>, %Node{left: left}) do
walk(rest, left)
end
defp walk(<<1::size(1), rest::bitstring>>, %Node{right: right}) do
walk(rest, right)
end
defp build([{root, _freq}]), do: root
defp build(queue) do
[{node_a, freq_a} | queue] = queue
[{node_b, freq_b} | queue] = queue
new_node = %Node{
left: node_a,
right: node_b
}
total = freq_a + freq_b
queue = [{new_node, total}] ++ queue
queue
|> Enum.sort_by(fn {_node, frequency} -> frequency end)
|> build()
end
defp find(tree, character, path \\ <<>>)
defp find(%Leaf{value: value}, character, path) do
case value do
^character -> path
_ -> nil
end
end
defp find(%Node{left: left, right: right}, character, path) do
find(left, character, <<path::bitstring, 0::size(1)>>) ||
find(right, character, <<path::bitstring, 1::size(1)>>)
end
defp convert(text, tree) do
text
|> String.graphemes()
|> Enum.reduce(<<>>, fn character, binary ->
code = find(tree, character)
<<binary::bitstring, code::bitstring>>
end)
end
end
```

## Next steps

There's several improvements we can make to this implementation. For example, when encoding the text and looking up the Huffman code in the tree, we don't have to walk the tree for every character. We can do a full traversal once and keep track of each character and its Huffman code in a map.

Also building up the queue can be improved. We don't have to keep sorting the queue after every insert, rather have a look at a proper priority queue.

For an implementation that also serializes the tree to be written to disk or network, head over to the GitHub repository

Thanks for reading! 🤘

_{Originally posted at https://tiemenwaterreus.com/posts/compession-and-huffman-coding-with-elixir/}

## Discussion