This post was originally published at https://www.yourdevopsguy.com/merkle-trees-and-bitcoin.
Merkle trees, also known as binary hash trees, are a type of binary tree. They can be used to efficiently verify the integrity of large sets of data. They are used in both blockchain and non-blockchain-related projects.
Merkle trees are build from the bottom up:
- The leaves of the tree contain the hash of each element in our dataset. This dataset could represent a list of transactions, parts of a file, etc.
- Each internal node is the result of hashing the concatenation of the hashes of its two children.
- Hash functions
- Properties of a Merkle tree
- What problems can a Merkle tree solve?
- Known vulnerability
- Sparse Merkle Trees
Hash functions take as input any piece of data (a string, an image, etc.) and produce a fixed-length output. An example of a hash function is SHA-256, which takes data of arbitrary size and produces an output that is 256 bits long.
Good hash functions have the following properties:
- Uniformity. They produce evenly distributed outputs in the possible output range. In the SHA-256 example, this range would go from 0000...0000 to 1111...1111.
- Deterministic. The same input will always generate the same output.
- Fast to compute.
Cryptographic hash functions need additional properties to make them more secure:
- One-way functions. Given an input x, they generate an output h(x). From an output h(x), it is computationally infeasible to find the input it came from, x.
- Collision resistant. The likelihood of 2 different inputs x, y producing the same output h(x)=h(y) should be negligible.
- Resistant to preimage attacks. Given an output h(x), it is infeasible to find an input y such as h(x) = h(y).
- A minor modification in the input will produce a completely different output. This is known as the avalanche effect.
The properties of a Merkle tree derive from its structure and its use of hash functions:
- Data integrity. The root of the tree will change if any node in the tree is modified. Even the same dataset, with its elements presented in a different order (ex: [T2, T1, T3, T4] instead of [T1, T2, T3, T4]) will produce a different tree.
- Partial verification. We do not need the full tree to verify whether a transaction belongs to it. We only need a path that starts in one of the leaves and goes all the way up to the root (proof of inclusion).
- The height of a tree of N leaves is ceil(log2(N)). The function ceil(x), known as ceil, maps x to the smallest integer greater or equal than x. Ex: ceil(3.14) = 4 and ceil(3) = 3.
I will use Bitcoin to illustrate how Merkle trees can verify if a transaction belongs to a block. In case you aren’t familiar with blockchains, we can oversimplify things imagining that:
A blockchain is a distributed linked list of elements called blocks.
- A list of transactions.
- A header that stores metadata.
- Part of this metadata is the root of the block’s Merkle tree, called Merkle root.
Each block points to the hash of its predecessor, to ensure immutability. If the previous block in the chain changes, even by one bit, its hash would change and it would break the chain.
Imagine you are running an app that needs to verify if a transaction T belongs to a block B. In Bitcoin, full nodes store a block’s Merkle root along with all its transactions. For them, it is easy to verify if T belongs to B. Also, they can generate proofs for any transaction in a block on demand.
What if the app runs on a device with limited disk, computing power, and bandwidth? To determine whether T belongs to B, the device needs information. It can retrieve this information from other participants in the network.
The following problems arise:
- How can we trust that the data we receive from other nodes is correct? Data might be corrupted and other nodes in the network can send fake data.
- Given the limited resources, how can we minimize the amount of information we need to receive from the network?
Merkle trees can help to solve these problems. From the properties of Merkle trees we covered above:
- Data integrity solves problem 1. Let’s assume we receive different pieces of information from a series of participants in the network. We have B, which contains its Merkle root. The data we receive from the network is used to build a Merkle tree whose root must be the same as B‘s Merkle root. Otherwise, we cannot trust the data that others have shared with us.
- Partial verification and having a height of ceil(log2(N)) solve problem 2. Hashing all the transactions would also work to verify data integrity. However, that does not scale well, because proof generation and verification would be O(N) operations.
Simplified Payment Verification(SPV) nodes, commonly known as lightweight nodes or light clients, do not store full blocks. They interact with the network to request the information they need to verify transactions. They can assess whether a transaction belongs to a block requesting an inclusion proof and the block header, where the Merkle root is stored.
The algorithm to generate a Merkle root from a list of transactions is as follows:
- The leaves of a Merkle tree contain the hashes of the transactions that belong to a block. In the image above, [T1, T2, T3, T4, T5, T6] represent transactions.
- From them, we generate the leaves of the Merkle tree using a hash function. If the number of transactions is odd, we duplicate the last node.
- We take 2 consecutive leaves and hash the concatenation of their hashes. This generates their parent. In the image, T1’s and T2’s parent is equal to hash(hash(T1) + hash(T2)).
- We keep repeating this process with all the transactions at the same “level”. These parents will be siblings that will form a new level in the tree hierarchy.
- We keep repeating this process until we only have one element left, the root of the tree.
You can find here my implementation of a Merkle tree. It can verify a Bitcoin block’s Merkle root and generate and verify proofs of inclusion.
Bitcoin’s Merkle tree duplicates the last node in levels with an odd number of nodes. Also, if Bitcoin finds a block that is not valid, it caches its root to avoid trying to mine it again. This combination makes the tree susceptible to second preimage attacks: for an input x, we can find a second input y, different from x, such that h(x) = h(y).
How does this design suffer from this? Let’s go back to our first example, based on the tree represented on the image below:
- The leaves of the tree come from a block B1 containing these transactions [T1, T2, T3, T4, T5, T6]
- The level right above the leaves is [hash(hash(T1) + hash(T2)), hash(hash(T3) + hash(T4)), hash(hash(T5) + hash(T6))]
- Since we have an odd number of nodes, we duplicate the last one. Hence, the next level changes to [..., ..., hash(hash(T5) + hash(T6)), hash(hash(T5) + hash(T6))]
- This is equal to the level produced by a block B2 with the following transactions [T1, T2, T3, T4, T5, T6, T5, T6].
B2 contains duplicate transactions, making the block invalid. Nodes would cache this root so that next time they find a block with this root they discard it. Both B1 and B2 produce the same root. Thus, nodes will reject B1, even though B1 might be a valid block.
There is another known vulnerability in Bitcoin’s original Merkle trees (already fixed). Understanding it requires more knowledge of how Bitcoin builds transactions. Since this is an introductory article, I have decided to leave it out. However, I might cover transactions in more depth in future articles and revisit this vulnerability.
A Sparse Merkle Tree (SMT) is a variation of a Merkle Tree, where every leaf node has an index along with its value. All leaves are initialized to an empty (null) value. By design, SMTs have 2^256 leaves.
The Merkle tree I described above can verify that a transaction is part of the block by exploring the path from its corresponding leaf to the root of the tree.
However, they cannot be used to prove that a transaction does not belong to a block. And in general, to prove that elements do not belong to a dataset. SMTs can do this.
SMTs can produce proofs of non-inclusions in an efficient manner. The idea is that we know the index of every element. Thus, you know which leaf contains an element. If an element is not part of the tree, the leaf where the element would be stored will be null. Therefore, to produce a proof of non-inclusion, we produce a proof of inclusion generated from that node initialized to null.
Proofs of inclusion work in the same way as Merkle trees’.
Updating a leaf an SMT is also efficient because we only need to modify the path that goes from that leaf to the root. Since SMTs have 2^256 leaves, only 256 nodes need modification.
SMTs deserve their own article, but I wanted to at least introduce the basic idea behind them.
Now you should have a clearer idea of what Merkle trees are and what they are used for. If you want to learn more, I suggest you investigate how:
- Git uses them to keep track of changes in files.
- DynamoDB uses them to identify any inconsistencies between nodes.
I hope you found this article helpful. Share it, because you could help someone pass their exams or get a job.