Marcello La Rocca

Posted on Oct 11, 2020

Ternary Search Trees

#algorithms #100daysofcode #beginners #computerscience

As good as they are, tries are not perfect, and allow for further improvement. In this post we will focus on one of their variants, ternary search trees, that trades search performance for greater memory-efficiency in storing nodes' children.

mlarocca / AlgorithmsAndDataStructuresInAction

Advanced Data Structures Implementation

What's Wrong with Tries?

Tries offer extremely good performance for many string-based operations. Due to their structure, though, they are meant to store an array of children (or a dictionary) for each node. This can quickly become expensive: the total number of edges for a trie with n elements can swing anywhere between |∑|*n and |∑|*n*m, where m is the average word length, depending on the degree of overlap of common prefixes.

We can use associative arrays, dictionaries in particular, in the implementation of nodes, thus only storing edges that are not-null. For instance, we can start with a small hash table, and grow it as we add more keys. But, of course, this solution comes at a cost: not only the cost to access each edge (that can be the cost of hashing the character, plus the cost of resolving key conflicts), but also the cost for resizing the dictionary when new edges are added.
Moreover, any data structure has an overhead in terms of memory it needs: in Java, each empty HashMap needs around 36 bytes plus some 32 byte for each entry stored (without considering the actual storage for each key/value), plus 4 bytes times the set's capacity.

An alternative solution that can help us reduce the space overhead associated with tries nodes is the ternary search trie (TST).
Take a look at an example of a TST, storing the following words: [“an”, “and”, “anti”, “end”, “so”, “top”, “tor”].
Only filled (red) nodes are "key nodes", those correspond to words stored in the tree, while empty (white) vertices are just internal nodes.

What's Good in Binary Search Trees?

Similarly to tries, nodes in a TST also need to store a Boolean value, to mark key nodes.
The first difference that you can spot, with respect to a trie, is that a TST stores characters in nodes, not in edges.
As a matter of fact, each TST node stores exactly three edges: to left, right, and middle children.

The "ternary search” part of the name should ring a bell, right? Indeed, TSTs work somehow similarly to BSTs, only with three links instead of two. This is because they associate a character to each node, and while traversing the tree, we will choose which branch to explore based on how the next character in the input string compares to the current node’s char.

Similarly to BSTs, in TSTs the three outgoing edges of a node N partition the keys in its subtree; if N, holds character c, and its prefix in the tree (the middle-node-path from the TST’s root to N, as we’ll see) is the string s, then the following invariants hold:

All keys sL stored in the left subtree of N starts with s, are longer (in terms of number of characters) than s, and lexicographically less than s+c: sL < s+c (or, to put it in another way, the next character in sL is lexicographically less than c: if |s|=m, sL[m] < c).
All keys sR stored in the right subtree of N starts with s, are longer than s, and lexicographically greater than s+c: sR > s+c.
All keys in the middle subtree of N start with s+c.

This is best illustrated with an example: check out the graphic above and try to work out, for each node, the sets of sub-strings that can be stored in its 3 branches.

Partitioning Keys

For instance, let's take the root of the tree:

Root's middle branch contains all keys starting with 'e';
The left branch contains all keys whose first character precedes 'e' in lexicographic ordering: so, considering only lower-case letters in he English alphabet, one of 'a', 'b', 'c', 'd';
Finally the right branch, which contains all keys that starts with letters from 'f' to 'z'.

When we traverse a TST, we keep track of a "search string", as we do with tries: for each node N, it's the string that could be stored in N, and it's determined by the path from the root to N. The way we build this search string is, however, very different with respect to tries.

As you can see from the example above, a peculiarity of TSTs is that a node's children have different meanings/functions.

The middle child is the one followed on characters match. It links a node N, whose path from root forms the string s, to a sub-tree containing all the stored keys that starts with s. When following the middle node we move one character forward in the search string.

The left and right children of a node, instead, doesn't let us advance in our path. If we had found i characters in a path from the root to N (i.e. we followed i middle links during traversal from root to N), and we traverse a left or right link, the current search string remains of length i.

Above, you can see an example of what happens when we follow a right-link. Differently from middle-links, we can't update the search string, so if on the left half current node corresponded to the word "and", on the right half the highlighted node, whose character is 't', corresponds to "ant": notice that there is no trace of traversing the previous node, holding 'd' (as there is also no trace of the root node, and it's like we didn't go through it, because our path had traversed a left-link from root to get to current node).

Left and right links, in other words, correspond to branching points after a common prefix: for "ant" and "and", for instance, after the first two characters (that can be stored only once, in the same path) we will need to branch out, to store both alternatives.

Which one gets the middle-link, and which one the left or right link? This is not determined beforehand, it only depends on the order they are inserted: first come, first serve! In the figure above, "and" was apparently stored before "anti".

Analysis

Well, TSTs look like a cool alternative to tries, but being cool is not enough for learning and implementing a new data structure with the same use case as another one in our toolbelt that's already well-tested and working fine, right? So we should talk a bit more about why we might want to prefer s TST.

Space

So, the question now arises: how many links (and nodes) are created for such TST?
To answer that, suppose we want to store n keys whose average length is w.
Then we can say that:

The minimum number of links needed we'll be w + n - 2: this is when all words share a prefix of length w-1, we have a middle-node path of |w-2| characters (and |w-1| nodes) from the root, and then we'll branch out n times at the very last character (with exactly 1 middle link, plus n-1 left/right links). An example of this edge case is shown in the figure below, with n=5 and w=4.
The worst case scenario happens when no two words share a common prefix, we need to branch out at the root, and then for each word we'll have w-2 middle-links, for a total of n*(w-1) links. This is shown in the figure below, with n=5 and w~=4.

Now that we have an idea of how to build these trees, in the next section, we will take a close look at search.

All other operations can be derived from tries in the same way, and can be implemented starting with a successful/unsuccessful search, or slightly modifying search.

Running Time

Performance-wise, a search hit or a search miss need, in the worst case, to traverse the longest path from the root to a leaf (there is no backtracking, so the longest path is a reliable measure of the cost of the worst case). That means search can perform at worst |A| * m characters comparisons (for completely skewed trees), where |A| is the size of the alphabet and m is the length of the searched string. Being the alphabet's size a constant, we can consider the time required for a successful search to be O(m) for a string of length m, and only differs for a constant factor from the trie's homologous.
It is also provable that, for a balanced TST storing n keys, a search miss requires O(log n) character comparisons at most (which is relevant for large alphabets, if |A| * m > n).

For remove: it can be performed as a successful search followed by some maintenance (performed during backtracking, it doesn't affect asymptotic analysis), and so its running time is also O(m) in the best case scenario, and an amortized O(log(n)) for unsuccessful removal.

Finally add: it's also a search (either successful or unsuccessful) followed by the creation of a node chain with at most m nodes. Its running time is, then, also O(|A|*m).

Conclusions

TSTs are a valid alternative to tries, trading a slightly worse constant in their running time with an effective saving in the memory used.

Both adhere to the same interface, and allow to implement efficiently some interesting searches on sets of strings.

The way TSTs are implemented, however, allow for a trade-off between memory (which can be considerably less that the one needed for a trie storing the same set of strings) and speed, where both data structures have the asymptotic behavior, but TSTs are a constant factor slower than tries.

In the next post in the series, we'll talk about a Java implementation for TSTs, so stay tuned because the most interesting material is still to come.

DEV Community