As good as they are, tries are not perfect, and allow for further improvement. In this post we will focus on one of their variants, ternary search trees, that trades search performance for greater memoryefficiency in storing nodes' children.
mlarocca / AlgorithmsAndDataStructuresInAction
Advanced Data Structures Implementation
What's Wrong with Tries?
Tries offer extremely good performance for many stringbased operations. Due to their structure, though, they are meant to store an array of children (or a dictionary) for each node. This can quickly become expensive: the total number of edges for a trie with n
elements can swing anywhere between β*n
and β*n*m
, where m
is the average word length, depending on the degree of overlap of common prefixes.
We can use associative arrays, dictionaries in particular, in the implementation of nodes, thus only storing edges that are notnull. For instance, we can start with a small hash table, and grow it as we add more keys. But, of course, this solution comes at a cost: not only the cost to access each edge (that can be the cost of hashing the character, plus the cost of resolving key conflicts), but also the cost for resizing the dictionary when new edges are added.
Moreover, any data structure has an overhead in terms of memory it needs: in Java, each empty HashMap
needs around 36 bytes plus some 32 byte for each entry stored (without considering the actual storage for each key/value), plus 4 bytes times the set's capacity.
An alternative solution that can help us reduce the space overhead associated with tries nodes is the ternary search trie (TST).
Take a look at an example of a TST, storing the following words: [βanβ, βandβ, βantiβ, βendβ, βsoβ, βtopβ, βtorβ].
Only filled (red) nodes are "key nodes", those correspond to words stored in the tree, while empty (white) vertices are just internal nodes.
What's Good in Binary Search Trees?
Similarly to tries, nodes in a TST also need to store a Boolean value, to mark key nodes.
The first difference that you can spot, with respect to a trie, is that a TST stores characters in nodes, not in edges.
As a matter of fact, each TST node stores exactly three edges: to left, right, and middle children.
The "ternary searchβ part of the name should ring a bell, right? Indeed, TSTs work somehow similarly to BSTs, only with three links instead of two. This is because they associate a character to each node, and while traversing the tree, we will choose which branch to explore based on how the next character in the input string compares to the current nodeβs char.
Similarly to BSTs, in TSTs the three outgoing edges of a node N
partition the keys in its subtree; if N
, holds character c
, and its prefix in the tree (the middlenodepath from the TSTβs root to N
, as weβll see) is the string s
, then the following invariants hold:
 All keys
sL
stored in the left subtree ofN
starts withs
, are longer (in terms of number of characters) thans
, and lexicographically less thans+c
:sL < s+c
(or, to put it in another way, the next character insL
is lexicographically less thanc
: ifs=m
,sL[m] < c
).  All keys
sR
stored in the right subtree ofN
starts withs
, are longer thans
, and lexicographically greater thans+c
:sR > s+c
.  All keys in the middle subtree of
N
start withs+c
.
This is best illustrated with an example: check out the graphic above and try to work out, for each node, the sets of substrings that can be stored in its 3 branches.
Partitioning Keys
For instance, let's take the root of the tree:
 Root's middle branch contains all keys starting with
'e'
;  The left branch contains all keys whose first character precedes
'e'
in lexicographic ordering: so, considering only lowercase letters in he English alphabet, one of'a', 'b', 'c', 'd'
;  Finally the right branch, which contains all keys that starts with letters from
'f'
to'z'
.
When we traverse a TST, we keep track of a "search string", as we do with tries: for each node N
, it's the string that could be stored in N
, and it's determined by the path from the root to N
. The way we build this search string is, however, very different with respect to tries.
As you can see from the example above, a peculiarity of TSTs is that a node's children have different meanings/functions.
The middle child is the one followed on characters match. It links a node N
, whose path from root forms the string s
, to a subtree containing all the stored keys that starts with s
. When following the middle node we move one character forward in the search string.
The left and right children of a node, instead, doesn't let us advance in our path. If we had found i
characters in a path from the root to N
(i.e. we followed i
middle links during traversal from root to N
), and we traverse a left or right link, the current search string remains of length i
.
Above, you can see an example of what happens when we follow a rightlink. Differently from middlelinks, we can't update the search string, so if on the left half current node corresponded to the word "and"
, on the right half the highlighted node, whose character is 't'
, corresponds to "ant"
: notice that there is no trace of traversing the previous node, holding 'd'
(as there is also no trace of the root node, and it's like we didn't go through it, because our path had traversed a leftlink from root to get to current node).
Left and right links, in other words, correspond to branching points after a common prefix: for "ant"
and "and"
, for instance, after the first two characters (that can be stored only once, in the same path) we will need to branch out, to store both alternatives.
Which one gets the middlelink, and which one the left or right link? This is not determined beforehand, it only depends on the order they are inserted: first come, first serve! In the figure above, "and"
was apparently stored before "anti"
.
Analysis
Well, TSTs look like a cool alternative to tries, but being cool is not enough for learning and implementing a new data structure with the same use case as another one in our toolbelt that's already welltested and working fine, right? So we should talk a bit more about why we might want to prefer s TST.
Space
So, the question now arises: how many links (and nodes) are created for such TST?
To answer that, suppose we want to store n
keys whose average length is w
.
Then we can say that:

The minimum number of links needed we'll be
w + n  2
: this is when all words share a prefix of lengthw1
, we have a middlenode path of w2 characters (and w1 nodes) from the root, and then we'll branch outn
times at the very last character (with exactly 1 middle link, plusn1
left/right links). An example of this edge case is shown in the figure below, withn=5
andw=4
. 
The worst case scenario happens when no two words share a common prefix, we need to branch out at the root, and then for each word we'll have
w2
middlelinks, for a total ofn*(w1)
links. This is shown in the figure below, withn=5
andw~=4
.
Now that we have an idea of how to build these trees, in the next section, we will take a close look at search
.
All other operations can be derived from tries in the same way, and can be implemented starting with a successful/unsuccessful search, or slightly modifying search.
Running Time
Performancewise, a search hit or a search miss need, in the worst case, to traverse the longest path from the root to a leaf (there is no backtracking, so the longest path is a reliable measure of the cost of the worst case). That means search
can perform at worst A * m
characters comparisons (for completely skewed trees), where A
is the size of the alphabet and m
is the length of the searched string. Being the alphabet's size a constant, we can consider the time required for a successful search to be O(m)
for a string of length m
, and only differs for a constant factor from the trie's homologous.
It is also provable that, for a balanced TST storing n
keys, a search miss requires O(log n)
character comparisons at most (which is relevant for large alphabets, if A * m > n
).
For remove
: it can be performed as a successful search followed by some maintenance (performed during backtracking, it doesn't affect asymptotic analysis), and so its running time is also O(m)
in the best case scenario, and an amortized O(log(n))
for unsuccessful removal.
Finally add
: it's also a search (either successful or unsuccessful) followed by the creation of a node chain with at most m
nodes. Its running time is, then, also O(A*m)
.
Conclusions
TSTs are a valid alternative to tries, trading a slightly worse constant in their running time with an effective saving in the memory used.
Both adhere to the same interface, and allow to implement efficiently some interesting searches on sets of strings.
The way TSTs are implemented, however, allow for a tradeoff between memory (which can be considerably less that the one needed for a trie storing the same set of strings) and speed, where both data structures have the asymptotic behavior, but TSTs are a constant factor slower than tries.
In the next post in the series, we'll talk about a Java implementation for TSTs, so stay tuned because the most interesting material is still to come.
Top comments (0)