Marcello La Rocca

Posted on Oct 10, 2020

Implementing Tries in Java

#algorithms #java #100daysofcode #computerscience

Ever wondered how much memory is wasted when you store plain strings in a container, for instance in a binary search tree (BST)? Well, that heavily depends on how many strings you store, and how much they overlap.
Turns out, when you store strings with shared prefixes, you can save a lot of space by storing any duplicate only once.

That's the idea behind a trie (pronounced as try), a tree-like data structure that allows to more efficiently store and query large sets of strings, assuming many of them share some common prefixes.
Many applications manipulating strings, from spell-checkers to bioinformatics, can and do benefit from using tries.

In this post I am not going to describe tries in detail (you can take a look here for a refresher), but I'll present an overview of the methods they provide, and explain when and why they should be used.

mlarocca / AlgorithmsAndDataStructuresInAction

Advanced Data Structures Implementation

Why Trie? (Pun Intended)

First things first: why do we need tries again? After all, we have plenty of containers for strings, including hash tables and binary search trees (BST).

Balanced trees, in particular, guarantee logarithmic running time in the worst-case for all the main operations and, in the general case, when we don’t know anything about the data we need to store and (later) search, this is really the best we can hope.

But… what happens when we know that we will only store certain types of data in a container? There are several cases where we can leverage the information on the kind of data we need to handle to craft better algorithms than the general-purpose ones.

Take, for instance, sorting: it has been proven that it’s not possible to sort any generic list of n elements with less than O(n*log(n)) comparisons, but if we know that keys are integers in a limited range of k possible values, and k << n, we can use RadixSort, which means achieving an amortized O(n) running time, defying the lower bound for sorting by comparison.

Likewise, when we talk about containers, string containers are a special case that we can optimize both with respect to memory and running time.

How do Tries Work?

If we need to store strings whose characters are from an alphabet ∑ with |∑|=k symbols, we can implement a trie as a k-ary tree where each node has (at most) k children, each child marked with a different character in ∑; links to children can point to another node, or to null.

If we only show non-null links, this is what a trie storing words "a", "an", "and", "anti" etc... looks like:

A notable difference with binary and k-ary search trees is that nodes in a trie don't actually store the keys associated with them. Instead, the characters held in the trie are actually stored in the edges between a node and its children.

Each key is stored as a path in the trie, where each path's links are labeled after the characters in the key; black, fully-filled nodes mark the end of paths corresponding to keys stored in the trie, while empty (lighter) nodes are just intermediate nodes.

Some Java code should help clarify how a trie is implemented.

public class TrieNode {
    Map<Character, TrieNode> children;
    boolean storesKey;

    public TrieNode(String key) {
        children = new HashMap<>();
        if (key.isEmpty()) {
            storesKey = true;
        } else {
            storesKey = false;
            Character character = key.charAt(0);
            children.put(character, new TrieNode(key.substring(1)));
        }
    }

    public TrieNode() {
        children = new HashMap<>();
        storesKey = false;
    }
}

The code in this snippet is simplified by omitting validation and by using the inefficient String::substring method, just for the sake of space and clarity.
You can find a cleaned-up, more efficient implementation on GitHub.

If you'd like to read more about how tries are designed and structured, check out this section on Manning's livebook for "Algorithms and Data Structures in Action".

Memory

The memory saving that a trie can achieve in comparison to a BST depends on many factors, first of all the overhead for objects: since this trie has more nodes, the more this overhead is, the larger the delta will be.

Obviously, however, the shape of the tree and the actual number of nodes also play a big role: turns out, tries are more efficient when holding keys with a large shared prefix.

In the example below we assumed that each link needs a 64-bit pointer/reference, that each object has a 4-byte overhead, and a string with m characters takes m+1 bytes to be stored.

In these particular case, most trie nodes are filled (which means key nodes) and when the ratio of key nodes versus intermediate nodes is higher, intuitively it means that the efficiency of the trie is also higher, because when a path has more than one key node, we are storing at least two words in a single path (one of which is a prefix of the other): in a BST they would require two BST nodes storing both of the strings separately.

Another sign of a more efficient storage is when there are deep nodes branching out: in that case as well, the trie is "compressing" the space needed for two strings with a common prefix by storing the common prefix only once.

Running Time

Performance-wise, how fast is searching a string in a trie?
I'll discuss the search method in the next section, but let's take a look at its performance here - the only information you need at this time is that it's going to be a recursive method.

That's because the number of recursive calls search makes is limited by the smaller of two values: the maximum height of the trie and the length of the search string.
The latter is usually shorter than the former, but either way, for a string of length m we can be sure that no more than O(m) calls are going to be made, regardless of the number of keys stored in the trie.

Since each call can be implemented in such a way to require amortized constant time, the whole search takes O(m) amortized running time, while in a balanced BST storing strings the same search would need O(m*log(n)) amortized running time.

The same analysis can be extended to other operations like insert and remove.

But the advantage in using tries is even more apparent when we consider two particular operations:

Finding the longest prefix of a string s stored in the trie.
Finding all strings in the trie starting with a string s.

Both operations are O(m*n) in a BST, while in a trie the former is O(m), and latter is O(m+n).😱

We'll talk about those operations in detail later, in a different post in this series, but you can check how they are implemented for tries on GitHub.

Search

Assuming we have built a proper trie, how do we check if it contains a certain key?

It’s not that complicated, compared to a BST: the main difference is that we need to walk the tree one character (of the searched key) at the time, following the link that is marked precisely with that character.

Both strings and tries are recursive structures, whose unit of iteration is the single character; a string s, in fact, can be described as either:

The empty string "";
The concatenation of a character c and a string s': s=c+s', where s' is a string one character shorter than s, and can possibly be the empty string.

For instance, the string “word” is made of the character 'w' concatenated to the string “ord”, which in turn is 'o' + “rd” etc…, until we get to “d”, which can be written as the character 'd' concatenated to the empty string “”.

And since both strings and tries are recursive, it’s natural to define the search method recursively; we can restrict to consider just the case where we search the first character of a string s=c+s' starting from the root R of the a trie T.
If c, the first character of s, matches an outgoing of R, then we can search s' in the (sub)trie T'. We assume the root of the sub-trie is not null (it's a reasonable assumption, easy to guarantee in implementations).

If at any point s is the empty string, then we have traversed the whole path in the trie corresponding to s: we then need to check current node to verify if it is stored in the tree or not.

If, instead, at some point current node doesn’t have an outgoing edge matching current character c, then we are sure string s is not stored in the trie.

Code is worth a thousand words, so, let's look at the implementation of search:

public TrieNode search(String key) {
    if (key.isEmpty()) {
        return storesKey ? this : null;
    } else {
        Character character = key.charAt(0);
        if (children.containsKey(character)) {
            return children.get(character).search(key.substring(1));
        } else {
            return null;
        }
    }
}

Again, here I'm trying to provide the simplest, cleanest possible implementation - check out the repo for a more efficient and thorough version.

You can also check out a few examples and a more detailed explanation of this method in this section of the book.

Inserting a New Key

Insertion works similarly to search, it could actually be though of as a partially-successful search followed by a call to the constructor (the variant taking a string and creating a chain of nodes).

private boolean add(String key) {
    if (key.isEmpty()) {
        if (this.storesKey) {
            return false;
        } else {
            this.storesKey = true;
            return true;
        }
    } else {
        Character character = key.charAt(0);
        if (children.containsKey(character)) {
            return children.get(character).add(key.substring(1));
        } else {
            children.put(character, new TrieNode(key.substring(1)));
            return true;
        }
    }
}

Deleting a Key

In its simplest implementation, also remove can be implemented as a successful search followed by a key removal (which, in this implementation, means just setting the storesKey flag to false).

But... we can do better than that: in particular, we can decide to free up any unnecessary space held by nodes in branches of the tree that are not storing keys anymore.

In the example above, you can see that if we remove keys stored in a leaf, we can always free up at least one node, and sometimes many more; when, instead, we delete keys stored in internal nodes, there isn't the possibility to free any space.

I'll talk more diffusely about how to free-up space while deleting keys and how we can decide if we need to implement this clean-up in a later post, when describing TSTs; for the moment, let's take a look at the Java code for remove, with a version including freeing-up unused nodes.

private boolean remove(String key, AtomicBoolean purge) {
    if (key.isEmpty()) {
        if (storesKey) {
            storesKey = false;
            if (children.isEmpty()) {
                // We can always purge leaves
                purge.set(true);
            }
            return true;
        } else {
            return false;
        }
    } else {
        Character character = key.charAt(0);
        if (children.containsKey(character)) {
            boolean deleted = children.get(character).remove(key.substring(1), purge);
            if (deleted && purge.get()) {
                // If the node was deleted and this branch was purged up to this node, we can clean its old link
                children.remove(character);
                if (!children.isEmpty() || this.storesKey) {
                    // If there are other branches in the tree rooted at this node, we can't purge the trie any further
                    purge.set(false);
                }
            }
            return deleted;
        } else {
            return false;
        }
    }
}

Conclusions

This first post is a primer in the world of string containers and prefix trees.

Some takeaways from this post:

Tries take less memory than other general-purpose containers when you need to store many strings with a high degree of overlap: if there are many prefixes shared by two or more strings in the set, then tries are the answer.
Tries allow to implement core operations like search, insert and delete with a number of operations proportional to the length of the input string (regardless of how many keys are stored in the container). For general-purpose containers, the best we can get is O(m*log(n)), when there are n keys stored, and the input string has m characters.
Tries vastly outperform general-purpose containers on two key operations on sets of strings:
- Finding the longest prefix of an input string s;
- Enumerating all the strings starting with a certain prefix.

I feel this puts enough on readers' plate for a single post, but stay tuned, because in the next few posts I'll talk more about the core methods that makes these containers invaluable in many fields, including Bioinformatics: finding the longest prefix of a string and all enumerating all keys that have a string s as their prefix.

DEV Community