In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full text indices, data compression algorithms, and the field of bibliometrics.
Suffix Array is a space efficient alternative to Suffix Tree. The construction of a suffix array can be O(n) with an advanced algorithm. It can handle extremely large input, and it makes searching substrings very efficient.
The steps to build a suffix array for the word Banana are:
- find a list of all the suffixes. They are:
- sort the suffixes. The array of the new sorted indexes is the suffix array for Banana: [7, 6, 4, 2, 1, 5, 3].
Because we can reconstruct all suffixes with the original string and its suffix array, there is no need to store any suffixes; therefore, it is very space-efficient compared to the suffix tree.
Typically, we add a
$ sign to the end of the word to separate words i.e. to prevent a suffix of the previous word from being treated as prefixes to the following word. For example, if we are tasked to build a suffix array for Love Banana, we would concatenate the two words into a single string
Love$Banana$ and then build the suffix array. The
$ sign prevents the program to treat
ove as a prefix of
banana. Note that we can choose any non-alphabet characters as long as it is smaller than any alphabet.
- Input: An array of terms and a search text
- Output: a list of terms containing the given search text
build a suffix array(naive approach: O(n^2logn), there are ways to build a suffix array at O(n))
- concatenate all terms, with $ at the end of each term, to prevent suffix being treated as a prefix to other terms
- get all suffixes from the concatenated string
- sort the suffixes, remove suffixes starting with $
- build a suffix array from the sorted suffixes
binary search the suffix array to find suffixes matching the search text
- start in the middle, compare search text with the middle suffix
- if smaller, search in the left half
- if larger, search in the right half
- if found a match to the search text
- expand and get all adjacent matches
- return all matches
Tests result from the above planning in the approach section and TDD workflow. The tests can also serve as documentation for the implementations.
My ultimate goal is to write simple and easy to understand code i.e. require very low cognitive load to read. If you are interested, you can read this awesome blog post by David Whitney Writing good code by understanding cognitive load.
If you feel you need to work out mentally on how something works in the code, I have failed.