DEV Community

Rex
Rex

Posted on

A Naïve Construction of a Suffix Array for an auto-complete function with considerable input size via TDD workflow

What is a suffix array

Wikipedia definition:

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full text indices, data compression algorithms, and the field of bibliometrics.

Suffix Array is a space efficient alternative to Suffix Tree. The construction of a suffix array can be O(n) with an advanced algorithm. It can handle extremely large input, and it makes searching substrings very efficient.

Example

The steps to build a suffix array for the word Banana are:

  1. find a list of all the suffixes. They are:
Suffix index
banana$ 1
anana$ 2
nana$ 3
ana$ 4
na$ 5
a$ 6
$ 7
  1. sort the suffixes. The array of the new sorted indexes is the suffix array for Banana: [7, 6, 4, 2, 1, 5, 3].
Suffix index
$ 7
a$ 6
ana$ 4
anana$ 2
banana$ 1
na$ 5
nana$ 3

Because we can reconstruct all suffixes with the original string and its suffix array, there is no need to store any suffixes; therefore, it is very space-efficient compared to the suffix tree.

Typically, we add a $ sign to the end of the word to separate words i.e. to prevent a suffix of the previous word from being treated as prefixes to the following word. For example, if we are tasked to build a suffix array for Love Banana, we would concatenate the two words into a single string Love$Banana$ and then build the suffix array. The $ sign prevents the program to treat ove as a prefix of banana. Note that we can choose any non-alphabet characters as long as it is smaller than any alphabet.

Auto-complete

  1. Input: An array of terms and a search text
  2. Output: a list of terms containing the given search text
Approach

build a suffix array(naive approach: O(n^2logn), there are ways to build a suffix array at O(n))

  1. concatenate all terms, with $ at the end of each term, to prevent suffix being treated as a prefix to other terms
  2. get all suffixes from the concatenated string
  3. sort the suffixes, remove suffixes starting with $
  4. build a suffix array from the sorted suffixes

binary search the suffix array to find suffixes matching the search text

  1. start in the middle, compare search text with the middle suffix
  2. if smaller, search in the left half
  3. if larger, search in the right half
  4. if found a match to the search text
    • expand and get all adjacent matches
  5. return all matches

Code

Tests

Tests result from the above planning in the approach section and TDD workflow. The tests can also serve as documentation for the implementations.

Implementations

My ultimate goal is to write simple and easy to understand code i.e. require very low cognitive load to read. If you are interested, you can read this awesome blog post by David Whitney Writing good code by understanding cognitive load.

If you feel you need to work out mentally on how something works in the code, I have failed.

Discussion (0)