Spellcheckers and autocorrect can feel like magic. They’re at the core of everyday applications — our phones, office software, google search (you type in ‘incorect’ and google returns ‘Did you mean incorrect).
So how does this magic happen? And how can we build our own?
First let’s consider what a spellchecker does.
Implementation of the algorithm in my github repo.
In a nutshell, a basic text correction has to have at least 3 components:
The candidate model produces a list of potential correct terms. Potential candidates are created by doing all possible permutations (add, substitute, transpose or remove a character) within a given edit distance from the original word.
This edit distance is basically the measure of how many operations (adding, substituting, transposing or deleting a character) is needed to convert one word into another.
Though a lot of candidates will be produced at this point, not all of them are significant. The candidates must be tested against a corpus of current known terms.
The language model is essentially a corpus of known valid language words, and the frequency / probability of appearing these words. We can usually get a rough sample model from a vast amount of literature from text mining, and then come up with this model.
The selection criteria is what we use to determine which is the “right” word to use as the revised version of the possible candidates we have found. Calculating a score based on the edit distance (the lower the better) and how much the word appears in our language model (the higher the better) will be a potential selection criterion and selecting the candidate word with the highest score.
So now that we know the basics about text correction lets focus in the most important part of these 3 to implement the spellcheker. The selecition criteria.
There are a few different ways to pick candidates:
- The most simplistic approach is to measure the distance you have selected to edit between your term and the entire dictionary. The method, while effective, is extremely costly.
- Phonetic algorithms such as Soundex, Phonex or Metaphone. Such algorithms can transform any string into a short sequence allowing for the indexing of string by pronunciation. Both ‘H600’ will return ‘Hurry’ and ‘Hirry.’ You can easily identify phonetically related candidates by pre-processing the entire vocabulary and indexing it using phonetic code. Rapid on runtime but it just corrects phonetic errors.
- Computing a list of potential misspelling for your word (insertion, omission, transposition or replacement) and matching it to your dictionary. Although this is better than the naive method, due to collection of misspellings increasing in size at a rate of 54 * length+25 it is very sluggish. For further detail see Peter Norvig’s excellent article about spell correction.
- Symmetric spelling correction that takes up the previous idea and expands it by computing mispellings for both the dictionary and the incorrect terminology. This technique is both reliable and quick blazing but at the cost of considerable precomputing and disk space, and includes your dictionary’s frequency list.
Given all these options the one we have chosen to make the correction in our case is the last option. Let’s see what it’s about.
SymsSpell is an algorithm for finding all strings in very short time within a fixed edit distance from a large list of strings. SymSpell derives its speed from the Symmetric Delete spelling correction algorithm and keeps its memory requirement in check by prefix indexing.
The Symmetric Delete spelling correction algorithm reduces the difficulty of the generation of edit candidates and the dictionary quest for the difference in distance. It is six orders of magnitude faster(than the traditional method with deletes + transposes + substitutes + inserts) and language independent.
Opposite to other algorithms only deletes are required, no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!
The speed comes from inexpensive delete-only edit candidate generation and pre-calculation. An average 5 letter word has about 3 million possible spelling errors within a maximum edit distance of 3, but SymSpell needs to generate only 25 deletes to cover them all, both at pre-calculation and at lookup time. Magic!
The idea behind prefix indexing is that the discriminatory power of additional chars is decreasing with word length. Thus by restricting the delete candidate generation to the prefix, we can save space, without sacrificing filter efficiency too much.
Also this algorithm is fast, because a fast index access at search can be obtained by using a hash table with an average search time complexity of O(1).
The SymSpell algorithm exploits the fact that the edit distance between two terms is symmetrical so we can combine both and meet in the middle, by transforming the correct dictionary terms to erroneous strings, and transforming the erroneous input term to the correct strings.
Because adding a char on the dictionary is equivalent to removing a char from the input string and vice versa, we can on both sides restrict the transformation to deletes only.
Another advantage of this algorithm is that has constant time O(1), i.e. independent of the dictionary size (but depending on the average term length and maximum edit distance), because the index is based on a Hash Table which has an average search time complexity of O(1).
Finally, to see the potential of this algorithm, let’s compare it against other approaches. For example, BK-Trees have a search time of O(log dict_size), whereas the SymSpell algorithm is constant time O(1). Another algorithm that is also widely used in spell-checking are Tries. They have a comparable search performance to symspell approach. But a Trie is a prefix tree, which requires a common prefix. This makes it suitable for autocomplete or search suggestions, but not applicable for spell checking. If your typing error is e.g. in the first letter, than you have no common prefix, hence the Trie will not work for spelling correction.
Well if you got here and want to know more about this algorithm and want to see it work you can visit my github repo.
Please share any thoughts or comments you have. Feel free to ask and correct me if I’ve made some mistakes.
Thanks for your time!