In many domains and applications, we are sometimes led to having to compare several words or phrases and quantify their similarity.
A fairly popular notion for this is the edit distance.
Edit distance? What is it?
Edit distance is a metric used to quantify the dissimilarity between two strings by counting the amount of operation needed to transform one to another. Depending on the algorithm chosen, the operations allowed may include: addition, deletion, substitution and finally translation.
This measurement can then be used to calculate a percentage of similarity between two strings with this formula:
(maxLength - distance) / maxLength
Differents types of edit distance
There are many algorithms for calculating edit distance of two distinct strings.
Among them, the most famous are:
Levenstein distance allowing insertion, deletion and substitution. This is the most famous edit distance algorithm but despite what you may think, it is not always the most appropriate.
Damerau-Levenstein distance allowing addition, deletion, substitution and transposition.
There are two different methods of this algorithm, OSA (Optimal string alignment distance) which allows only a transposition by substring and Adjacent Transpositions which allows it as much transposition as necessary but which requires a fixed alphabet.LCS distance (Longest Common Subsequence) allowing addition and deletion. It's based on Longest Common Subsequence algorithm.
Hamming distance only applicable on strings of the same length, it allows only character substitution.
Jaro distance allowing only transposition and returning a normalized similarity index between 0 and 1.
There is a variant called Jaro-Winkler which takes into account the common prefix of the two strings and which gives it greater weight in the calculation.
The Golang implementation of all of these algorithms with Unicode compatibility are available here :
https://github.com/hbollon/go-edlib
For the rest of this course, we will focus on Levenstein algorithm.
What is it used for?
Edit distance can be used in several application areas such as:
- Fuzzy search algorithms
- Correction of spelling mistakes
- Word completion
- DNA sequence alignment algorithms
- Pattern matching
- And many more!
Levenstein distance
Levensthein is one of the most known edit distance algorithm. It was created by the Soviet mathematician Vladimir Levenshtein, who considered this distance in 1965. It allows insertions, deletions or substitutions.
For example, the Levenshtein distance between “kitten” and “sitting” is 3 since, at a minimum, 3 edits are required to change one into the other.
- kitten → sitten (substitution of “s” for “k”)
- sitten → sittin (substitution of “i” for “e”)
- sittin → sitting (insertion of “g” at the end).
Definition
The Levenshtein distance between two strings is defined by lev a,b(|a|,|b|) with:
- a,b : two input strings
- |a| : length of a
- |b| : length of b
We can find the Levenstein function on Wikipedia
However, it's not easy to understand at first.
Firstly, if one of the input strings is empty then the edit distance is the length of the other.
Else, we will solve the Levenstein algorithm.
In this part, 1(aᵢ≠bⱼ) is equal to 0 when aᵢ≠bⱼ and equal to 1 otherwise. It
- aᵢ refers to the character of string a at position i
- bⱼ refers to the character of string b at position j
We want to check that these are not equal, because if they are equal, no edit is needed, so we shouldn't add 1 to the edit distance. However, if they aren't equal, we want to add 1 to it.
Algorithm and complexity
There are several implementations of Levenshtein's algorithm, however, not all of them are efficient.
We are going to look after the iterative method with full matrix.
The matrix will contain distance between all prefixes of your input strings, then we can compute the matrix values in a dynamic programming style and finally find the distance between the two full strings as the last value computed.
For exemple, we will have this matrix for kitten and sitting:
As we can see, the edit distance between kitten and sitting is 3.
Pascal algorithm
On Wikipedia, you can find a pseudo-code algorithm of this implementation:
function LevenshteinDistance(char s[1..m], char t[1..n]):
// for all i and j, d[i,j] will hold the Levenshtein distance between
// the first i characters of s and the first j characters of t
declare int d[0..m, 0..n]
set each element in d to zero
// source prefixes can be transformed into empty string by
// dropping all characters
for i from 1 to m:
d[i, 0] := i
// target prefixes can be reached from empty source prefix
// by inserting every character
for j from 1 to n:
d[0, j] := j
for j from 1 to n:
for i from 1 to m:
if s[i] = t[j]:
substitutionCost := 0
else:
substitutionCost := 1
d[i, j] := minimum(d[i-1, j] + 1, // deletion
d[i, j-1] + 1, // insertion
d[i-1, j-1] + substitutionCost) // substitution
return d[m, n]
Normally, if you understand what has been said so far, you should roughly understand it. :)
This algorithm has a worst-case complexity of O(n*m), here a quick illustration of that:
Golang implementation
It's all well and good to know all this, but how about implementing it in a real programming language?
For this, we will use Go. It's a modern and compiled language design by Google.
Our implementation will be compatible with Unicode, so you will be able to put emoji or mathematical symbols for example in input strings!
To begin, let's create the function!
func LevenshteinDistance(str1, str2 string) int {
}
It will take two input strings and return an edit distance.
Now, we must convert our strings into rune arrays to be Unicode compatible:
// Convert string parameters to rune arrays to be compatible with non-ASCII
runeStr1 := []rune(str1)
runeStr2 := []rune(str2)
In Go, a rune is an alias for int32 (32-bit integer).
Once that is done, we need to retrieve the length of these strings and start looking if these strings are equal or if any of them is empty.
// Get and store length of these strings
runeStr1len := len(runeStr1)
runeStr2len := len(runeStr2)
if runeStr1len == 0 {
return runeStr2len
} else if runeStr2len == 0 {
return runeStr1len
} else if utils.Equal(runeStr1, runeStr2) {
return 0
}
utils.Equal is a utility function that I have defined to check equality of two rune arrays, you can retrieve it here. Alternatively, you can compare directly strings input.
To be more simple, our implementation will use a simple slice instead of a matrix, but the principle is the same.
Let's create our distance array and initialize it:
// The slice size must be of the length of "runeStr1len+1"
column := make([]int, runeStr1len+1)
for y := 1; y <= runeStr1len; y++ {
column[y] = y
}
Finally, let's compute the column array with the Levenstein algorithm and return the distance between our input strings:
for x := 1; x <= runeStr2len; x++ {
column[0] = x
lastkey := x - 1
for y := 1; y <= runeStr1len; y++ {
oldkey := column[y]
var i int
if runeStr1[y-1] != runeStr2[x-1] {
i = 1
}
column[y] = utils.Min(
utils.Min(column[y]+1, // insert
column[y-1]+1), // delete
lastkey+i) // substitution
lastkey = oldkey
}
}
return column[runeStr1len]
utils.Min, like previously, is a custom function, you can retrieve it here.
That's it! Now you can compute the Levenstein distance between any words or sentences that you want!
If you haven't made any mistakes, you should have this:
// LevenshteinDistance calculate the distance between two string
// This algorithm allow insertions, deletions and substitutions to change one string to the second
// Compatible with non-ASCII characters
func LevenshteinDistance(str1, str2 string) int {
// Convert string parameters to rune arrays to be compatible with non-ASCII
runeStr1 := []rune(str1)
runeStr2 := []rune(str2)
// Get and store length of these strings
runeStr1len := len(runeStr1)
runeStr2len := len(runeStr2)
if runeStr1len == 0 {
return runeStr2len
} else if runeStr2len == 0 {
return runeStr1len
} else if equal(runeStr1, runeStr2) {
return 0
}
column := make([]int, runeStr1len+1)
for y := 1; y <= runeStr1len; y++ {
column[y] = y
}
for x := 1; x <= runeStr2len; x++ {
column[0] = x
lastkey := x - 1
for y := 1; y <= runeStr1len; y++ {
oldkey := column[y]
var i int
if runeStr1[y-1] != runeStr2[x-1] {
i = 1
}
column[y] = min(
min(column[y]+1, // insert
column[y-1]+1), // delete
lastkey+i) // substitution
lastkey = oldkey
}
}
return column[runeStr1len]
}
// Return the smallest integer among the two in parameters
func min(a int, b int) int {
if b < a {
return b
}
return a
}
// Compare two rune arrays and return if they are equals or not
func equal(a, b []rune) bool {
if len(a) != len(b) {
return false
}
for i, v := range a {
if v != b[i] {
return false
}
}
return true
}
I sincerely hope you enjoyed this course and that it is not too messy! As a french student, this my first stories attempt so any feedback are welcome! :)
I've done an edit distance and string comparison open-source library: Go-Edlib.
It implements most popular edit distance algorithms and soon all of them! Currently, it includes: Levenshtein, LCS, Hamming, Damerau-Levenshtein (OSA and Adjacent transpositions algorithms), Jaro/Jaro-Winkler.
All these algorithms have been implemented in such a way as to be fully compatible with Unicode
It also includes fuzzy search algorithms based on edit distance and few others string comparisons functions.
I'm actively looking for feedback and/or contributions to improve this library or have new functionality ideas to add!
Discussion (1)
Interesting read!