Finding similarity using Edit Distance-Information Retrieval

rahul1719 profile image rahul ・2 min read

Edit distance is a way of quantifying how dissimilar two strings are to one another by counting the minimum number of operations required to transform one string into the other.

How to calculate edit distance?

Ex: To calculate the edit distance or cost to convert string “Sunday” to “Saturday“.

——-> Traverse horizontally

Alt Text

Edit distance of converting sunday to saturday is 3.
Edit 'r' in saturday to 'n' in sunday
Delete in saturday 't'
Delete in saturday 'a'

To convert :

Considering the 1st row in the above table

S -> S the cost is 0 (highlighted in green)
S U – > S the cost is 1 (U- deleted)
S U N -> S the cost is 2 ( U N – deleted)
S U N D -> S The cost is 3 (U N D deleted)
S U N D A – >S The cost is 4 (UNDA needs to be deleted)

SUNDAY -> S The cost is 5 ( highlighted in green .UNDAY needs to be deleted)

S –> SA cost is 1
SU –> SA the cost is 1
SUN – SA the cost is 2.

Note : (If the char in the row and column dont match Ex : in the iteration SUN ->SA (N doesn’t match A) then least cost from the upper cell,left cell and the diagonally upper cell +1 to calculate the cost)

So, in the iteration

Scenario 1:

S U N — > S A the cost 2 can be simply calculated by considering the nearby least cell value +1. (In the above table the least cell value highlighted in red and adding 1. least value of the 3 cells is “1”,so adding 1 to it gets us 2 ,which is the cost for replacing SUN – SA ).

Scenario 2 :

SUNDA –>SA Since the last char in the source string (SUNDA) is “A” and is same as the last char in the destination string, the cost is taken from the previous step SUND — >S (highlighted in pink) as there is no cost in replacing A -> A. Hence the cost for SUND –>S and SUNDA –> SA are same as 3.

Pseudo Code:

costMatrix[i][j] = min(costMatrix[i-1][j],costMatrix[i-1][j-1],costMatrix[i][j-1]) +1


markdown guide