Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. It is commonly used with databases to help with searching and is available in many database engines: MySQL has a built-in SOUNDEX() function, and PostgreSQL provides soundex() through its fuzzystrmatch extension. Soundex is not included with SQLite by default, and there may be situations when you want to use it when searching.
Fortunately, the algorithm is not all that difficult. You can read more about Soundex on Wikipedia, but here are the general steps:
- Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
- Replace consonants with digits as follows (after the first letter):
  - b, f, p, v → 1
  - c, g, j, k, q, s, x, z → 2
  - d, t → 3
  - l → 4
  - m, n → 5
  - r → 6
- If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by ‘h’ or ‘w’ are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
- If the word has too few letters to assign three numbers, append zeros until there are three numbers. If you end up with more than three numbers, retain only the first three.
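To make the rules concrete, here is a hand trace of four names. It is only an illustration of the steps as listed above; the resulting codes are the same ones the example function returns further down.

```text
Rubin     R u b i n        -> R, 1, 5                -> R150  (padded with one zero)
Tymczak   T y m c z a k    -> T, 5, 2 (c+z), 2       -> T522  (the vowel a separates z and k, so k is coded again)
Ashcraft  A s h c r a f t  -> A, 2 (s+h+c), 6, 1     -> A261  (h does not separate s and c; t falls past the third number)
Pfister   P f i s t e r    -> P (f merges), 2, 3, 6  -> P236  (f has the same number as the first letter)
```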
In general, implementing this algorithm is pretty straightforward. Here is an example function for the Soundex algorithm using Xojo, which should be pretty easy to translate to other languages as needed:
    Private Function SoundEx(word As Text) As Text
      Const kLength As Integer = 4

      Dim value As Text
      Dim size As Integer = word.Length

      ' Make sure the word is at least two characters in length
      If (size > 1) Then
        word = word.Uppercase

        ' Convert the word to a character array for faster processing
        Dim chars() As Text = word.Split

        ' For storing the SoundEx character codes
        Dim code() As Text

        ' The current and previous character codes
        Dim prevCode As Integer = 0
        Dim currCode As Integer = 0

        ' Add the first character
        code.Append(chars(0))

        Dim loopLimit As Integer = size - 1

        ' Loop through all the characters and convert them to the proper character code
        For i As Integer = 0 To loopLimit
          Select Case chars(i)
          Case "H", "W"
            currCode = -1
          Case "A", "E", "I", "O", "U", "Y"
            currCode = 0
          Case "B", "F", "P", "V"
            currCode = 1
          Case "C", "G", "J", "K", "Q", "S", "X", "Z"
            currCode = 2
          Case "D", "T"
            currCode = 3
          Case "L"
            currCode = 4
          Case "M", "N"
            currCode = 5
          Case "R"
            currCode = 6
          End Select

          If i > 0 Then
            ' Two letters with the same number separated by 'h' or 'w' are coded as a single number
            If currCode = -1 Then currCode = prevCode

            ' Check to see if the current code is the same as the last one
            If currCode <> prevCode Then
              ' Check to see if the current code is 0 (a vowel); do not proceed
              If currCode <> 0 Then
                code.Append(currCode.ToText)
              End If
            End If
          End If

          prevCode = currCode

          ' If the buffer size meets the length limit, then exit the loop
          If (code.Ubound = kLength - 1) Then
            Exit For
          End If
        Next

        ' Pad the code if required
        size = code.Ubound + 1
        For j As Integer = size To kLength - 1
          code.Append("0")
        Next

        ' Set the return value
        value = Text.Join(code, "")
      End If

      ' Return the computed soundex
      Return value
    End Function
You call the SoundEx function like this:
    Dim result As Text
    result = SoundEx("Robert")   ' R163
    result = SoundEx("Rupert")   ' R163
    result = SoundEx("Rubin")    ' R150
    result = SoundEx("Ashcraft") ' A261
    result = SoundEx("Ashcroft") ' A261
    result = SoundEx("Tymczak")  ' T522
    result = SoundEx("Pfister")  ' P236
By saving the SoundEx results (in SQLite, JSON, or wherever), you can later compare them with the SoundEx codes of search terms to find names that sound alike, which makes for better searching.
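As a rough sketch of that idea with SQLite, the method below stores each name next to its precomputed code and then searches by the code of the search term. Several things here are assumptions for illustration only: the method name SoundExSearchDemo, the people table and its name and sound_code columns, and the use of an in-memory database. It also assumes the SoundEx function above is in scope and uses the classic SQLiteDatabase and RecordSet calls; newer Xojo releases rename these (ExecuteSQL, SelectSQL, RowSet), so adjust to your version.

```xojo
Private Sub SoundExSearchDemo()
  ' No DatabaseFile is assigned, so Connect opens an in-memory database
  Dim db As New SQLiteDatabase
  If Not db.Connect Then Return

  ' Hypothetical table: one row per name, with its SoundEx code stored alongside
  db.SQLExecute("CREATE TABLE people (name TEXT, sound_code TEXT)")
  If db.Error Then Return

  Dim insert As SQLitePreparedStatement = _
    db.Prepare("INSERT INTO people (name, sound_code) VALUES (?, ?)")
  insert.BindType(0, SQLitePreparedStatement.SQLITE_TEXT)
  insert.BindType(1, SQLitePreparedStatement.SQLITE_TEXT)

  ' Compute each code once, when the row is saved
  Dim names() As String = Array("Robert", "Rubin", "Ashcraft")
  For Each name As String In names
    Dim code As String = SoundEx(name.ToText)
    insert.SQLExecute(name, code)
  Next

  ' Search by the code of the search term instead of by its exact spelling
  Dim find As SQLitePreparedStatement = _
    db.Prepare("SELECT name FROM people WHERE sound_code = ?")
  find.BindType(0, SQLitePreparedStatement.SQLITE_TEXT)

  Dim target As String = SoundEx("Rupert")
  Dim rs As RecordSet = find.SQLSelect(target)
  If rs <> Nil Then
    While Not rs.EOF
      System.DebugLog(rs.Field("name").StringValue)
      rs.MoveNext
    Wend
  End If
End Sub
```

Going by the codes listed earlier, "Rupert" and "Robert" both encode to R163, so a search for "Rupert" should match the stored "Robert" row and skip "Rubin" (R150) and "Ashcraft" (A261). Storing the code in its own column means it is computed only once per name, and the column can be indexed like any other.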