Soundex, the phonetic matching algorithm revisited

According to Wikipedia, Soundex is a phonetic algorithm for indexing names by sound as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

In simple terms, we just segregate words by their pronunciation.

Soundex was developed by Robert C. Russell and Margaret King Odell. It was later adopted for US census with modifications and was termed as American Soundex. Currently the rules are defined and maintained by National Archives. You can find details here.

https://www.archives.gov/research/census/soundex

Uses of Soundex

Currently the largest use of Soundex is for US census to locate information about a person by name. Normally the entire family is coded under name of the head of household. The other sites that consistently use Soundex is where genealogy is maintained. Primarily it is used for comparing family names. This is because same last names may be spelled in different ways. Using Soundex we can group them into similar buckets. For example Smith and Smyth may actually be the same last name.

Every database in use today provides Soundex for searching similar last names. You are of course not restricted to last name search. Any word can be searched with similar pronunciations. It can be useful when you have spelled a word incorrectly, correct words can be recommended.

Coding Rules of Soundex

Soundex codes are denoted using four characters. The first one is the first letter of the word. Subsequent letters are marked with a number based on phonetic. Following table represents the phonetic numbering chart. It is based on how each of these letters are pronounced.

LettersSoundex Code
B, F, P, V1
C, G, J, K, Q, S, X, Z2
D, T3
L4
M, N5
R6

We disregard the letters A, E, I, O, U, H, W, and Y. These words do not play any significant role in pronunciation.

Additionally, American Soundex adds following criteria,

  • Remove any letters that repeat more than once. E.g. code of Gutierrez is G-362 (G, 3 for the T, 6 for the first R, second R ignored, 2 for the Z).
  • Remove any letters that are side by side and has the same Soundex code. E,g. code of Pfister is P236 (P, F ignored, 2 for the S, 3 for the T, 6 for the R).
  • Remove some common prefixes from names viz. Van, Con, De, Di, La, or Le. E.g. Code for VanDeusen is D250 or V532. This depends on if we remove or do not remove the prefixes.
    • American Soundex codes them under both values.
  • Do not ignore repeating consonants separated by vowels (A, E, I, O, U). E.g. code for Tymczak is T522 (T, 5 for the M, 2 for the C, Z ignored (see “Side-by-Side” rule above), 2 for the K).
  • Ignore consonants separated by H and W. E.g. Code for Ashcraft is A261 (A, 2 for the S, C ignored, 6 for the R, 1 for the F).

Finally if the code is less than 4 characters, pad zeroes. If more, then we truncate after four characters.

Start Coding

Now that we have the algorithm out of the way, let’s start coding it. We will build a Python program as that is quicker than rest. Firstly we will start with a base class.

class Soundex:
    """
        Create a dictionary that will be used
        for the soundex codes
    """
    soundex_dict = {
        "BFPV": "1",
        "CGJKQSXZ": "2",
        "DT": "3",
        "L": "4",
        "MN": "5",
        "R": "6"
    }

    """
        Other globals
    """
    pref_3letter = ["VAN", "CON"]
    pref_2letter = ["DI", "DE", "LA", "LE"]

We are keeping couple of globals in the program. The first global just defines the table we mentioned above. Also, we will use the second set of globals later to remove some common last name prefixes.

Next, let’s write a method to just return the Soundex code for a letter.

"""
    Get Soundex for a Letter
"""
def get_soundex(self, letter) -> str:
    for k in self.soundex_dict.keys():
        if letter in k:
            return self.soundex_dict[k]
        
    return "-1"

Using the code above, we can just send a letter to get the Soundex code. It iterates over all the keys and returns the value if it finds one. If it cannot find a value, it returns -1.

Soundex, finally

With helper functions out of the way, let’s start building the real API.

"""
    National Archive variations for last name encoding.
    1. Double letters are removed
    2. Side by side letters with same soundex are removed
    3. Names with prefixes like Van, Con, Di, De, La, Le removed
        4. If a vowel separates two consonants with same code, only keep the right
        5. If "H" or "W" separate two consonants then, the value on right is not coded

        Simplifying, so we get the same result,
        1. Remove Van, Con, Di, De, La, Le is they are first letters
        2. Remove all vowels (unless it is the first one)
        3. If two consecutive letters have same code, remove right one
        4. If there is a vowel between two consonants, code both of them
    """
    def generate_soundex(self, word, rmpfx) -> str:
        word = word.upper()
        if rmpfx:
            if word.startswith(tuple(self.pref_3letter)):
                word = word[3:]
            elif word.startswith(tuple(self.pref_2letter)):
                word = word[2:]

        ret = word[0]
        last_sx = self.get_soundex(ret)
        for c in word[1:]:
            if c in "HW":
                # Right not coded
                continue

            v = self.get_soundex(c)
            if v!= "-1" and v != last_sx:
                ret += v
            last_sx = v

        if len(ret) < 4:
            ret = ret.ljus(4, '0')
        return ret[:4]

From line #18 through #21 we are removing common prefixes. Like I said before, National Archive files these names under both codes. For example, french last name LeBorgne may use Le as a prefix, but the Chinese last name Lee definitely does not. So, if we remove prefix from Lee, we will code as E000. However, the correct code in this case would be L000. I do not have a comprehensive list of last names to segregate this, so I did the next best thing. I have a parameter that we can set to enable/disable prefix modifications.

Next we keep on iterating over letters in words and start building the Soundex code. As we have defined before, H and W follow a special case. If H or W separates two consonants with same code, we ignore the right hand letter. We do this check on line #26.

Finally, starting line #35, we pad or truncate the code formed and send back to caller.

Testing

Let’s write some test cases in main. Any Python purist will ask why I have not used asserts as I am comparing values. I just wanted to see the values generated, no other reason. Also don’t question why I am using two arrays here. I added the second array more as an after-thought :). So, my main looks as follows,

if __name__ == "__main__":
    last_names = [
        "Gutierrez", "Pfister", "Jackson", "VanDeusen", "Tymczak", "Ashcraft",
        "Wohrzhick", "Warzick", "Smyth", "Washington", "Lee", "Scott"
    ]
    last_names_sx = [
        "G362", "P236", "J250", "D250", "T522", "A261", 
        "W622", "W622", "S530", "W252", "L000", "S300"
    ]

    sx = Soundex()
    for idx, lname in enumerate(last_names):
        sx_val = sx.generate_soundex(lname, True)
        sx_exp = last_names_sx[idx]
        print("Soundex for {}: {}, expected: {} {}".format(
              lname, sx_val, sx_exp, "" if sx_val == sx_exp else "<---- ERROR")
        )

We will remove the prefixes while we try to calculate the code. I have taken these Soundex codes from National Archives page and some from a Princeton university site. Let’s see how we performed.

Soundex for Gutierrez: G362, expected: G362 
Soundex for Pfister: P236, expected: P236
Soundex for Jackson: J250, expected: J250
Soundex for VanDeusen: D250, expected: D250
Soundex for Tymczak: T522, expected: T522
Soundex for Ashcraft: A261, expected: A261
Soundex for Wohrzhick: W622, expected: W622
Soundex for Warzick: W622, expected: W622
Soundex for Smyth: S530, expected: S530
Soundex for Washington: W252, expected: W252
Soundex for Lee: E000, expected: L000 <---- ERROR
Soundex for Scott: S300, expected: S300

Not bad, but we failed on one. This is because for Lee, we have assumed Le to be a prefix and removed it. Let’s try with prefix removal disabled. In this case we know that we should fail on VanDeusen as the validation code is with Van removed.

Soundex for Gutierrez: G362, expected: G362 
Soundex for Pfister: P236, expected: P236
Soundex for Jackson: J250, expected: J250
Soundex for VanDeusen: V532, expected: D250 <---- ERROR
Soundex for Tymczak: T522, expected: T522
Soundex for Ashcraft: A261, expected: A261
Soundex for Wohrzhick: W622, expected: W622
Soundex for Warzick: W622, expected: W622
Soundex for Smyth: S530, expected: S530
Soundex for Washington: W252, expected: W252
Soundex for Lee: L000, expected: L000
Soundex for Scott: S300, expected: S300

As expected, we failed on VanDeusen. However, this is a more true state as this last name will be filed both under D250 and V532.

Difference

We cannot end the blog without talking about difference. DBs provide this method for Soundex comparison. How does difference work? Difference provides a similarity score between two words. It compares each letter of the generated code, and for each match adds a 1. Let’s take an example.

Code for SMITH is S530. Same code also applies for SMYTH. So, the similarity score is 4.

Code for BARNEY is B650 and that of BONNIE is B500. Position 1 and 4 matches. So, similarity score is 2.

Code for MICKEY is M200 and that of HAROLD is H643. Nothing matches in this case, so similarity score is 0.

So, what we see here is that DIFFERENCE provides the similarity score. When the score is 0, it did not match. A score of 4 will be a very good match.

"""
    Find the difference - 
    0: different -> 4: similar
"""
def difference(self, word1, word2) -> int:
    sdx1 = self.generate_soundex(word1, False)
    sdx2 = self.generate_soundex(word2, False)
    print("Soundex: {} -> {} & {} -> {}".format(word1, sdx1, word2, sdx2))
    idx = 0
    for x in range(len(sdx1)):
        if sdx1[x] == sdx2[x]:
            idx += 1
    return idx

This code is pretty straight forward. Calculate the Soundex values for the words, and compare each position. Finally return the count. Let’s run the code for values we discussed earlier.

Soundex: SMITH -> S530 & SMYTH -> S530
Difference: 4
Soundex: BARNEY -> B650 & BONNIE -> B500
Difference: 2
Soundex: MICKEY -> M200 & HAROLD -> H643
Difference: 0

Conclusion

That’s Soundex for you. We only use it in very specific scenarios, but it is always a good algorithm to keep under your belt. Ciao for now!