CIS 163 - Programming I Java - Section 5811
Programming Assignment #2
Due Monday, December 5, 2005

SOUNDEX "Fuzzy" Match Function

Write a routine that converts a word or a name to a four-character Soundex code, as described below. The routine developed in this assignment may well be handy in future Java applications that you write. Soundex is the most commonly used "fuzzy" match algorithm for looking up people's names in lists.

The Soundex algorithm is probably the oldest of the "fuzzy match" algorithms. It was patented twice, in 1918 and again in 1922! It is documented best in Donald Knuth's "The Art of Computer Programming, Volume 3: Sorting and Searching". The basic intent of the Soundex algorithm is to convert an input word, usually a person's name, into a four-character representation (one letter followed by three digits) of how the word "sounds", rather than depending on exact spelling. The Soundex code is a "many to one" code, i.e. many words will convert to the same Soundex code, but there is only one Soundex code for a given word.

The algorithm works as follows:

  1. Convert the input word to all upper case and remove all spaces and non-alphabetic symbols.

  2. Retain the first character of the input word as the first character of the 4 character Soundex code.

  3. Convert each of the characters to a number (an ASCII digit) representing its "sound" (the table of the numbers to use for each of the letters follows later).

  4. Combine each run of adjacent identical digits into a single digit.

  5. Replace the first digit with the original first letter retained in step 2.

  6. Delete all of the zeros.

  7. Keep only the first 4 characters of the string: the initial letter followed by up to three digits.

  8. If the result is shorter than 4 characters, pad it out to four characters with '0's (ASCII zeros). The initial letter and the three digits are the Soundex code for the word.
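
The eight steps above can be sketched in Java roughly as follows. This is only one possible arrangement, not a required solution: the class name Soundex, the method names encode and codeFor, and the choice to return an empty string when the input contains no letters are all assumptions made for illustration (the steps do not say what to do in that case). Letters outside A-Z are simply dropped in step 1.

```java
import java.util.Locale;

/** Sketch of the eight-step Soundex conversion (hypothetical class name). */
final class Soundex {

    /** Returns the 4-character code, or "" if the input has no letters (assumption). */
    static String encode(String word) {
        // Step 1: upper-case the input and keep only the letters A-Z.
        StringBuilder letters = new StringBuilder();
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                letters.append(c);
            }
        }
        if (letters.length() == 0) {
            return ""; // Assumption: this case is not covered by the steps.
        }

        // Step 2: remember the first letter.
        char first = letters.charAt(0);

        // Steps 3 and 4: translate each letter to its digit, skipping any
        // digit that repeats the one immediately before it.
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < letters.length(); i++) {
            char d = codeFor(letters.charAt(i));
            if (digits.length() == 0 || digits.charAt(digits.length() - 1) != d) {
                digits.append(d);
            }
        }

        // Step 5: put the original first letter back in place of the first digit.
        digits.setCharAt(0, first);

        // Steps 6 and 7: copy everything except zeros, stopping at 4 characters.
        StringBuilder code = new StringBuilder();
        for (int i = 0; i < digits.length() && code.length() < 4; i++) {
            if (digits.charAt(i) != '0') {
                code.append(digits.charAt(i));
            }
        }

        // Step 8: pad short codes with ASCII zeros.
        while (code.length() < 4) {
            code.append('0');
        }
        return code.toString();
    }

    /** Maps an upper-case letter A-Z to its code digit. */
    private static char codeFor(char letter) {
        switch (letter) {
            case 'B': case 'F': case 'P': case 'V':
                return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z':
                return '2';
            case 'D': case 'T':
                return '3';
            case 'L':
                return '4';
            case 'M': case 'N':
                return '5';
            case 'R':
                return '6';
            default:
                return '0'; // A, E, I, O, U, H, W, Y
        }
    }
}
```

With this sketch, Soundex.encode("Robert") and Soundex.encode("Rupert") both return "R163", Soundex.encode("Pfister") returns "P236", and Soundex.encode("Lee") returns "L000".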

Code Table:

The following table gives the letters that each code number represents:

Code 0 - A, E, I, O, U, H, W, Y
Code 1 - B, F, P, V
Code 2 - C, G, J, K, Q, S, X, Z
Code 3 - D, T
Code 4 - L
Code 5 - M, N
Code 6 - R
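
One compact way to hold this table in Java is a 26-character string with one digit per letter, indexed by the letter's position in the alphabet. The sketch below assumes that representation; the class name SoundexTable and the method name codeFor are hypothetical, and rejecting non-letters with an exception is a design choice rather than something the table requires.

```java
/** Sketch of the code table as a lookup string (hypothetical class name). */
final class SoundexTable {

    // One digit per letter, in alphabetical order:
    //                                   ABCDEFGHIJKLMNOPQRSTUVWXYZ
    private static final String CODES = "01230120022455012623010202";

    /** Returns the code digit ('0'..'6') for a letter, ignoring case. */
    static char codeFor(char letter) {
        char upper = Character.toUpperCase(letter);
        if (upper < 'A' || upper > 'Z') {
            throw new IllegalArgumentException("not a letter A-Z: " + letter);
        }
        // 'A' maps to index 0, 'B' to index 1, and so on.
        return CODES.charAt(upper - 'A');
    }
}
```

For example, SoundexTable.codeFor('B') returns '1' and SoundexTable.codeFor('h') returns '0'.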

A list of names with their corresponding Soundex codes will be available so that you can verify the output of your Soundex program.

The most familiar application of Soundex is its use by the US Bureau of the Census to create an index for individuals listed in the US census records from 1880 onward. The 1880, 1900, and 1910 US Censuses are indexed on microfilm by the National Archives using the Soundex code. This index was prepared for Social Security purposes in the 1930's. The 1880 Soundex indexes only those households having children ten years of age and younger. The 1900 Soundex indexes all households, and the 1910 Soundex indexes all households in only 21 states.

The Soundex index for a particular state groups all the last names (surnames) by their phonetic Soundex code. In order to use the index, you must know the Soundex code for the surname you are looking for.
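
The same grouping idea can be sketched in Java as a map from code to the list of surnames that share it. Everything named here is hypothetical: the class SurnameIndex, its add and lookup methods, and the encoder parameter, which stands in for whatever Soundex routine you write for this assignment. It is a small in-memory illustration, not a description of how the census index itself is stored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Sketch of a surname index keyed by phonetic code (hypothetical class name). */
final class SurnameIndex {

    // The routine that turns a surname into its code, supplied by the caller.
    private final Function<String, String> encoder;

    // Codes kept in sorted order, each with the surnames filed under it.
    private final Map<String, List<String>> byCode = new TreeMap<>();

    SurnameIndex(Function<String, String> encoder) {
        this.encoder = encoder;
    }

    /** Files a surname under its code. */
    void add(String surname) {
        String code = encoder.apply(surname);
        byCode.computeIfAbsent(code, k -> new ArrayList<>()).add(surname);
    }

    /** Returns a copy of every surname filed under the same code as the query. */
    List<String> lookup(String surname) {
        String code = encoder.apply(surname);
        return new ArrayList<>(byCode.getOrDefault(code, new ArrayList<>()));
    }
}
```

A lookup for a misspelled surname then returns every stored surname that shares its code, which is what makes the match "fuzzy".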