Phonetic matching
Phonetic matching finds strings that sound alike but are spelled differently, such as "Cathryn" and "Katherine" or "Schmidt" and "Schmitt".
Phonetic algorithms encode a string into a code based on how it sounds. Two strings with the same phonetic code are treated as potential matches even if their spelling differs significantly.
Common encoders:
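- Soundex: patented in 1918 and later used to index US census records; codes are one letter followed by three digits (for example, "Schmidt" becomes S530). Crude but extremely fast.
- Metaphone / Double Metaphone: rule-based encoders that model English pronunciation more closely than Soundex. Double Metaphone also returns a secondary code for alternative pronunciations, which helps with names of non-English origin; widely used today.
- NYSIIS: the New York State Identification and Intelligence System code (1970), designed as an improvement on Soundex. It produces letter-only codes and is often cited as more accurate on Irish/Hispanic surnames.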
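To make the "one letter + three digits" format concrete, here is a minimal sketch of American Soundex in Python. It assumes ASCII input (any other character is dropped), and the function name `soundex` is only illustrative; production code would normally use a tested library implementation instead.

```python
# Soundex digit for each consonant; vowels, H, W and Y have no digit.
_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if name has no ASCII letters."""
    # Assumption: only A-Z counts; accents, digits and punctuation are dropped.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]                      # the first letter is kept as written
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:        # collapse runs of the same digit
            code += digit
        # H and W are transparent: equal digits separated by them still collapse.
        # Vowels (and Y) reset prev, so equal digits around a vowel are both kept.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]              # pad or truncate to letter + 3 digits


print(soundex("Schmidt"), soundex("Schmitt"))    # S530 S530
print(soundex("Cathryn"), soundex("Katherine"))  # C365 K365
```

Because the first letter is kept verbatim, "Schmidt" and "Schmitt" share a code while "Cathryn" and "Katherine" differ only in their leading letter.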
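Phonetic matching is typically used as a *blocking* signal (group records with the same code, then run finer-grained scoring within each block) rather than as a final-decision scorer. The codes are too coarse to merge on directly but good enough to keep likely matches in the same candidate-pair set. The choice of encoder matters here: Soundex keeps the first letter as written, so it puts "Cathryn" (C365) and "Katherine" (K365) in different blocks, whereas an encoder that normalises the initial sound, such as Metaphone, keeps them together.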
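The blocking step can be sketched as follows. This is an illustration under stated assumptions, not a reference pipeline: the encoder is passed in as a callable, a small lookup table of Soundex codes (`SOUNDEX`, hypothetical) stands in for a real encoder to keep the example short, and `difflib.SequenceMatcher` is just one possible finer-grained scorer.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical stand-in for a real encoder: Soundex codes for the sample names.
SOUNDEX = {
    "Schmidt": "S530", "Schmitt": "S530", "Smith": "S530",
    "Cathryn": "C365", "Catherine": "C365", "Katherine": "K365",
}


def candidate_pairs(names, encode):
    """Group names by phonetic code, then yield every pair inside each block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[encode(name)].append(name)
    for members in blocks.values():
        yield from combinations(members, 2)


def score(a, b):
    """Finer-grained similarity in [0, 1], applied only within a block."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


names = list(SOUNDEX)
pairs = list(candidate_pairs(names, SOUNDEX.get))
ranked = sorted(pairs, key=lambda p: score(*p), reverse=True)

print(len(pairs))   # 4 candidate pairs instead of all 15 possible pairs
print(ranked[0])    # ('Schmidt', 'Schmitt')
```

With these codes, "Katherine" ends up alone in its block and is never compared with "Cathryn" or "Catherine"; swapping in a different encoder only requires changing the `encode` argument.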