Similarity metrics for record matching
Jaro-Winkler, Levenshtein, Jaccard, cosine, and phonetic metrics: what each measures, when to use it, and tradeoffs for record matching.
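A similarity metric is a function that takes two field values and returns a score between 0 and 1, where 1 means the values are identical and 0 means they have nothing in common. Raw distances, such as an edit count or an absolute difference, are rescaled into that range. In record matching, these scores are the raw material the scorer uses to decide whether two rows refer to the same entity.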
The five families
- Edit-distance (Levenshtein). Counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. "Smith" vs "Smyth" is 1 edit. Good fit for short identifiers, names, and codes where the error is a typo or transcription slip.
- Jaro-Winkler. Built on the Jaro score, which counts matching characters and transpositions, with an extra boost for shared prefixes. Scores "MARTHA" vs "MARHTA" higher than normalized edit-distance would, because Levenshtein counts the swapped letters as two edits. The prefix boost makes it the default choice for human names, where first-letter errors are rare and transpositions are common.
- Token-set (Jaccard / cosine). Splits each string into tokens and measures overlap, ignoring order. Jaccard compares the two token sets; cosine compares token-count vectors. "John Smith Enterprises" and "Enterprises, Smith John" score at or near 1.0, depending on how punctuation is tokenized. Right tool for addresses, company names, and any multi-word field where word order is unreliable.
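- Phonetic (Soundex / Metaphone). Encodes strings by how they sound, not how they are spelled. Metaphone encodes an initial hard C and a K alike, so "Cathryn" and "Katherine" collapse to the same code. Soundex keeps the first letter, so it yields C365 and K365 and misses that pair. Useful for name fields where spelling variants are wide. Christen (2012) covers phonetic encoding trade-offs in depth.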
- Numeric / date metrics. Absolute difference or percentage difference between numbers or timestamps. Age 32 vs age 33 scores higher than age 32 vs age 62. Needed for DOB, price, and any field where magnitude matters.
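The families above can be sketched in plain Python to make the examples concrete. This is a minimal illustration under stated assumptions, not the Golden Suite implementation: the function names, the regex tokenizer, and the `scale` parameter used to map a numeric gap into the 0 to 1 range are all chosen for the example. The phonetic family is omitted to keep the sketch short.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Rescale the edit count into 0..1 by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaro(a: str, b: str) -> float:
    """Jaro score: matching characters within a window, minus transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [ch for ch, hit in zip(a, a_hit) if hit]
    b_seq = [ch for ch, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted by the length of the shared prefix (up to max_prefix)."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1.0 - base)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; lowercases and drops punctuation (an assumption)."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def numeric_similarity(x: float, y: float, scale: float) -> float:
    """1.0 at equality, falling linearly to 0.0 once the gap reaches `scale`."""
    return max(0.0, 1.0 - abs(x - y) / scale)


if __name__ == "__main__":
    print(levenshtein("Smith", "Smyth"))                 # 1
    print(levenshtein_similarity("MARTHA", "MARHTA"))    # about 0.667
    print(jaro_winkler("MARTHA", "MARHTA"))              # about 0.961
    print(jaccard("John Smith Enterprises", "Enterprises, Smith John"))  # 1.0
    print(numeric_similarity(32, 33, scale=50), numeric_similarity(32, 62, scale=50))
```

In production a maintained string-matching library is usually a better choice than hand-rolled implementations; the point here is only to show what each score is counting.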
Picking a metric
Short fields like names and product codes respond well to character-level metrics such as Levenshtein or Jaro-Winkler, and name fields also to a phonetic encoder; the strings are compact and errors are typically a single character or pronunciation variant. Long fields like addresses and descriptions are better served by token-set metrics, which tolerate word-order differences and partial matches without being thrown off by extra tokens. In practice you will mix approaches: Jaro-Winkler on first_name and last_name, Jaccard on company, exact match on email, and absolute difference on birth_year. Per-field scoring is normal and expected.
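A rough sketch of per-field scoring, assuming a simple weighted average as the way to combine fields. The weights, the `FIELD_METRICS` table, and the `score_pair` helper are hypothetical, and the standard library's `difflib.SequenceMatcher` ratio stands in for Jaro-Winkler so the example stays self-contained. It is not a description of how the Golden Suite scorer combines fields.

```python
from difflib import SequenceMatcher


def string_ratio(a: str, b: str) -> float:
    # Stand-in for Jaro-Winkler: difflib's ratio, case-insensitive.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    # Order-insensitive overlap of whitespace-separated tokens.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def exact(a, b) -> float:
    return 1.0 if a == b else 0.0


def year_closeness(a: int, b: int, scale: int = 10) -> float:
    # Absolute difference mapped to 0..1; `scale` is an assumed cutoff in years.
    return max(0.0, 1.0 - abs(a - b) / scale)


# field -> (metric, weight); weights are illustrative only
FIELD_METRICS = {
    "first_name": (string_ratio, 1.0),
    "last_name": (string_ratio, 1.5),
    "company": (token_jaccard, 1.0),
    "email": (exact, 2.0),
    "birth_year": (year_closeness, 0.5),
}


def score_pair(left: dict, right: dict):
    """Return per-field scores and their weighted average."""
    per_field = {
        field: metric(left[field], right[field])
        for field, (metric, _) in FIELD_METRICS.items()
    }
    total_weight = sum(weight for _, weight in FIELD_METRICS.values())
    overall = sum(
        per_field[field] * weight for field, (_, weight) in FIELD_METRICS.items()
    ) / total_weight
    return per_field, overall


left = {"first_name": "Jon", "last_name": "Smith", "company": "Smith Enterprises",
        "email": "jon@example.com", "birth_year": 1980}
right = {"first_name": "John", "last_name": "Smyth", "company": "Enterprises Smith",
         "email": "jon@example.com", "birth_year": 1981}

if __name__ == "__main__":
    per_field, overall = score_pair(left, right)
    print(per_field)
    print(round(overall, 3))
```

Keeping the per-field scores alongside the overall number makes it easier to see which field is pulling a pair above or below the match threshold.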
In Golden Suite
goldenmatch autoconfig inspects each column's data distribution and proposes the metric most likely to surface true matches. You can override any field's metric in the workbench config panel before dispatching. The demoted-scorer rate in /admin/health flags when a chosen metric is contributing less than expected to final scores over the past 7 days, which is an early signal to revisit the metric choice for that field.
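The heuristics behind goldenmatch autoconfig are not described here. Purely as an illustration of the general idea of proposing a metric from a column's contents, a hypothetical chooser might look like the sketch below. The `suggest_metric` function, its rules, and the returned labels are assumptions made for this example and follow the guidance in "Picking a metric", not the product's actual logic.

```python
from statistics import mean


def suggest_metric(values) -> str:
    """Propose a metric label for a column from a sample of its values."""
    sample = [v for v in values if v not in (None, "")]
    if not sample:
        return "exact"
    # Purely numeric columns: compare by magnitude.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in sample):
        return "numeric_difference"
    texts = [str(v) for v in sample]
    # Identifier-like values such as emails: exact match.
    if all("@" in t for t in texts):
        return "exact"
    # Multi-word values: token-set overlap tolerates reordering.
    if mean(len(t.split()) for t in texts) >= 2:
        return "jaccard"
    # Short single-token strings: character-level metric.
    return "jaro_winkler"


if __name__ == "__main__":
    print(suggest_metric([1980, 1975, 1992]))                        # numeric_difference
    print(suggest_metric(["Acme Holdings Ltd", "Smith Enterprises"]))  # jaccard
    print(suggest_metric(["Martha", "Jon"]))                         # jaro_winkler
    print(suggest_metric(["jon@example.com"]))                       # exact
```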