Record linkage: theory, algorithms, and tradeoffs
Record linkage identifies which records refer to the same real-world entity across datasets. Theory, algorithms, and modern tradeoffs.
Record linkage is the problem of identifying which records refer to the same real-world entity across one or more datasets. The same person registered twice with different spellings; the same company described differently in two CRMs; the same patient seen at two hospitals. The field is decades old, sits under most data-cleanup work that gets called something else, and has a small set of foundational algorithms that nearly every modern implementation can be traced back to.
It shows up everywhere. CRM deduplication, customer-360 builds, patient matching for health records, fraud detection (linking accounts across signals), sanctions screening (linking applicants to watchlists), product-catalog dedup, lead scoring, vendor-master data, regulatory filings. Newcombe's 1959 paper on medical record linkage and the Fellegi-Sunter 1969 paper formalizing the math are still the references practitioners cite.
This page is the theory hub. Each section below summarizes one part of the problem and links to a deeper spoke.
The deterministic vs probabilistic split
Two paradigms for deciding whether two records match. Deterministic uses fixed rules ("same email = same person"); probabilistic scores how similar two records look and combines those scores into a number. Deterministic is fast and fully explainable but brittle. Probabilistic handles the cases deterministic rules cannot reach but makes mistakes deterministic rules never would. Production systems run both: deterministic on identifier fields as a pre-pass, probabilistic on the remainder.
Read more on deterministic vs probabilistic matching →
Similarity metrics
When the matcher is probabilistic, it needs a way to ask "how similar are these two strings, dates, or numbers?" Five families cover almost everything: edit-distance (Levenshtein) for typo-tolerant short strings, Jaro-Winkler for human names, token-set metrics (Jaccard, cosine) for multi-word fields, phonetic encoders (Soundex, Metaphone) for spelling-variant names, and absolute-difference metrics for ages, prices, and dates. Real systems mix metrics per field rather than picking one for the whole row.
Read more on similarity metrics →
The Fellegi-Sunter model
The math under nearly every modern probabilistic matcher. Each field gets two probabilities: m (agreement given the pair is a true match) and u (agreement given the pair is not). The ratio m/u is the agreement weight; strong identifiers like tax IDs produce huge weights, weak signals like "John" produce small ones. Weights are usually fit via expectation-maximization on labeled or unlabeled data. Splink and goldenmatch both expose EM-trained Fellegi-Sunter weights.
Read more on the Fellegi-Sunter model →
Blocking strategies
Naive record comparison is O(n^2). For 50,000 rows that is 1.25 billion pairs, which is not realistic. Blocking trades a tiny amount of recall for orders-of-magnitude speedup by only comparing records that share a blocking signal (first three letters of last name, soundex code, ZIP). Five common strategies: exact, phonetic, sorted-neighborhood, canopy, and learned blocking. Tighter blocking is faster but misses real duplicates; looser is slower but more recall. The choice depends on data shape.
Read more on blocking strategies →
Clustering
Scoring produces pairs. Clustering turns those pairs into entities. Connected components is the default ("if A matches B and B matches C, treat them as one entity"). Hierarchical and correlation clustering give finer control but cost more. Every cluster ends up confident (auto-merged), ambiguous (sent to review), or rejected. The ratio of confident to ambiguous clusters over time is the main quality signal for a tuned pipeline.
Privacy-preserving record linkage
When two organizations need to match records without revealing raw identifiers (cross-hospital patient matching, sanctions screening, federated data partnerships), the standard approach is privacy-preserving record linkage. Bloom-filter encoding ("cryptographic long-term key" or CLK) is the workhorse: each side hashes q-grams of identifiers into bit vectors that can be compared with Dice or Jaccard similarity. Other primitives: k-anonymity, secure multi-party computation, differential privacy on outputs.
Read more on privacy-preserving record linkage →
Deterministic vs probabilistic at a glance
| Property | Deterministic | Probabilistic |
|---|---|---|
| Decision rule | Fixed predicate over fields | Per-pair score above threshold |
| Precision | Very high when rule is right | Tunable; trades against recall |
| Recall on dirty data | Low | High |
| Explainability | Trivial ("R3 fired on email") | Per-field score breakdown |
| Setup cost | Low | Medium (EM training, threshold tuning) |
| Runtime | Fastest | Acceptable with good blocking |
| Right for | Identifier-rich data with stable keys | Dirty data without reliable identifiers |
Where Golden Suite fits
Golden Suite runs the full pipeline end-to-end: blocking + scoring + clustering with autoconfig defaults that work on a first run, plus the workbench primitives (review queue, audit trail, threshold tuning) that production deployments need. The engine layer is open source as goldenmatch. Try the match playground on your own CSV before committing to anything.