Record linkage: theory, algorithms, and tradeoffs

Record linkage identifies which records refer to the same real-world entity across datasets. Theory, algorithms, and modern tradeoffs.

Record linkage is the problem of identifying which records refer to the same real-world entity across one or more datasets. The same person registered twice with different spellings; the same company described differently in two CRMs; the same patient seen at two hospitals. The field is decades old, sits under most data-cleanup work that gets called something else, and has a small set of foundational algorithms that nearly every modern implementation can be traced back to.

It shows up everywhere. CRM deduplication, customer-360 builds, patient matching for health records, fraud detection (linking accounts across signals), sanctions screening (linking applicants to watchlists), product-catalog dedup, lead scoring, vendor-master data, regulatory filings. Newcombe's 1959 paper on medical record linkage and the Fellegi-Sunter 1969 paper formalizing the math are still the references practitioners cite.

This page is the theory hub. Each section below summarizes one part of the problem and links to a deeper spoke.

The deterministic vs probabilistic split

Two paradigms for deciding whether two records match. Deterministic uses fixed rules ("same email = same person"); probabilistic scores how similar two records look and combines those scores into a number. Deterministic is fast and fully explainable but brittle. Probabilistic handles the cases deterministic rules cannot reach but makes mistakes deterministic rules never would. Production systems run both: deterministic on identifier fields as a pre-pass, probabilistic on the remainder.

Similarity metrics

When the matcher is probabilistic, it needs a way to ask "how similar are these two strings, dates, or numbers?" Five families cover almost everything: edit-distance (Levenshtein) for typo-tolerant short strings, Jaro-Winkler for human names, token-set metrics (Jaccard, cosine) for multi-word fields, phonetic encoders (Soundex, Metaphone) for spelling-variant names, and absolute-difference metrics for ages, prices, and dates. Real systems mix metrics per field rather than picking one for the whole row.

The Fellegi-Sunter model

The math under nearly every modern probabilistic matcher. Each field gets two probabilities: m (agreement given the pair is a true match) and u (agreement given the pair is not). The ratio m/u is the agreement weight; strong identifiers like tax IDs produce huge weights, weak signals like "John" produce small ones. Weights are usually fit via expectation-maximization on labeled or unlabeled data. Splink and goldenmatch both expose EM-trained Fellegi-Sunter weights.

Read more on the Fellegi-Sunter model →

Blocking strategies

Naive record comparison is O(n^2). For 50,000 rows that is 1.25 billion pairs, which is not realistic. Blocking trades a tiny amount of recall for orders-of-magnitude speedup by only comparing records that share a blocking signal (first three letters of last name, soundex code, ZIP). Five common strategies: exact, phonetic, sorted-neighborhood, canopy, and learned blocking. Tighter blocking is faster but misses real duplicates; looser is slower but more recall. The choice depends on data shape.

Clustering

Scoring produces pairs. Clustering turns those pairs into entities. Connected components is the default ("if A matches B and B matches C, treat them as one entity"). Hierarchical and correlation clustering give finer control but cost more. Every cluster ends up confident (auto-merged), ambiguous (sent to review), or rejected. The ratio of confident to ambiguous clusters over time is the main quality signal for a tuned pipeline.

Privacy-preserving record linkage

When two organizations need to match records without revealing raw identifiers (cross-hospital patient matching, sanctions screening, federated data partnerships), the standard approach is privacy-preserving record linkage. Bloom-filter encoding ("cryptographic long-term key" or CLK) is the workhorse: each side hashes q-grams of identifiers into bit vectors that can be compared with Dice or Jaccard similarity. Other primitives: k-anonymity, secure multi-party computation, differential privacy on outputs.

Deterministic vs probabilistic at a glance

Property	Deterministic	Probabilistic
Decision rule	Fixed predicate over fields	Per-pair score above threshold
Precision	Very high when rule is right	Tunable; trades against recall
Recall on dirty data	Low	High
Explainability	Trivial ("R3 fired on email")	Per-field score breakdown
Setup cost	Low	Medium (EM training, threshold tuning)
Runtime	Fastest	Acceptable with good blocking
Right for	Identifier-rich data with stable keys	Dirty data without reliable identifiers

Where Golden Suite fits

Golden Suite runs the full pipeline end-to-end: blocking + scoring + clustering with autoconfig defaults that work on a first run, plus the workbench primitives (review queue, audit trail, threshold tuning) that production deployments need. The engine layer is open source as goldenmatch. Try the interactive dedupe demo on your own CSV before committing to anything.

Was this page helpful?

Edit this page on GitHub

PreviousTrust tiers NextDeterministic vs probabilistic record matching