The Fellegi-Sunter model of record linkage

The 1969 Fellegi-Sunter model uses m and u probabilities to score record pairs. It is still the foundation of modern probabilistic entity resolution.

The 1969 Fellegi-Sunter model is the math underneath most modern probabilistic record linkage. It formalizes one question: given a pair of records that agree on some fields and disagree on others, how likely is it that the pair is a true match?

m and u probabilities

Every field in a comparison gets two numbers.

m_field is the probability of agreement on that field given the pair is a true match. Think of it as the true-positive rate per field. A tax ID should agree almost every time two records are really the same entity, so m_taxid is close to 1.0.

u_field is the probability of agreement on that field given the pair is NOT a true match. Think of it as the random-agreement rate. Even among non-matches, some fields will agree by coincidence.

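The ratio m/u is the likelihood ratio for agreement on a field, and its logarithm (conventionally base 2) is the agreement weight. Strong identifiers, like a rare last name or an exact tax ID, have m near 1 and u near 0, producing a large positive weight. Weak signals, like the first name "John" or a high-density ZIP code, have m near 1 but u somewhere in the 0.05 to 0.10 range, producing a much smaller weight. When a field disagrees, the ratio is (1 - m)/(1 - u) instead, which is below 1 whenever m exceeds u and so gives a negative weight. Assuming fields are conditionally independent given match status, the ratios multiply across all fields in a pair (equivalently, the log weights add), so a single high-weight identifier can dominate.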
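As a rough illustration of how per-field weights turn into a pair score, the sketch below computes base-2 log weights and sums them. The m and u values in FIELDS are made up for the example, and the names field_weights and pair_score are hypothetical helpers, not part of any library.

```python
import math

def field_weights(m, u):
    """Return (agreement, disagreement) log2 weights for one field."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree

# Hypothetical m and u values, chosen only for illustration.
FIELDS = {
    "tax_id": (0.99, 0.0001),
    "first_name": (0.95, 0.05),
    "zip": (0.90, 0.10),
}

def pair_score(agreements):
    """Sum log2 weights for one pair, given {field: True/False agreement}.

    Summing logs is the same as multiplying the likelihood ratios, which
    relies on fields being conditionally independent given match status.
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        agree_w, disagree_w = field_weights(m, u)
        total += agree_w if agrees else disagree_w
    return total

# Tax ID agrees, the weak fields disagree.
strong = pair_score({"tax_id": True, "first_name": False, "zip": False})
# Tax ID disagrees, the weak fields agree.
weak = pair_score({"tax_id": False, "first_name": True, "zip": True})

print(f"tax ID only: {strong:.2f}, weak fields only: {weak:.2f}")
```

With these assumed values, agreement on the tax ID alone outscores agreement on both weak fields combined, which is the "single identifier can dominate" effect described above.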

Source: Fellegi & Sunter (1969). "A Theory for Record Linkage." Journal of the American Statistical Association.

Estimating m and u

These probabilities are usually unknown up front. The standard approach is expectation-maximization (EM): the algorithm alternates between assigning soft match/non-match labels to pairs and re-estimating m and u to fit those labels, repeating until the estimates stop changing. EM only finds a local optimum, so starting values matter. It needs no labeled pairs, which is what makes it practical.
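The alternation can be sketched in a few lines of plain Python. This is a minimal illustration, not how Splink or goldenmatch implement training: it assumes binary agree/disagree comparisons, conditionally independent fields, and arbitrary starting values (m = 0.9, u = 0.1, match proportion 0.1). The function name em_m_u and the toy PATTERNS data are invented for the example.

```python
def clamp(x, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid degenerate values."""
    return min(max(x, eps), 1 - eps)

def em_m_u(patterns, iterations=50, p=0.1):
    """Estimate m, u and the match proportion p from agreement vectors.

    patterns: list of tuples of 0/1, one tuple per candidate pair,
    one entry per field (1 = the field agrees).
    """
    n_fields = len(patterns[0])
    m = [0.9] * n_fields  # starting guesses, not learned values
    u = [0.1] * n_fields
    for _ in range(iterations):
        # E-step: posterior probability that each pair is a match.
        g = []
        for gamma in patterns:
            pm, pu = p, 1 - p
            for k, agrees in enumerate(gamma):
                pm *= m[k] if agrees else 1 - m[k]
                pu *= u[k] if agrees else 1 - u[k]
            g.append(pm / (pm + pu))
        # M-step: re-estimate m, u and p from the soft labels.
        match_mass = sum(g)
        non_mass = len(g) - match_mass
        for k in range(n_fields):
            agree_match = sum(gi for gi, gamma in zip(g, patterns) if gamma[k])
            agree_non = sum(1 - gi for gi, gamma in zip(g, patterns) if gamma[k])
            m[k] = clamp(agree_match / match_mass)
            u[k] = clamp(agree_non / non_mass)
        p = clamp(match_mass / len(g))
    return m, u, p

# Toy agreement vectors over three fields (made-up counts).
PATTERNS = (
    [(1, 1, 1)] * 20 + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 5
    + [(1, 0, 0)] * 10 + [(0, 1, 0)] * 30 + [(0, 0, 1)] * 30
    + [(0, 0, 0)] * 200
)

m_hat, u_hat, p_hat = em_m_u(PATTERNS)
print("m:", [round(x, 3) for x in m_hat])
print("u:", [round(x, 3) for x in u_hat])
print("match proportion:", round(p_hat, 3))
```

No pair in PATTERNS carries a label. The match and non-match groups emerge only from which agreement patterns tend to occur together.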

Both Splink and goldenmatch expose EM-trained weights. One important caveat: the estimates are sensitive to blocking choices. If blocking is too tight, the candidate pool skews toward near-matches and u is biased upward, which shrinks m/u and makes the resulting weights unreliable.
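A toy example makes the upward bias concrete. The sketch below uses hypothetical records with known entity IDs (something real unlabeled data would not have) and compares the random-agreement rate for a city field across all non-matching pairs against the rate among pairs that survive blocking on ZIP code. The data and the u_city helper are invented for illustration, and the city/ZIP correlation is deliberately extreme.

```python
from itertools import combinations

# Hypothetical records: (entity_id, zip, city). Every record is a distinct
# entity, so every pair below is a true non-match.
RECORDS = [
    (1, "10001", "New York"),
    (2, "10001", "New York"),
    (3, "10001", "New York"),
    (4, "60601", "Chicago"),
    (5, "60601", "Chicago"),
    (6, "94105", "San Francisco"),
]

def u_city(pairs):
    """Share of non-matching pairs whose city values agree."""
    non_matches = [(a, b) for a, b in pairs if a[0] != b[0]]
    agree = sum(1 for a, b in non_matches if a[2] == b[2])
    return agree / len(non_matches)

all_pairs = list(combinations(RECORDS, 2))
# Blocking rule: only compare records that share a ZIP code.
blocked_pairs = [(a, b) for a, b in all_pairs if a[1] == b[1]]

u_all = u_city(all_pairs)
u_blocked = u_city(blocked_pairs)

print(f"u over all pairs: {u_all:.3f}")
print(f"u within ZIP blocks: {u_blocked:.3f}")
```

Because city is almost determined by ZIP, every blocked non-match agrees on city, so a u estimated only from blocked pairs is far higher than the population rate.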

In Golden Suite

goldenmatch ships an EM trainer and surfaces the learned weights in the postflight report. The autoconfig path picks reasonable blocking signals and initial field settings, so first-run users get credible weights without touching the EM step directly. You can inspect and override any weight from the workbench once the run completes.