Clustering in record linkage and entity resolution

How clustering turns pairwise match scores into entity groups. Connected components, hierarchical, and correlation clustering compared.

Scoring produces pairs. Clustering produces entities. The scorer says "A matches B at 0.92" and "B matches C at 0.88" but stays silent on whether A, B, and C are all the same entity. Clustering answers that question by collapsing the pairwise graph into connected groups.

Three algorithms

  • Connected components. The simplest and most common choice. Treat every pair above the match threshold as an edge; every connected subgraph is one entity. Fast, scales to millions of records, no parameters beyond the threshold. The downside is forced transitivity: if A-B and B-C both score 0.85 but A-C scores only 0.30, connected components still merges all three, because it looks only at whether a path of edges exists and never at the pairs that failed to match.
  • Hierarchical (agglomerative). Build a dendrogram by repeatedly merging the two closest clusters, starting from one cluster per record, then cut the tree at a chosen height. Gives finer control over cluster shape and exposes a knob that connected components lacks: under complete linkage, the cut height acts as a "max diameter inside a cluster". Slower: O(n^2 log n) with a priority queue, O(n^3) in a naive implementation.
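  • Correlation clustering. Treats both high-confidence matches and high-confidence non-matches as signals and minimizes the number of disagreements: matched pairs split across clusters plus non-matched pairs placed in the same cluster. The number of clusters falls out of the optimization instead of being set in advance. The objective is the most principled of the three, but solving it exactly is NP-hard; production systems use approximation algorithms.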
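
To make the contrast concrete, here is a minimal sketch of the three ideas on a toy score table. Everything in it is an assumption for illustration: the scores, the 0.8 threshold, the function names, and the rule that unscored pairs count as 0.0. The hierarchical variant uses complete linkage and simply stops merging at the cut height instead of building the full dendrogram, and the correlation-clustering part only counts disagreements for a given partition (treating every pair below the threshold as a non-match signal); it does not search for the optimum.

```python
from itertools import combinations

# Hypothetical pairwise scores; unlisted pairs are treated as 0.0 (an assumption).
SCORES = {
    ("A", "B"): 0.85,
    ("B", "C"): 0.85,
    ("A", "C"): 0.30,
    ("D", "E"): 0.95,
}
RECORDS = ["A", "B", "C", "D", "E"]

def score(a, b):
    """Look up a symmetric pair score, defaulting to 0.0 for unscored pairs."""
    return SCORES.get((a, b), SCORES.get((b, a), 0.0))

def connected_components(records, threshold):
    """Union-find: every pair at or above the threshold becomes an edge."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if score(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return sorted(sorted(g) for g in groups.values())

def complete_linkage(records, min_internal_score):
    """Merge the two clusters whose weakest cross pair is strongest, and stop
    once no merge keeps every internal pair at or above min_internal_score."""
    clusters = [{r} for r in records]
    while True:
        best_score, best_pair = None, None
        for i, j in combinations(range(len(clusters)), 2):
            weakest = min(score(a, b) for a in clusters[i] for b in clusters[j])
            if weakest >= min_internal_score and (best_score is None or weakest > best_score):
                best_score, best_pair = weakest, (i, j)
        if best_pair is None:
            break
        i, j = best_pair  # i < j, so deleting j leaves index i valid
        clusters[i] |= clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

def disagreements(clusters, records, threshold):
    """Count matched pairs that are split plus non-matched pairs that are merged."""
    label = {r: i for i, c in enumerate(clusters) for r in c}
    cost = 0
    for a, b in combinations(records, 2):
        is_match = score(a, b) >= threshold
        same_cluster = label[a] == label[b]
        cost += is_match != same_cluster
    return cost

cc = connected_components(RECORDS, 0.8)
cl = complete_linkage(RECORDS, 0.8)
print(cc, disagreements(cc, RECORDS, 0.8))
print(cl, disagreements(cl, RECORDS, 0.8))
```

On this toy input, connected components groups A, B, and C together despite the 0.30 score between A and C, while complete linkage leaves C on its own. Which of A or C ends up paired with B is decided by iteration order here, since A-B and B-C tie at 0.85. Both partitions carry one disagreement under the simplified cost, which is why real correlation clustering leans on graded confidence, not a single cutoff.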

What gets reviewed

A cluster falls into one of three states:

  • Confident. Every internal pair scored above the high-confidence threshold. Auto-merged.
  • Ambiguous. At least one internal pair landed in the middle band, between the merge threshold and the high-confidence threshold. Routed to the review queue for a human to approve the merge, split the cluster, or leave it pending.
  • Rejected. At least one internal pair scored below the merge threshold. The cluster is split before it ever lands on a reviewer's screen.
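
A minimal sketch of that three-way triage, under stated assumptions: the threshold values, the function name, and the score-table layout are hypothetical, an unscored internal pair is treated as 0.0, a score exactly at a threshold counts as meeting it, and a rejected pair takes precedence over an ambiguous one when a cluster contains both.

```python
from itertools import combinations

HIGH_CONFIDENCE = 0.90  # hypothetical: at or above this, a pair is a confident match
MERGE_THRESHOLD = 0.50  # hypothetical: below this, a pair is treated as a non-match

def classify_cluster(members, pair_scores):
    """Return 'confident', 'ambiguous', or 'rejected' for one cluster.

    pair_scores maps frozenset({a, b}) to a score in [0, 1].
    """
    internal = [
        pair_scores.get(frozenset(pair), 0.0)  # assumption: unscored pair counts as 0.0
        for pair in combinations(members, 2)
    ]
    if any(s < MERGE_THRESHOLD for s in internal):
        return "rejected"   # assumption: rejection outranks ambiguity
    if any(s < HIGH_CONFIDENCE for s in internal):
        return "ambiguous"  # at least one pair sits in the middle band
    return "confident"      # every internal pair met the high-confidence threshold

scores = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "C")): 0.70,
    frozenset(("B", "C")): 0.93,
    frozenset(("D", "E")): 0.20,
}
for cluster in (["A", "B"], ["A", "B", "C"], ["D", "E"]):
    print(cluster, classify_cluster(cluster, scores))
```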

Tracking the ratio of confident-to-ambiguous clusters over time is the main quality signal for a tuned ER pipeline. A rising ambiguous rate means the data drifted, the threshold needs revisiting, or both.
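
One way to watch that signal is to compute an ambiguous rate per pipeline run and compare recent runs against an earlier baseline. The sketch below is illustrative only: the function names, the three-run window, and the 0.02 tolerance are assumptions, and rejected clusters are left out of the denominator because they never reach review.

```python
def ambiguous_rate(state_counts):
    """Share of confident-plus-ambiguous clusters that are ambiguous."""
    confident = state_counts.get("confident", 0)
    ambiguous = state_counts.get("ambiguous", 0)
    total = confident + ambiguous
    return ambiguous / total if total else 0.0

def is_rising(rates, window=3, tolerance=0.02):
    """Flag drift when the mean of the last `window` runs exceeds the mean of
    all earlier runs by more than `tolerance`."""
    if len(rates) <= window:
        return False  # not enough history to form a baseline
    earlier = rates[:-window]
    baseline = sum(earlier) / len(earlier)
    recent = sum(rates[-window:]) / window
    return recent - baseline > tolerance

# Hypothetical per-run rates, oldest first.
history = [0.10, 0.11, 0.10, 0.15, 0.16, 0.18]
print(ambiguous_rate({"confident": 90, "ambiguous": 10}))
print(is_rising(history))
```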

In Golden Suite

goldenmatch defaults to connected components and reports cluster-level statistics in the postflight blob: count, size distribution, lowest internal pair score, demoted-scorer rate. Ambiguous clusters surface in the review queue with the offending pair highlighted, so the reviewer sees the exact reason a cluster needs human judgment instead of a generic "needs review" flag.
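
For readers who want comparable numbers outside the suite, the sketch below computes three of those statistics from a list of clusters and a score table. It is a standalone illustration, not goldenmatch's API or its postflight format: the function names, dictionary keys, and input layout are all hypothetical, and the demoted-scorer rate is omitted because it depends on scorer internals not described here.

```python
from collections import Counter
from itertools import combinations

def weakest_pair(members, pair_scores):
    """Return (score, pair) for the lowest-scoring scored internal pair, or None.

    pair_scores is keyed by alphabetically sorted tuples, e.g. ("A", "B").
    """
    scored = [
        (pair_scores[pair], pair)
        for pair in combinations(sorted(members), 2)
        if pair in pair_scores
    ]
    return min(scored) if scored else None

def cluster_stats(clusters, pair_scores):
    """Summarize clusters: count, size distribution, weakest pair per cluster."""
    return {
        "count": len(clusters),
        "size_distribution": dict(Counter(len(c) for c in clusters)),
        "weakest_pairs": [weakest_pair(c, pair_scores) for c in clusters],
    }

# Hypothetical clusters and scores.
clusters = [["A", "B", "C"], ["D", "E"], ["F"]]
pair_scores = {
    ("A", "B"): 0.95,
    ("A", "C"): 0.70,
    ("B", "C"): 0.93,
    ("D", "E"): 0.97,
}
stats = cluster_stats(clusters, pair_scores)
print(stats)
```

The weakest pair per cluster is the natural candidate to highlight for a reviewer, since it is the pair that keeps the cluster out of the confident state.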