Entity resolution
Blocking, scoring, and clustering — what each step does and why the order matters.
Entity resolution is the process that takes 50,000 messy rows and produces 38,000 golden records by identifying which rows refer to the same real-world entity. It runs in three stages, in this order.
1. Blocking
The naive approach — compare every pair of rows — is O(n²). For 50,000 rows that's 1.25 billion pairs. Not realistic.
Blocking trades a tiny bit of recall for a massive speedup: only rows that share a blocking signal are compared as candidate pairs. A blocking signal is something like:
- First three letters of last name
- Soundex code of full name
- Domain part of email address
- ZIP code
Pick the right signals and you reduce 1.25B comparisons to a few million while still catching the matches that matter. Pick the wrong ones and you miss real duplicates because they never made it into a common block.
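As a rough sketch (the dict keys `last_name` and `email` are illustrative, not a required schema), blocking can be as simple as bucketing rows by a key and only pairing rows that land in the same bucket:

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(row):
    # Illustrative signal: first three letters of last name + email domain.
    last = (row.get("last_name") or "").strip().lower()[:3]
    domain = (row.get("email") or "").rsplit("@", 1)[-1].lower()
    return (last, domain)

def candidate_pairs(rows):
    # Group rows by blocking key, then only compare rows inside each block.
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[blocking_key(row)].append(i)
    for members in blocks.values():
        # Pairs within a block; rows in different blocks are never compared.
        yield from combinations(members, 2)
```

Note that folding two signals into one key means a pair must agree on both before it is ever compared; in practice you would usually take the union of candidates from several independent keys so that a typo in one field does not hide a real match.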
Golden Suite's auto-config inspects your columns and proposes blocking signals. You can always override.
2. Scoring
For each candidate pair surfaced by blocking, the scorer asks "how similar are these two rows?" and produces a number between 0 and 1.
Scoring is per-field, not row-level. Two rows might score 0.95 on name (clearly a fuzzy match), 1.00 on email (exact match), and 0.20 on address (different cities). The pair-level score combines those, weighted by how much each field tells you.
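One simple way to combine per-field scores is a weighted average; the weights below are invented purely for illustration:

```python
def pair_score(field_scores, weights):
    # Weighted average of per-field similarities, normalised by total weight.
    total = sum(weights.values())
    return sum(field_scores[f] * w for f, w in weights.items()) / total

# The example from above: strong name and email agreement, weak address.
field_scores = {"name": 0.95, "email": 1.00, "address": 0.20}
weights = {"name": 2.0, "email": 3.0, "address": 1.0}
print(round(pair_score(field_scores, weights), 2))  # 0.85
```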
Common scorers:
- Exact — string equality, case-sensitive
- Jaro-Winkler / Levenshtein — character-level fuzzy match for names and addresses
- Token-set — order-independent for multi-word fields ("Smith John" matches "John Smith")
- Phonetic — sound-alike matching for names with spelling variants ("Cathryn" vs "Katherine")
- Custom — anything you can express as a function from two values to a 0–1 number
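As a sketch of the shape a scorer takes (two values in, a number between 0 and 1 out), a minimal token-set scorer can be Jaccard similarity over word tokens:

```python
def token_set_score(a: str, b: str) -> float:
    # Order-independent: "Smith John" and "John Smith" score 1.0.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(token_set_score("Smith John", "John Smith"))     # 1.0
print(token_set_score("John A. Smith", "John Smith"))  # ~0.67
```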
The postflight report tells you which scorers contributed most to each cluster. When a scorer underperforms — its weight gets demoted in the final score — that shows up too. The Engine quality admin page tracks the demoted-scorer rate over time.
3. Clustering
Scoring produces pairs. Clustering produces groups.
If A matches B, and B matches C, but A and C never made it into a common block — clustering still groups them together. The algorithm walks the graph of high-confidence pairs and produces connected components.
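A minimal version of that grouping is union-find over the matched pairs; the pair format (row-index tuples) and the tiny example are illustrative:

```python
def cluster(pairs, n):
    # Union-find: connected components over the graph of matched pairs.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# A-B and B-C matched; A and C never shared a block, but they end up together.
print(cluster([(0, 1), (1, 2)], n=4))  # [[0, 1, 2], [3]]
```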
This is the step that occasionally needs human review. A cluster might be:
- Confident — every pair inside the cluster scored above the high-confidence threshold. Auto-merged.
- Ambiguous — some pairs scored in the middle band. Flagged in the review queue.
- Rejected — one or more pairs scored below the merge threshold. Cluster is split.
Ambiguous clusters land in the review queue for a human to approve, split, or merge.
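A sketch of that triage, assuming one similarity score per pair inside the cluster and two thresholds whose values here are illustrative:

```python
def classify_cluster(pair_scores, merge_threshold=0.6, high_confidence=0.85):
    # pair_scores: similarity for every scored pair inside one cluster.
    if any(s < merge_threshold for s in pair_scores):
        return "rejected"    # at least one pair below the merge bar: split
    if all(s >= high_confidence for s in pair_scores):
        return "confident"   # every pair clears the high bar: auto-merge
    return "ambiguous"       # middle band: send to the review queue

print(classify_cluster([0.97, 0.91]))  # confident
print(classify_cluster([0.97, 0.72]))  # ambiguous
print(classify_cluster([0.97, 0.41]))  # rejected
```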
Why the order matters
You can't score what blocking didn't surface. You can't cluster what scoring didn't connect. Each stage feeds the next, and a bad call early on can't be recovered later.
The trade-offs are real:
- Loose blocking → high recall, slow runtime, more pairs for the scorer
- Tight blocking → fast, but real duplicates can be missed
- High match threshold → high precision, more false negatives
- Low match threshold → high recall, more ambiguous clusters that need review
The defaults are tuned for "messy customer CSV" (the most common starting point). The workbench surfaces every knob.