Blocking strategies for entity resolution
Blocking reduces O(n^2) record comparisons to a tractable subset. Exact, phonetic, sorted-neighborhood, canopy, and learned blocking explained.
Blocking reduces the O(n^2) record-comparison cost to a tractable subset by only comparing records that share a "blocking signal." For 50,000 rows, naive all-pairs comparison means n(n-1)/2 pairs, or about 1.25 billion. A decent blocker drops that to around 5 million.
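The arithmetic behind those figures can be checked in a few lines. This is only a sketch: the function name is made up for illustration, and the 5 million figure is the rough number quoted above, not a measured result.

```python
def naive_pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


total_pairs = naive_pair_count(50_000)
print(total_pairs)  # 1249975000, roughly 1.25 billion

# Illustrative candidate count taken from the text, not from a real run.
candidate_pairs = 5_000_000
print(f"share of pairs still scored: {candidate_pairs / total_pairs:.2%}")
```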
Five strategies
- Exact blocking. Records must share an identical token, such as the first three letters of a last name, a zip code, or a soundex code. Fast and easy to debug. Misses matches when the blocking column is dirty or inconsistently entered.
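- Phonetic blocking. Block on the Soundex or Metaphone encoding of a name field. Catches spelling variants that exact blocking would split into separate blocks: Soundex groups "Smith" with "Smyth" (both S530), and Metaphone also groups "Cathryn" with "Katherine". Soundex keeps the first letter, so it would still separate that C/K pair.
- Sorted-neighborhood. Sort records by a composite key, then slide a window of size w and compare every pair inside the window. Tolerates errors toward the end of the key, because such records still sort next to each other. An error in the first characters can sort a record far from its match, so a common remedy is to run several passes with different keys and union the results.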
- Canopy clustering. Use a cheap similarity threshold to form overlapping "canopies," then run the expensive scorer only inside each canopy. A record can live in more than one canopy, so near-duplicate pairs still get a chance to meet.
- Learned blocking. Train a model to predict which pairs are worth scoring. Most expensive to set up, since it needs labeled pairs, but it can reach higher recall than hand-written rules once tuned on that data.
The recall and runtime tradeoff
Loose blocking surfaces more candidate pairs (higher recall, slower downstream scoring). Tight blocking is fast but silently drops real duplicates that never landed in a shared block. The right choice depends on data shape, field dirtiness, and how much recall the use case demands. Christen (2012) surveys blocking and indexing techniques in depth.
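Two measures commonly used in the record-linkage literature put numbers on this tradeoff: reduction ratio (the share of all pairs that blocking removes) and pair completeness (the share of true matches that survive blocking). The sketch below assumes a labeled set of true matching pairs is available; the pair sets are invented for illustration.

```python
def reduction_ratio(n_candidates: int, n_records: int) -> float:
    """Fraction of the n * (n - 1) / 2 possible pairs that blocking removed."""
    total = n_records * (n_records - 1) // 2
    return 1 - n_candidates / total


def pair_completeness(candidates: set, true_matches: set) -> float:
    """Fraction of known true matches that appear among the candidates."""
    return len(candidates & true_matches) / len(true_matches)


# Hypothetical ground truth and blocker outputs for a 6-record dataset.
true_matches = {(1, 2), (1, 3), (2, 3), (1, 4)}
tight = {(1, 2)}
loose = {(1, 2), (1, 3), (2, 3), (1, 4), (2, 5), (3, 5)}

for name, cand in [("tight", tight), ("loose", loose)]:
    rr = reduction_ratio(len(cand), n_records=6)
    pc = pair_completeness(cand, true_matches)
    print(f"{name}: reduction ratio={rr:.2f}, pair completeness={pc:.2f}")
```

Here the tight blocker removes more pairs but keeps only a quarter of the true matches, while the loose blocker keeps all of them at the cost of more comparisons.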
In Golden Suite
goldenmatch autoconfig inspects each field's cardinality and entropy and proposes a blocker per column. You can stack multiple blockers, such as canopy on full name plus exact on email domain, and the union of candidate pairs feeds the scorer. That union step is where the system trades extra comparisons for higher recall before the expensive similarity pass.
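The union step can be illustrated independently of any tool. The following is a generic sketch and does not use or represent the goldenmatch API: the function names, the bigram-Jaccard similarity, the thresholds, and the sample records are all assumptions chosen for the example. Canopy centers are picked in insertion order rather than at random so the run is deterministic.

```python
from collections import defaultdict
from itertools import combinations


def bigrams(s):
    s = s.lower()
    return {s[i : i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def canopy_pairs(records, field, loose=0.4, tight=0.6):
    """Overlapping canopies built with a cheap bigram-Jaccard similarity."""
    grams = {rid: bigrams(rec[field]) for rid, rec in records.items()}
    centers = list(records)  # records still eligible to start a canopy
    pairs = set()
    while centers:
        c = centers[0]
        canopy = [r for r in records if jaccard(grams[c], grams[r]) >= loose]
        pairs.update(combinations(sorted(canopy), 2))
        # Records tightly bound to this center cannot start a new canopy.
        centers = [r for r in centers
                   if r != c and jaccard(grams[c], grams[r]) < tight]
    return pairs


def exact_pairs(records, key):
    """Candidate pairs among records sharing an identical key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    return {p for ids in blocks.values() for p in combinations(sorted(ids), 2)}


# Hypothetical records for illustration only.
records = {
    1: {"name": "Jon Smith", "email": "jon@acme.com"},
    2: {"name": "John Smith", "email": "jsmith@gmail.com"},
    3: {"name": "J. Smith", "email": "j.smith@acme.com"},
    4: {"name": "Mary Jones", "email": "mary@acme.com"},
    5: {"name": "Ann Lee", "email": "ann@example.org"},
}

by_name = canopy_pairs(records, "name")
by_domain = exact_pairs(records, key=lambda r: r["email"].split("@")[1])
candidates = by_name | by_domain  # union: a pair needs only one shared block

print(sorted(by_name))
print(sorted(by_domain))
print(sorted(candidates))
```

Each blocker contributes pairs the other misses, and the union sends 5 of the 10 possible pairs to the scorer. Adding a blocker can only grow the candidate set, which is the comparisons-for-recall trade described above.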