How to choose an entity resolution approach

A decision framework for picking the right entity resolution method: dedup vs ER vs MDM, evaluating tools, tuning thresholds, and streaming vs batch.

Most entity resolution projects fail in the scoping phase. Teams pick a tool before they have a clear answer to "what are we actually building?" The result is either an MDM platform deployed against a one-time CSV cleanup (overkill, slow, expensive) or a dedup script wired into a problem that needs cross-system linkage and stewardship (the wrong shape: it ships, then breaks later).

This page is the decision hub. Work through the four questions below before evaluating any tool.

Step 1: Which of the four overlapping things do you need?

Deduplication, entity resolution, record linkage, and master data management get used interchangeably. They are not the same. Dedup operates on one dataset and removes redundant copies. ER identifies which records refer to the same entity, often within one dirty dataset. Record linkage matches records across two or more datasets. MDM is the operational and governance layer on top of an ER pipeline, with versioning, survivorship rules, and audit trails. Pick the scope before you pick the tool.
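The difference between dedup and record linkage shows up in the shape of the inputs and outputs. The following is a minimal sketch, not a Golden Suite API: it assumes every record carries an `email` field and matches on its normalized value, which is a deliberate simplification. The function names and sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def normalize_email(value: str) -> str:
    """Trim whitespace and lowercase so trivial variants compare equal."""
    return value.strip().lower()


def dedup(records: List[Record]) -> List[Record]:
    """Deduplication: one dataset in, one smaller dataset out."""
    seen: Dict[str, Record] = {}
    for rec in records:
        # Keep the first record seen for each normalized email.
        seen.setdefault(normalize_email(rec["email"]), rec)
    return list(seen.values())


def link(left: List[Record], right: List[Record]) -> List[Tuple[str, str]]:
    """Record linkage: two datasets in, pairs of matching ids out."""
    index = {normalize_email(r["email"]): r["id"] for r in right}
    pairs = []
    for rec in left:
        match = index.get(normalize_email(rec["email"]))
        if match is not None:
            pairs.append((rec["id"], match))
    return pairs


# Hypothetical sample data.
crm = [
    {"id": "c1", "email": "Ana@Example.com"},
    {"id": "c2", "email": "ana@example.com "},
    {"id": "c3", "email": "bo@example.com"},
]
billing = [{"id": "b9", "email": "BO@example.com"}]

deduped = dedup(crm)        # c2 is dropped as a copy of c1
links = link(crm, billing)  # c3 in CRM pairs with b9 in billing
```

Dedup returns records; linkage returns pairs of ids and leaves both sources untouched. ER and MDM build on the same matching step but add clustering, survivorship, and history, which this sketch leaves out.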

Step 2: How will you actually evaluate tools?

Most ER tool evaluations end up as a bake-off on a 100k-row demo dataset, won by the tool with the highest match count. A match count says nothing about whether the matches are correct, and it misses the questions that bite later: how does the tool behave at 10x your current scale; can you see why it merged two records; what does the audit trail look like; can it run inside your VPC; what is the cost curve as you grow? A seven-question pre-screen filters tools before you spend weeks on bake-offs.
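One way to see why match count is a weak bake-off metric is to score each tool's output against a small hand-labeled set of true matches. The sketch below assumes you have such a labeled set; the tool names and pairs are invented for illustration.

```python
from typing import FrozenSet, Set, Tuple

Pair = FrozenSet[str]


def pair(a: str, b: str) -> Pair:
    """Order-independent pair of record ids."""
    return frozenset((a, b))


def precision_recall(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float]:
    """Precision: share of predicted matches that are real.
    Recall: share of real matches that were found."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical hand-labeled ground truth: three real duplicate pairs.
truth = {pair("r1", "r2"), pair("r3", "r4"), pair("r5", "r6")}

# Tool A reports five matches, three of them wrong.
tool_a = {
    pair("r1", "r2"), pair("r3", "r4"),
    pair("r1", "r7"), pair("r2", "r8"), pair("r5", "r9"),
}
# Tool B reports two matches, both correct.
tool_b = {pair("r1", "r2"), pair("r3", "r4")}

for name, predicted in (("tool_a", tool_a), ("tool_b", tool_b)):
    p, r = precision_recall(predicted, truth)
    print(f"{name}: matches={len(predicted)} precision={p:.2f} recall={r:.2f}")
```

In this made-up example Tool A wins on match count but finds the same true pairs as Tool B while adding three false merges. A labeled sample of a few hundred pairs from your own data is usually enough to expose that kind of gap.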

Step 3: How tight should the threshold be?

The match threshold is the highest-leverage tuning knob in any ER project. Higher thresholds raise precision and drop recall (cleaner output, more duplicates surviving). Lower thresholds do the opposite. The right value depends on the cost shape of the use case: if a false positive breaks billing or patient safety, lean high; if duplicate records cause downstream chaos, lean low. Most real systems run two thresholds: one for auto-merge and a lower one for "send to review queue."
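The two-threshold pattern can be sketched as a small routing function. The cutoff values 0.92 and 0.75 are placeholders, not recommendations, and the function and label names are hypothetical; the right values come from the cost shape of your use case.

```python
def route(score: float, auto_merge: float = 0.92, review: float = 0.75) -> str:
    """Route a candidate pair by its match score.

    Scores at or above `auto_merge` merge without human input, scores
    between the two thresholds go to a review queue, and everything
    else is treated as a non-match.
    """
    if not 0.0 <= review <= auto_merge <= 1.0:
        raise ValueError("expected 0 <= review <= auto_merge <= 1")
    if score >= auto_merge:
        return "auto_merge"
    if score >= review:
        return "review_queue"
    return "no_match"


# Hypothetical scores from a matcher.
for score in (0.97, 0.81, 0.40):
    print(score, route(score))
```

Raising `auto_merge` trades auto-merged duplicates for reviewer workload; lowering `review` trades reviewer workload for recall. Keeping the two as separate parameters lets you move them independently.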

Step 4: Does the data flow in batches, or in real time?

Batch ER processes a snapshot. Streaming ER handles records as they arrive. Picking the wrong mode is expensive: batch on a real-time problem leaves a window where bad data flows downstream; streaming on a periodic problem pays continuous infrastructure cost for no benefit. Most production systems use streaming on the live path with a periodic batch rebuild that fixes drift. Plan for the rebuild, not just the live path.
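To illustrate the kind of drift a periodic rebuild fixes, here is a simplified sketch under stated assumptions: records match when they share an exact normalized `email` or `phone`, the streaming path assigns each record to the first entity it hits and never revisits earlier decisions, and the batch path clusters the whole snapshot with union-find. All class, function, and field names are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]


def keys_of(rec: Record) -> List[str]:
    """Match keys for a record: normalized email and phone, when present."""
    return [
        f"{field}:{rec[field].strip().lower()}"
        for field in ("email", "phone")
        if rec.get(field)
    ]


class StreamingResolver:
    """Assigns an entity id to each record as it arrives."""

    def __init__(self) -> None:
        self.entity_by_key: Dict[str, int] = {}
        self.assignments: Dict[str, int] = {}
        self._next_id = 0

    def ingest(self, rec: Record) -> int:
        keys = keys_of(rec)
        entity: Optional[int] = next(
            (self.entity_by_key[k] for k in keys if k in self.entity_by_key), None
        )
        if entity is None:
            entity = self._next_id
            self._next_id += 1
        for k in keys:
            # Earlier key-to-entity decisions are never rewritten.
            self.entity_by_key.setdefault(k, entity)
        self.assignments[rec["id"]] = entity
        return entity


def batch_rebuild(records: List[Record]) -> Dict[str, str]:
    """Cluster the full snapshot: records sharing any key end up together."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for rec in records:
        node = "rec:" + rec["id"]
        find(node)
        for k in keys_of(rec):
            parent[find(node)] = find(k)
    return {rec["id"]: find("rec:" + rec["id"]) for rec in records}


# Record c arrives last and bridges a and b.
snapshot = [
    {"id": "a", "email": "ana@example.com"},
    {"id": "b", "phone": "555-0100"},
    {"id": "c", "email": "ana@example.com", "phone": "555-0100"},
]

live = StreamingResolver()
for rec in snapshot:
    live.ingest(rec)
rebuilt = batch_rebuild(snapshot)
```

On the live path, `a` and `c` share an entity while `b` stays separate, because `b` was assigned before the bridging record arrived. The rebuild sees all three at once and puts them in one cluster. A production streaming resolver may handle this case with live merges, but some class of late-evidence drift tends to remain, which is the argument for scheduling the rebuild.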

Decision framework at a glance

If you have...	And the goal is...	Start with...
One dirty dataset, identifiers present	Faster lookups, smaller table	Deterministic dedup
One dirty dataset, identifiers missing or partial	Golden records with lineage	Probabilistic ER (Fellegi-Sunter weights, autoconfig)
Two or more datasets, no shared key	Cross-system match keys	Record linkage with blocking + scoring
Continuous feeds, ongoing stewardship	A system of record	MDM (ER + survivorship + audit + UI)
Cross-org, can't share raw data	Compliant federated matching	Privacy-preserving record linkage (PPRL, Enterprise roadmap)
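For readers unfamiliar with the Fellegi-Sunter weights mentioned in the table, the core idea fits in a few lines. Each field has an m-probability (chance the field agrees when two records truly match) and a u-probability (chance it agrees when they do not). Agreement adds log2(m/u) to the pair's score; disagreement adds log2((1-m)/(1-u)). The m and u values below are invented for illustration, the comparison is exact equality, and both records are assumed to carry every field; real systems estimate the probabilities from data and use fuzzier comparisons.

```python
import math
from typing import Dict, Tuple

# Hypothetical (m, u) probabilities per field.
M_U: Dict[str, Tuple[float, float]] = {
    "last_name": (0.95, 0.01),
    "zip": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def field_weight(m: float, u: float, agrees: bool) -> float:
    """Log-likelihood ratio contributed by one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Sum of per-field weights; higher means more likely the same entity."""
    return sum(
        field_weight(m, u, a[field] == b[field])
        for field, (m, u) in M_U.items()
    )


# Hypothetical records.
x = {"last_name": "okafor", "zip": "94110", "birth_year": "1984"}
y = {"last_name": "okafor", "zip": "94110", "birth_year": "1984"}
z = {"last_name": "okafor", "zip": "10001", "birth_year": "1991"}

print(round(match_weight(x, y), 2))  # all fields agree
print(round(match_weight(x, z), 2))  # only last_name agrees
```

A rare, reliable field (low u, high m) moves the score more than a common one, which is why agreement on a surname counts for more here than agreement on a ZIP code. The summed weight is what a match threshold is then applied to.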

Common scope mistakes

The two failure modes worth flagging:

Buying an MDM platform for a CSV cleanup. Six-figure procurement, three-month implementation, two engineers full-time. The problem could have been solved in a week with a dedup pipeline and a CSV export. Match the tool to the scope.
Shipping a dedup script when you need governance. Cleanup works for one run. The next month the data drifts, a new system feeds in, and someone needs to know why entity X merged with entity Y last Tuesday. There is no audit trail. Now you need MDM features bolted onto a dedup script. Plan for the second run.
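If a dedup script is likely to see a second run, even a minimal append-only merge log answers the "why did X merge with Y" question later. The sketch below is one possible shape, not a Golden Suite format; the field names, the in-memory list standing in for a log file, and the rule label are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass(frozen=True)
class MergeEvent:
    kept_id: str
    merged_id: str
    score: float
    threshold: float
    rule: str
    decided_at: str  # ISO 8601 timestamp, UTC


def record_merge(
    log: List[str], kept_id: str, merged_id: str,
    score: float, threshold: float, rule: str,
) -> MergeEvent:
    """Append one merge decision to the log as a JSON line."""
    event = MergeEvent(
        kept_id, merged_id, score, threshold, rule,
        datetime.now(timezone.utc).isoformat(),
    )
    log.append(json.dumps(asdict(event)))
    return event


def why_merged(log: List[str], entity_id: str) -> List[Dict[str, object]]:
    """Return every logged merge that involved the given id."""
    events = (json.loads(line) for line in log)
    return [e for e in events if entity_id in (e["kept_id"], e["merged_id"])]


# Hypothetical usage: a list stands in for an append-only file.
audit_log: List[str] = []
record_merge(audit_log, "X", "Y", score=0.94, threshold=0.92, rule="email+last_name")
record_merge(audit_log, "P", "Q", score=0.99, threshold=0.92, rule="exact_tax_id")
history = why_merged(audit_log, "Y")
```

Logging the threshold alongside the score matters: after the threshold is retuned, the log still shows which decisions were made under the old value.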

Where Golden Suite fits

Golden Suite covers the full range. The free tier handles dedup and one-shot ER cleanly. Pro adds the workbench (live threshold tuning, review queue, audit trail) for production deployments that need stewardship. Enterprise adds PPRL, streaming connectors, and the controls regulated data needs. The pricing page maps tiers to features. Start with the dedupe demo to see how the matching behaves on your own data before committing to anything.
