MDM & Entity Resolution Glossary

30 terms. Short definitions — depth lives in /docs/concepts and /docs/guides.

A

Ambiguous merge
A cluster of records that the matching engine flagged as "probably the same entity, but confidence is in the middle band — please review."
Audit log
An append-only record of every consequential action taken on master data — rule changes, manual merges, approved splits, exports.

B

Blocking
The first stage of entity resolution — narrowing the candidate-pair space so we only compare records that share some easy-to-compute signal.
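As a rough illustration of blocking, the sketch below groups records by a cheap key and only emits pairs within each group. The key (first letter of the surname plus ZIP code) and the field names are assumptions made for the example, not a recommended configuration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first letter of the surname plus the ZIP code.
    return (record["surname"][:1].upper(), record["zip"])


def candidate_pairs(records):
    """Return pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "surname": "Schmidt", "zip": "10115"},
    {"id": 2, "surname": "Schmitt", "zip": "10115"},
    {"id": 3, "surname": "Meyer", "zip": "10115"},
    {"id": 4, "surname": "Schmidt", "zip": "80331"},
]
print(candidate_pairs(records))  # {(1, 2)}
```

Four records allow six possible pairs; this key leaves one to score. The trade-off is that a true duplicate with a different key (record 4, if it had moved) is never compared.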

C

CLK (Cryptographic Longterm Key)
A Bloom-filter encoding of a record's identifying fields, used to perform record linkage without exchanging raw data. The standard primitive behind PPRL.
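A minimal sketch of the idea, assuming character bigrams, a keyed HMAC-SHA256 hash, and a Dice coefficient for comparison. The function names, filter size, and hash count are illustrative choices; production CLK schemes differ in hashing and hardening details.

```python
import hashlib
import hmac


def bigrams(value):
    # Pad with spaces so the first and last characters form their own bigrams.
    padded = f" {value.lower().strip()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def clk_encode(fields, secret, size=1024, num_hashes=4):
    """Return the set of bit positions set in a Bloom filter of `size` bits."""
    bits = set()
    for value in fields:
        for gram in bigrams(value):
            for k in range(num_hashes):
                # Keyed hash: without `secret`, positions cannot be recomputed.
                digest = hmac.new(secret, f"{k}:{gram}".encode(), hashlib.sha256).digest()
                bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0


secret = b"shared-secret"  # assumed to be agreed between the two parties
a = clk_encode(["katherine", "schmidt"], secret)
b = clk_encode(["catherine", "schmitt"], secret)
c = clk_encode(["john", "doe"], secret)
print(dice(a, b) > dice(a, c))  # similar names share far more set bits
```

Each party encodes locally with the shared secret and exchanges only the bit patterns, so similarity can be scored without either side seeing the other's raw fields.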
Clustering (in deduplication)
The final entity-resolution stage where high-confidence pairs are grouped into connected components — one cluster per real-world entity.
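Grouping matched pairs into connected components can be sketched with a small union-find structure. The function name and input shapes here are assumptions for illustration only.

```python
def cluster(record_ids, matched_pairs):
    """Group record ids into connected components of the match graph."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(members) for members in clusters.values())


print(cluster([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)]))
# [[1, 2, 3], [4, 5], [6]]
```

Records 1 and 3 land in the same cluster even though they were never matched directly; that transitivity is what a connected-components approach implies, and it is why one bad pair can chain two entities together.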

D

Data profiling
Generating a statistical summary of a dataset — value distributions, null rates, cardinality, format patterns — before doing anything with it.
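A minimal sketch of a per-column profile, assuming rows are plain dictionaries and that `None` and the empty string both count as missing. The `profile` helper is hypothetical.

```python
from collections import Counter


def profile(rows, column):
    """Summarize one column: null rate, cardinality, most common values."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "null_rate": 1 - len(present) / len(values) if values else 0.0,
        "cardinality": len(set(present)),
        "top_values": Counter(present).most_common(3),
    }


rows = [{"country": "DE"}, {"country": "DE"}, {"country": None}, {"country": "US"}]
print(profile(rows, "country"))
# {'null_rate': 0.25, 'cardinality': 2, 'top_values': [('DE', 2), ('US', 1)]}
```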
Data quality
How fit a dataset is for its intended use — measured along completeness, validity, consistency, uniqueness, timeliness, and accuracy.
Data standardization
Cleaning and normalizing field values to a canonical form before matching: phone numbers to E.164, addresses to the USPS standard, names to title case.
Data stewardship
The human discipline of curating, approving, and correcting master data — the role that owns "is the customer record actually right?"
Deterministic matching
Matching based on exact equality of one or more shared identifiers (same SSN, same email, same tax ID). Fast and explainable; brittle when identifiers are missing.

E

Entity resolution (ER)
The process of identifying which records across one or more datasets refer to the same real-world entity. Foundational to dedup, MDM, and customer-360 work.

F

F1 score
The harmonic mean of precision and recall. The standard single-number quality metric for entity resolution engines, tracked per Suite version on a benchmark fixture.
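Written out, with P for precision, R for recall, and TP, FP, FN for true-positive, false-positive, and false-negative pairs (standard symbols, not taken from this glossary):

```latex
F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
```

For example, P = 0.9 and R = 0.6 give F1 = (2 × 0.54) / 1.5 = 0.72, noticeably below the arithmetic mean of 0.75, because the harmonic mean penalizes the weaker of the two.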
Fuzzy matching
Comparing two values that are similar but not exactly equal: typos, capitalization, word order, missing punctuation. The workhorse of probabilistic matching.

G

Golden record
A synthesized canonical record produced by merging the contributing source rows for one real-world entity, with per-field survivorship and full source lineage.

I

Incremental matching
Re-running entity resolution on a source that grew by a small amount without re-scoring the entire dataset. Cheaper than a full resolve, but it risks missing cases where new records should reshape existing clusters by merging or splitting them.

J

Jaro-Winkler similarity
A string similarity score (0 to 1) that favors matches with identical prefixes. Well-suited to person and company names where typos rarely occur at the start.
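A compact sketch of the usual formulation, assuming the common defaults of a 0.1 prefix scale and a 4-character prefix cap. It is written for readability rather than speed.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro score for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```

The plain Jaro score for this pair is about 0.944; the shared prefix "MAR" lifts it to about 0.961.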

L

Levenshtein distance
A string distance metric that counts the minimum number of single-character edits (insert, delete, substitute) needed to transform one string into another. Lower means more similar; 0 means identical.
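The standard dynamic-programming approach can be sketched as follows, keeping only two rows of the edit table at a time.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))   # 3
print(levenshtein("Schmidt", "Schmitt"))  # 1
```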
Lineage (data lineage)
The chain that connects every value on a golden record back to its source rows, sources, and the pipeline event that produced the current state.

M

Master data management (MDM)
The discipline of producing one trusted version of each business entity (customer, vendor, product) from scattered sources.

P

Phonetic matching
Matching strings that sound alike but spell differently: "Cathryn" vs "Katherine", "Schmidt" vs "Schmitt". Often used as a blocking signal, not a final scorer.
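As one concrete example, here is a sketch of American Soundex, one of the oldest phonetic codes. It is chosen for brevity, not because this glossary prescribes it.

```python
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character American Soundex code for `name`."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # vowels reset prev, so a repeat across a vowel is kept
    return (first + "".join(encoded) + "000")[:4]


print(soundex("Schmidt"), soundex("Schmitt"))    # S530 S530
print(soundex("Cathryn"), soundex("Katherine"))  # C365 K365
```

Because Soundex keeps the first letter, it pairs "Schmidt" with "Schmitt" but not "Cathryn" with "Katherine"; catching the latter needs a code that encodes the initial sound as well, such as Metaphone.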
PPRL (Privacy-Preserving Record Linkage)
Cross-organization entity resolution where neither side has to share raw records with the other or with a third party. Standard practice in healthcare and finance.
Precision and recall (in matching)
Two complementary metrics: precision measures how many of your merges are correct; recall measures how many real duplicates you actually merged.
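A small sketch of pairwise precision and recall, assuming both the engine's merges and the ground truth are given as pairs of record ids. The `pair_metrics` helper is hypothetical.

```python
def pair_metrics(predicted, truth):
    """Return (precision, recall) over unordered record-id pairs."""
    # frozenset makes (1, 2) and (2, 1) the same pair.
    predicted = {frozenset(p) for p in predicted}
    truth = {frozenset(p) for p in truth}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


predicted = [(1, 2), (2, 3), (4, 5), (6, 7)]
truth = [(1, 2), (3, 2), (1, 3), (4, 5), (8, 9)]
print(pair_metrics(predicted, truth))  # (0.75, 0.6)
```

Three of the four predicted merges are correct (precision 0.75), and three of the five real duplicate pairs were found (recall 0.6).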
Probabilistic matching
Matching that assigns a numeric similarity score (0 to 1) to each candidate pair, instead of a binary yes/no. The modern default for record linkage.
Provenance
A near-synonym of lineage, with extra emphasis on *who* and *when* — provenance is the audit chain of human + system actions, not just data flow.

R

Record linkage
A near-synonym of entity resolution, particularly in academic and healthcare contexts. Same underlying problem and techniques, different community vocabulary.
Reference data
The small, slow-changing lookup datasets that classify master data — country codes, currency codes, industry classifications, product categories.

S

Schema inference
Automatic proposal of a target-schema mapping for a new source — "this column looks like an email, this one looks like a date of birth."
Schema mapping
Translating each source's column names into the canonical target schema — "cust_name" + "customer_name" + "FullName" all become "name."
Survival-rule bias
The systematic effect of choosing one survivorship rule over another — e.g., "most recent" biases the golden record toward whichever source updates most often.
Survivorship
The rule set that decides which source value wins for each field on the golden record. Configured per-field, not per-record, and reviewed as data changes.
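Per-field survivorship can be sketched as a mapping from field name to a rule function. The two rules, the field names, and the source names below are assumptions for illustration.

```python
from datetime import date


def most_recent(candidates):
    # Pick the value from the most recently updated source row.
    return max(candidates, key=lambda c: c["updated"])["value"]


def source_priority(order):
    # Build a rule that prefers sources in the given order.
    rank = {source: i for i, source in enumerate(order)}

    def rule(candidates):
        return min(candidates, key=lambda c: rank.get(c["source"], len(rank)))["value"]

    return rule


RULES = {"email": most_recent, "name": source_priority(["crm", "billing"])}


def build_golden_record(rows, rules):
    """Apply one survivorship rule per field across the contributing rows."""
    golden = {}
    for field, rule in rules.items():
        candidates = [
            {"value": r[field], "source": r["source"], "updated": r["updated"]}
            for r in rows if r.get(field) not in (None, "")
        ]
        if candidates:
            golden[field] = rule(candidates)
    return golden


rows = [
    {"source": "crm", "updated": date(2023, 1, 10),
     "name": "Katherine Schmidt", "email": "k.schmidt@old.example"},
    {"source": "billing", "updated": date(2024, 3, 2),
     "name": "K. Schmidt", "email": "kschmidt@new.example"},
]
print(build_golden_record(rows, RULES))
# {'email': 'kschmidt@new.example', 'name': 'Katherine Schmidt'}
```

The golden record takes its email from billing and its name from the CRM, which is what configuring rules per field rather than per record makes possible.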