MDM & Entity Resolution Glossary
30 terms. Short definitions — depth lives in /docs/concepts and /docs/guides.
A
- Ambiguous merge
- A cluster of records that the matching engine flagged as "probably the same entity, but confidence is in the middle band — please review."
- Audit log
- An append-only record of every consequential action taken on master data — rule changes, manual merges, approved splits, exports.
B
- Blocking
- The first stage of entity resolution — narrowing the candidate-pair space so we only compare records that share some easy-to-compute signal.
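To make the blocking entry concrete, here is a minimal sketch of key-based blocking. The sample records, the field names, and the choice of key (first letter of the name plus ZIP code) are assumptions for illustration, not the Suite's actual blocking rules.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "name": "Jon Smith", "zip": "10001"},
    {"id": 2, "name": "John Smith", "zip": "10001"},
    {"id": 3, "name": "Jane Doe", "zip": "94105"},
    {"id": 4, "name": "Janet Doe", "zip": "94105"},
    {"id": 5, "name": "Sam Lee", "zip": "10001"},
]


def blocking_key(record):
    """Cheap signal (an assumed choice): first letter of the name plus ZIP."""
    return (record["name"][0].lower(), record["zip"])


def candidate_pairs(records):
    """Group records by blocking key, then pair records only within a block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Only the pairs that share a key go on to the more expensive comparison stage. The trade-off is that a true duplicate whose key differs (say, a typo in the first letter) is never compared, which is why production setups often combine several keys.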
C
- CLK (Cryptographic Linkage Key)
- A Bloom-filter encoding of a record's identifying fields, used to perform record linkage without exchanging raw data. The standard primitive behind PPRL.
- Clustering (in deduplication)
- The final entity-resolution stage where high-confidence pairs are grouped into connected components — one cluster per real-world entity.
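The connected-components grouping described in the clustering entry can be sketched with a small union-find structure. The function name `cluster` and the integer record IDs are hypothetical; this is one common way to compute components, not necessarily how any particular engine does it.

```python
def cluster(record_ids, matched_pairs):
    """Group record IDs into connected components using union-find."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    # Sort by smallest member so the output order is stable.
    return sorted(clusters.values(), key=min)


clusters = cluster([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
print(clusters)
```

Records 1 and 3 land in the same cluster even though they were never matched directly, because both matched record 2. That transitivity is the reason a single false-positive pair can chain two unrelated entities together.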
D
- Data profiling
- Generating a statistical summary of a dataset — value distributions, null rates, cardinality, format patterns — before doing anything with it.
- Data quality
- How fit a dataset is for its intended use — measured along completeness, validity, consistency, uniqueness, timeliness, and accuracy.
- Data standardization
- Cleaning and normalizing field values to a canonical form before matching: phone numbers to E.164, addresses to USPS standard format, names to title case.
- Data stewardship
- The human discipline of curating, approving, and correcting master data — the role that owns "is the customer record actually right?"
- Deterministic matching
- Matching based on exact equality of one or more shared identifiers (same SSN, same email, same tax ID). Fast and explainable; brittle when identifiers are missing.
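As a rough illustration of the deterministic matching entry, the sketch below treats two records as a match when any shared identifier is present on both and equal. The field names and the light trim-and-lowercase step are assumptions; a real pipeline would normally standardize values upstream.

```python
def deterministic_match(a, b, identifiers=("tax_id", "email")):
    """Return True if any identifier is present on both records and equal."""
    for field in identifiers:
        left, right = a.get(field), b.get(field)
        # Missing or empty identifiers can never produce a match.
        if left and right and left.strip().lower() == right.strip().lower():
            return True
    return False


# Hypothetical records for illustration.
rec_a = {"email": "Ana@Example.com", "tax_id": None}
rec_b = {"email": "ana@example.com ", "tax_id": "12-3456789"}
rec_c = {"email": None, "tax_id": None}

print(deterministic_match(rec_a, rec_b))  # identifiers agree after trimming
print(deterministic_match(rec_a, rec_c))  # nothing to compare
```

The second call shows the brittleness noted above: when a record carries no identifiers, the rule has nothing to compare and cannot link it to anything.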
E
- Entity resolution (ER)
- The process of identifying which records across one or more datasets refer to the same real-world entity. Foundational to dedup, MDM, and customer-360 work.
F
- F1 score
- The harmonic mean of precision and recall. The standard single-number quality metric for entity resolution engines, tracked per Suite version on a benchmark fixture.
- Fuzzy matching
- Comparing two values that are similar but not exactly equal: typos, capitalization, word order, missing punctuation. The workhorse of probabilistic matching.
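A minimal fuzzy comparison can be sketched with Python's standard-library `difflib`. The `normalize` helper is an assumed pre-processing step that handles capitalization, punctuation, and word order; `SequenceMatcher.ratio()` then absorbs the remaining typos. Other scorers would slot in the same way.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase, replace punctuation with spaces, and sort the tokens."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in value.lower()
    )
    return " ".join(sorted(cleaned.split()))


def fuzzy_score(a, b):
    """Similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(fuzzy_score("Smith, John", "john smith"))
print(fuzzy_score("Acme Corp.", "ACME Corpp"))
print(fuzzy_score("Acme Corp", "Globex Inc"))
```

Sorting tokens makes word order irrelevant, which suits names but would be wrong for fields where order carries meaning, such as street addresses.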
G
- Golden record
- A synthesized canonical record produced by merging the contributing source rows for one real-world entity, with per-field survivorship and full source lineage.
I
- Incremental matching
- Re-running entity resolution on a source that grew by a small amount without re-scoring the entire dataset. Cheaper than a full resolve, but it risks missing cluster reshapes, such as a new record that links two previously separate clusters.
J
- Jaro-Winkler similarity
- A string similarity score (0 to 1) that favors matches with identical prefixes. Well-suited to person and company names where typos rarely occur at the start.
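For readers who want to see the prefix bonus at work, here is a plain-Python sketch of Jaro-Winkler. It uses the conventional defaults (prefix scale 0.1, prefix capped at 4 characters); those defaults and the function names are assumptions of this sketch, and library implementations may differ in edge cases.

```python
def jaro(s1, s2):
    """Jaro similarity: matched characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Boost the Jaro score in proportion to the length of the shared prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

For the pair shown, the plain Jaro score is about 0.944 and the three-character shared prefix lifts it to about 0.961.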
L
- Levenshtein distance
- A string distance metric: the minimum number of single-character edits (insert, delete, substitute) needed to transform one string into another. Lower values mean more similar strings.
- Lineage (data lineage)
- The chain that connects every value on a golden record back to its source rows, sources, and the pipeline event that produced the current state.
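The edit count in the Levenshtein entry is usually computed with dynamic programming. The sketch below keeps only two rows of the table; the `similarity` helper, which rescales the count to a 0-to-1 score by dividing by the longer length, is one common convention and an assumption here.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete from a
                    current[j - 1] + 1,      # insert into a
                    previous[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Rescale the edit count to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```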
M
- Master data management (MDM)
- The discipline of producing one trusted version of each business entity (customer, vendor, product) from scattered sources.
P
- Phonetic matching
- Matching strings that sound alike but are spelled differently: "Cathryn" vs "Katherine", "Schmidt" vs "Schmitt". Often used as a blocking signal, not a final scorer.
- PPRL (Privacy-Preserving Record Linkage)
- Cross-organization entity resolution where neither side has to share raw records with the other or with a third party. Standard practice in healthcare and finance.
- Precision and recall (in matching)
- Two complementary metrics: precision is the fraction of your merges that are correct; recall is the fraction of real duplicates that you actually merged.
- Probabilistic matching
- Matching that assigns a numeric similarity score (0 to 1) to each candidate pair, instead of a binary yes/no. The modern default for record linkage.
- Provenance
- A near-synonym of lineage, with extra emphasis on *who* and *when* — provenance is the audit chain of human + system actions, not just data flow.
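Precision, recall, and the F1 score that combines them are often computed pairwise, over predicted duplicate pairs versus a labeled truth set. The sketch below assumes that pairwise framing; the function name and the sample pairs are hypothetical, and cluster-level variants of these metrics also exist.

```python
def pairwise_metrics(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) over unordered record-ID pairs."""
    # frozenset makes (1, 2) and (2, 1) count as the same pair.
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical engine output and labeled truth.
predicted = [(1, 2), (3, 4), (5, 6)]
truth = [(2, 1), (3, 4), (7, 8), (9, 10)]

precision, recall, f1 = pairwise_metrics(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three predicted merges are correct (precision 2/3) and two of the four real duplicates were found (recall 1/2), giving an F1 of 4/7.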
R
- Record linkage
- A near-synonym of entity resolution, particularly in academic and healthcare contexts. Same underlying problem and techniques, different community vocabulary.
- Reference data
- The small, slow-changing lookup datasets that classify master data — country codes, currency codes, industry classifications, product categories.
S
- Schema inference
- Automatic proposal of a target-schema mapping for a new source — "this column looks like an email, this one looks like a date of birth."
- Schema mapping
- Translating each source's column names into the canonical target schema — "cust_name" + "customer_name" + "FullName" all become "name."
- Survival-rule bias
- The systematic effect of choosing one survivorship rule over another — e.g., "most recent" biases the golden record toward whichever source updates most often.
- Survivorship
- The rule set that decides which source value wins for each field on the golden record. Configured per-field, not per-record, and reviewed as data changes.
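Per-field survivorship can be sketched as a mapping from field name to a rule function. Everything below is illustrative: the source names, the two rules (`most_recent` and `source_priority`), and the record layout are assumptions, not the Suite's configuration format.

```python
from datetime import date

# Hypothetical source rows for one real-world entity.
source_rows = [
    {"source": "crm", "updated": date(2024, 3, 1),
     "email": "ana@example.com", "phone": None},
    {"source": "billing", "updated": date(2024, 5, 9),
     "email": "a.silva@example.com", "phone": "+15550100"},
    {"source": "support", "updated": date(2023, 11, 20),
     "email": None, "phone": "+15550199"},
]


def most_recent(rows, field):
    """Pick the row with the newest update among rows that have a value."""
    candidates = [r for r in rows if r.get(field) is not None]
    return max(candidates, key=lambda r: r["updated"], default=None)


def source_priority(order):
    """Build a rule that prefers sources in the given order."""
    rank = {name: i for i, name in enumerate(order)}

    def rule(rows, field):
        candidates = [
            r for r in rows if r.get(field) is not None and r["source"] in rank
        ]
        return min(candidates, key=lambda r: rank[r["source"]], default=None)

    return rule


# One rule per field, not per record.
RULES = {"email": source_priority(["crm", "billing"]), "phone": most_recent}


def build_golden_record(rows, rules):
    """Apply each field's rule and record which source supplied the winner."""
    golden, lineage = {}, {}
    for field, rule in rules.items():
        winner = rule(rows, field)
        golden[field] = winner[field] if winner else None
        lineage[field] = winner["source"] if winner else None
    return golden, lineage


golden, lineage = build_golden_record(source_rows, RULES)
print(golden)
print(lineage)
```

The email comes from the CRM because it ranks first, while the phone comes from billing because it was updated most recently. Swapping the phone rule to a source priority would change the winner, which is the survival-rule bias described above.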