MDM & Entity Resolution Glossary
30 terms. Short definitions — depth lives in /docs/concepts and /docs/guides.
A
- Ambiguous merge
- A cluster of records that the matching engine flagged as "probably the same entity, but confidence is in the middle band — please review."
- Audit log
- An append-only record of every consequential action taken on master data — rule changes, manual merges, approved splits, exports.
B
- Blocking
- The first stage of entity resolution — narrowing the candidate-pair space so we only compare records that share some easy-to-compute signal.
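As a rough illustration of blocking, the sketch below groups records by a cheap key and only generates pairs inside each group. The key (first three letters of the surname plus postal code) and the field names `id`, `surname`, and `zip` are assumptions made for this example, not a recommended configuration.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Hypothetical key: first 3 letters of the surname plus the postal code."""
    return (record["surname"][:3].lower(), record["zip"])

def candidate_pairs(records):
    """Yield id pairs only for records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

records = [
    {"id": 1, "surname": "Schmidt", "zip": "10115"},
    {"id": 2, "surname": "Schmitt", "zip": "10115"},
    {"id": 3, "surname": "Nguyen", "zip": "94103"},
    {"id": 4, "surname": "Schmid", "zip": "10115"},
]

pairs = sorted(candidate_pairs(records))
print(pairs)  # 3 candidate pairs instead of the 6 an all-pairs comparison needs
```

The trade-off is recall: two true duplicates that land in different blocks are never compared, so real pipelines often combine several keys.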
C
- CLK (Cryptographic Linkage Key)
- A Bloom-filter encoding of a record's identifying fields, used to perform record linkage without exchanging raw data.
- Clustering (in deduplication)
- The final entity-resolution stage where high-confidence pairs are grouped into connected components — one cluster per real-world entity.
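For the clustering entry above, connected components can be computed with a small union-find structure. This is a minimal sketch; the function name `cluster` and the input shape (a list of ids plus a list of matched id pairs) are assumptions for illustration.

```python
def cluster(record_ids, matched_pairs):
    """Group record ids into connected components using union-find."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())

clusters = cluster([1, 2, 3, 4, 5], [(1, 2), (2, 4)])
print(clusters)  # [[1, 2, 4], [3], [5]]
```

Note that records 1 and 4 end up together even though they were never matched directly. That transitivity is what makes a single bad pair costly at this stage.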
D
- Data profiling
- Generating a statistical summary of a dataset — value distributions, null rates, cardinality, format patterns — before doing anything with it.
- Data quality
- How fit a dataset is for its intended use — measured along completeness, validity, consistency, uniqueness, timeliness, and accuracy.
- Data standardization
- Cleaning and normalizing field values to a canonical form before matching: phone numbers to E.164, addresses to USPS standard format, names to title case.
- Data stewardship
- The human discipline of curating, approving, and correcting master data — the role that owns "is the customer record actually right?"
- Deterministic matching
- Matching based on exact equality of one or more shared identifiers — same SSN, same email, same tax ID.
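Deterministic matching usually depends on standardization having run first. The sketch below lowercases and trims an identifier, then groups records whose identifier is exactly equal. The field name `email` and the helper names are assumptions for this example.

```python
from collections import defaultdict

def normalize(value):
    """Very light standardization: trim whitespace and lowercase."""
    return value.strip().lower()

def deterministic_matches(records, key="email"):
    """Return groups of record ids that share an identical identifier."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(key)
        if value:  # a missing identifier never matches anything
            groups[normalize(value)].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

records = [
    {"id": 1, "email": " Ana@Example.com"},
    {"id": 2, "email": "ana@example.com"},
    {"id": 3, "email": None},
    {"id": 4, "email": "bo@example.com"},
]

matches = deterministic_matches(records)
print(matches)  # [[1, 2]]
```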
E
- Entity resolution (ER)
- The process of identifying which records across one or more datasets refer to the same real-world entity.
F
- F1 score
- The harmonic mean of precision and recall — the standard single-number quality metric for entity resolution engines.
- Fuzzy matching
- Comparing two values that are similar but not exactly equal — typos, capitalization, word order, missing punctuation.
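To make the F1 score entry concrete, the sketch below computes pairwise precision, recall, and F1 from a set of predicted duplicate pairs and a set of known true pairs. Evaluating at the pair level is one common convention, assumed here; the function name is hypothetical.

```python
def pairwise_scores(predicted, actual):
    """Return (precision, recall, f1) over unordered record-id pairs."""
    predicted = {frozenset(p) for p in predicted}
    actual = {frozenset(p) for p in actual}
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

predicted = [(1, 2), (3, 4), (5, 6), (7, 8)]
actual = [(2, 1), (3, 4), (5, 6), (9, 10), (11, 12), (13, 14)]

precision, recall, f1 = pairwise_scores(predicted, actual)
print(precision, recall, round(f1, 3))  # 0.75 0.5 0.6
```

Because F1 is a harmonic mean, it sits closer to the weaker of the two numbers: 0.6 here, not the arithmetic average of 0.625.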
G
- Golden record
- A synthesized canonical record produced by merging the contributing source rows for one real-world entity.
I
- Incremental matching
- Re-running entity resolution on a source that grew by a small amount without re-scoring the entire dataset.
J
- Jaro-Winkler similarity
- A string similarity score (0-1) that favors matches with identical prefixes — well-suited to person and company names.
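A compact reference sketch of Jaro-Winkler, written from the standard definition with the usual defaults (prefix scale 0.1, prefix capped at 4 characters). Production systems would normally use a tested library; the function names here are only for illustration.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the shared prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```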
L
- Levenshtein distance
- A string distance metric that counts the minimum number of single-character edits (insert, delete, substitute) needed to transform one string into another. Lower means more similar; 0 means identical.
- Lineage (data lineage)
- The chain that connects every value on a golden record back to its source rows, sources, and the pipeline event that produced the current state.
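For the Levenshtein distance entry above, a minimal dynamic-programming sketch that keeps only two rows of the edit table in memory. The function name is an assumption; libraries with optimized implementations exist and are usually preferable at scale.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Schmidt", "Schmitt"))  # 1
```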
M
- Master data management (MDM)
- The discipline of producing one trusted version of each business entity (customer, vendor, product) from scattered sources.
P
- Phonetic matching
- Matching strings that sound alike but are spelled differently, such as "Cathryn" vs "Katherine" or "Schmidt" vs "Schmitt".
- PPRL (Privacy-Preserving Record Linkage)
- Cross-organization entity resolution where neither side has to share raw records with the other or with a third party.
- Precision and recall (in matching)
- Two complementary metrics: precision is the fraction of your merges that are correct; recall is the fraction of real duplicates that you actually merged.
- Probabilistic matching
- Matching that assigns a numeric similarity score to each candidate pair, instead of a binary yes/no.
- Provenance
- A near-synonym of lineage, with extra emphasis on *who* and *when* — provenance is the audit chain of human + system actions, not just data flow.
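One common way to realize the probabilistic matching entry above is a Fellegi-Sunter style weight: each field adds evidence for or against a match, and two thresholds split pairs into match, review, and non-match. The m/u probabilities, the field names, and the thresholds below are made-up values for illustration only; real systems estimate them from data and typically use fuzzy comparators in place of exact equality.

```python
import math

# Hypothetical per-field probabilities:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELD_PROBS = {
    "name": (0.95, 0.01),
    "email": (0.90, 0.001),
    "zip": (0.85, 0.10),
}

def match_weight(rec_a, rec_b):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def classify(weight, lower=0.0, upper=10.0):
    """Two thresholds create the middle band that goes to human review."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"

a = {"name": "ana diaz", "email": "ana@example.com", "zip": "94103"}
b = {"name": "ana diaz", "email": "adiaz@example.com", "zip": "94103"}
c = {"name": "bo li", "email": "bo@example.com", "zip": "10115"}

print(classify(match_weight(a, a)))  # match
print(classify(match_weight(a, b)))  # review
print(classify(match_weight(a, c)))  # non-match
```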
R
- Record linkage
- A near-synonym of entity resolution, particularly in academic and healthcare contexts.
- Reference data
- The small, slow-changing lookup datasets that classify master data — country codes, currency codes, industry classifications, product categories.
S
- Schema inference
- Automatic proposal of a target-schema mapping for a new source — "this column looks like an email, this one looks like a date of birth."
- Schema mapping
- Translating each source's column names into the canonical target schema: "cust_name", "customer_name", and "FullName" all become "name."
- Survival-rule bias
- The systematic effect of choosing one survivorship rule over another — e.g., "most recent" biases the golden record toward whichever source updates most often.
- Survivorship
- The rule set that decides which source value wins for each field on the golden record.
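To illustrate survivorship, the sketch below applies a single "most recent non-null value wins" rule per field. The row shape (`source`, `updated`, plus data fields) and the function names are assumptions for this example; real rule sets typically mix several strategies, such as source priority or most frequent value.

```python
from datetime import date

def most_recent_non_null(rows, field):
    """Pick the field value from the most recently updated row that has one."""
    candidates = [r for r in rows if r.get(field) not in (None, "")]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["updated"])[field]

def build_golden_record(rows, fields):
    """Apply the same survivorship rule to every field."""
    return {f: most_recent_non_null(rows, f) for f in fields}

rows = [
    {"source": "crm", "updated": date(2024, 3, 1),
     "name": "Ana Diaz", "phone": "+14155550100", "email": None},
    {"source": "billing", "updated": date(2024, 6, 15),
     "name": "A. Diaz", "phone": None, "email": "ana@example.com"},
]

golden = build_golden_record(rows, ["name", "phone", "email"])
print(golden)
# {'name': 'A. Diaz', 'phone': '+14155550100', 'email': 'ana@example.com'}
```

The abbreviated name from billing wins only because that source was updated later, which is a small instance of survival-rule bias.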