MDM & Entity Resolution Glossary

30 terms. Short definitions — depth lives in /docs/concepts and /docs/guides.

A

Ambiguous merge: A cluster of records that the matching engine flagged as "probably the same entity, but confidence is in the middle band — please review."
Audit log: An append-only record of every consequential action taken on master data — rule changes, manual merges, approved splits, exports.

Blocking: The first stage of entity resolution — narrowing the candidate-pair space so we only compare records that share some easy-to-compute signal.
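
As a rough illustration of blocking, the sketch below groups records by a cheap key and only emits pairs from within each group. The key (first letter of surname plus postal code), the field names, and the sample records are assumptions made for this example, not the engine's actual configuration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: surname initial plus postal code."""
    return (record["surname"][:1].upper(), record["zip"])


def candidate_pairs(records):
    """Return the set of id pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])

    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative records only.
records = [
    {"id": 1, "surname": "Schmidt", "zip": "10115"},
    {"id": 2, "surname": "Schmitt", "zip": "10115"},
    {"id": 3, "surname": "Meyer", "zip": "10115"},
    {"id": 4, "surname": "Smith", "zip": "94105"},
]
pairs = candidate_pairs(records)
print(pairs)
```

With four records there are six possible pairs; this key leaves a single candidate pair, (1, 2), for the more expensive scoring stage. A key that is too strict would drop true duplicates before they are ever compared, which is why blocking keys are usually chosen to be forgiving.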

CLK (Cryptographic Linkage Key): A Bloom-filter encoding of a record's identifying fields, used to perform record linkage without exchanging raw data. The standard primitive behind PPRL.
Clustering (in deduplication): The final entity-resolution stage where high-confidence pairs are grouped into connected components — one cluster per real-world entity.
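
The clustering entry above describes grouping matched pairs into connected components. One common way to compute them is a union-find structure, sketched here under the assumption that matches arrive as pairs of record ids; the function name and the sample ids are illustrative.

```python
def cluster(record_ids, matched_pairs):
    """Group record ids into connected components using union-find."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(members) for members in clusters.values())


# Records 1-2 and 2-3 matched, so 1, 2, 3 end up in one cluster.
clusters = cluster([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
print(clusters)
```

Note that records 1 and 3 land in the same cluster without ever having matched directly. That transitivity is what makes connected components simple, and also why a single bad pair can glue two unrelated entities together.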

Data profiling: Generating a statistical summary of a dataset — value distributions, null rates, cardinality, format patterns — before doing anything with it.
Data quality: How fit a dataset is for its intended use — measured along completeness, validity, consistency, uniqueness, timeliness, and accuracy.
Data standardization: Cleaning and normalizing field values to a canonical form before matching, for example phone numbers to E.164, addresses to USPS format, and names to title case.
Data stewardship: The human discipline of curating, approving, and correcting master data — the role that owns "is the customer record actually right?"
Deterministic matching: Matching based on exact equality of one or more shared identifiers (same SSN, same email, same tax ID). Fast and explainable; brittle when identifiers are missing.
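
A minimal sketch of deterministic matching, assuming email is the shared identifier and that lowercasing and trimming is enough normalization. The field names and records are made up for the example.

```python
from collections import defaultdict


def normalize_email(value):
    """Light normalization so trivially different spellings compare equal."""
    return value.strip().lower()


def deterministic_matches(records, key="email"):
    """Return groups of record ids that share the exact same identifier."""
    groups = defaultdict(list)
    for rec in records:
        value = rec.get(key)
        if value:  # a missing identifier can never match anything
            groups[normalize_email(value)].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Illustrative records only.
records = [
    {"id": 1, "email": "Ann@Example.com "},
    {"id": 2, "email": "ann@example.com"},
    {"id": 3, "email": None},
    {"id": 4, "email": "bob@example.com"},
]
matches = deterministic_matches(records)
print(matches)
```

Record 3 shows the brittleness mentioned in the entry: with no email it cannot be matched at all, even if every other field agrees with record 1.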

Entity resolution (ER): The process of identifying which records across one or more datasets refer to the same real-world entity. Foundational to dedup, MDM, and customer-360 work.

F1 score: The harmonic mean of precision and recall. The standard single-number quality metric for entity resolution engines, tracked per Suite version on a benchmark fixture.
Fuzzy matching: Comparing two values that are similar but not exactly equal: typos, capitalization, word order, missing punctuation. The workhorse of probabilistic matching.
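
To make the F1 score entry above concrete, the sketch below computes pairwise precision, recall, and F1 from a set of predicted merges and a set of true duplicate pairs. Scoring at the pair level is an assumption of this example; the benchmark fixture may score differently (for instance, per cluster).

```python
def pair_metrics(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) over unordered record-id pairs."""
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}

    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: 3 merges proposed, 4 real duplicate pairs, 2 in common.
predicted = [(1, 2), (2, 3), (4, 5)]
truth = [(1, 2), (5, 4), (6, 7), (8, 9)]
precision, recall, f1 = pair_metrics(predicted, truth)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here precision is 2/3, recall is 2/4, and F1 is 4/7 (about 0.571). Because it is a harmonic mean, F1 sits closer to the weaker of the two numbers than an arithmetic average would.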

Golden record: A synthesized canonical record produced by merging the contributing source rows for one real-world entity, with per-field survivorship and full source lineage.

Incremental matching: Re-running entity resolution on only the new records when a source grows by a small amount, without re-scoring the entire dataset. Cheaper than a full resolve, but it risks missing cases where new records should reshape existing clusters.

Jaro-Winkler similarity: A string similarity score (0 to 1) that favors matches with identical prefixes. Well-suited to person and company names where typos rarely occur at the start.
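
For readers who want to see where the prefix bonus comes from, here is a plain-Python sketch of the textbook Jaro-Winkler formula. It is not the engine's implementation; the prefix scale of 0.1 and the four-character prefix cap are the commonly used defaults, and some variants apply the bonus only above a Jaro threshold, which this sketch omits.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

For "MARTHA" and "MARHTA" the Jaro score is about 0.944; the shared three-character prefix lifts the Jaro-Winkler score to about 0.961.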

Levenshtein distance: A string distance metric that counts the minimum number of single-character edits (insert, delete, substitute) needed to transform one string into another. Lower values mean more similar strings.
Lineage (data lineage): The chain that connects every value on a golden record back to its source rows, sources, and the pipeline event that produced the current state.
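
The Levenshtein distance entry above can be illustrated with the standard dynamic-programming recurrence. This is a minimal sketch that keeps only two rows of the table; the normalized similarity helper is an assumption added for illustration, since matching pipelines often want a score between 0 and 1 rather than a raw edit count.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Hypothetical helper: map the distance onto a 0-to-1 similarity."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(levenshtein("Schmidt", "Schmitt"))
```

"kitten" to "sitting" takes three edits, and "Schmidt" to "Schmitt" takes one.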

Master data management (MDM): The discipline of producing one trusted version of each business entity (customer, vendor, product) from scattered sources.

Phonetic matching: Matching strings that sound alike but spell differently: "Cathryn" vs "Katherine", "Schmidt" vs "Schmitt". Often used as a blocking signal, not a final scorer.
PPRL (Privacy-Preserving Record Linkage): Cross-organization entity resolution where neither side has to share raw records with the other or with a third party. Standard practice in healthcare and finance.
Precision and recall (in matching): Two complementary metrics: precision measures how many of your merges are correct; recall measures how many real duplicates you actually merged.
Probabilistic matching: Matching that assigns a numeric similarity score (0 to 1) to each candidate pair, instead of a binary yes/no. The modern default for record linkage.
Provenance: A near-synonym of lineage, with extra emphasis on *who* and *when* — provenance is the audit chain of human + system actions, not just data flow.

Record linkage: A near-synonym of entity resolution, particularly in academic and healthcare contexts. Same underlying problem and techniques, different community vocabulary.
Reference data: The small, slow-changing lookup datasets that classify master data — country codes, currency codes, industry classifications, product categories.

Schema inference: Automatic proposal of a target-schema mapping for a new source — "this column looks like an email, this one looks like a date of birth."
Schema mapping: Translating each source's column names into the canonical target schema — "cust_name" + "customer_name" + "FullName" all become "name."
Survival-rule bias: The systematic effect of choosing one survivorship rule over another — e.g., "most recent" biases the golden record toward whichever source updates most often.
Survivorship: The rule set that decides which source value wins for each field on the golden record. Configured per-field, not per-record, and reviewed as data changes.
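
As a rough sketch of per-field survivorship, the example below applies a different rule to each field of a cluster's source rows. The rule names, the rule-to-field assignment, and the row layout (`source`, `updated`, and so on) are all hypothetical and chosen only to show the shape of the idea.

```python
from collections import Counter


def most_recent(candidates):
    """Pick the value from the most recently updated source row."""
    return max(candidates, key=lambda c: c["updated"])["value"]


def most_frequent(candidates):
    """Pick the value that the most source rows agree on."""
    return Counter(c["value"] for c in candidates).most_common(1)[0][0]


# Hypothetical configuration: one rule per field, not per record.
RULES = {"email": most_recent, "name": most_frequent}


def golden_record(rows, rules):
    """Build a golden record by applying each field's rule independently."""
    golden = {}
    for field, rule in rules.items():
        candidates = [
            {"value": row[field], "updated": row["updated"], "source": row["source"]}
            for row in rows
            if row.get(field)  # ignore rows with no value for this field
        ]
        golden[field] = rule(candidates) if candidates else None
    return golden


# Illustrative source rows for one entity; dates are ISO strings so they sort correctly.
rows = [
    {"source": "crm", "updated": "2024-01-10", "name": "Ann Lee", "email": "ann@old.example"},
    {"source": "billing", "updated": "2024-06-02", "name": "Ann Lee", "email": "ann@new.example"},
    {"source": "support", "updated": "2023-11-30", "name": "A. Lee", "email": None},
]
golden = golden_record(rows, RULES)
print(golden)
```

The email comes from the billing row because it is the newest, while the name comes from the value two sources agree on. If billing were simply the source that updates most often, the "most recent" rule would keep favoring it, which is the survival-rule bias described above.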