MDM & Entity Resolution Glossary

30 terms. Short definitions — depth lives in /docs/concepts and /docs/guides.

A

Ambiguous merge
A cluster of records that the matching engine flagged as "probably the same entity, but confidence is in the middle band — please review."
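
The "middle band" idea can be sketched as a pair of thresholds on the match score. The function name and both cutoff values below are illustrative assumptions, not settings taken from any particular engine.

```python
def review_band(score: float, auto_merge: float = 0.90, auto_reject: float = 0.60) -> str:
    """Classify a match score into merge / review / reject.

    The 0.90 and 0.60 cutoffs are hypothetical defaults for illustration.
    """
    if score >= auto_merge:
        return "merge"    # confident enough to merge automatically
    if score < auto_reject:
        return "reject"   # confident enough to keep the records apart
    return "review"       # middle band: route to a data steward


decisions = [review_band(s) for s in (0.95, 0.75, 0.30)]
```
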
Audit log
An append-only record of every consequential action taken on master data — rule changes, manual merges, approved splits, exports.

B

Blocking
The first stage of entity resolution — narrowing the candidate-pair space so we only compare records that share some easy-to-compute signal.
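
A minimal sketch of blocking, assuming a hypothetical key built from the first three letters of the surname plus the postal code. Only records that land in the same block are paired for comparison; the field names are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    # Hypothetical key: surname prefix + postal code.
    return record["surname"][:3].lower() + "|" + record["zip"]


def candidate_pairs(records: list[dict]) -> set[tuple[int, int]]:
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))  # compare only within a block
    return pairs


records = [
    {"surname": "Schmidt", "zip": "10115"},
    {"surname": "Schmitt", "zip": "10115"},
    {"surname": "Garcia", "zip": "94103"},
]
pairs = candidate_pairs(records)  # one candidate pair instead of three
```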

C

CLK (Cryptographic Linkage Key)
A Bloom-filter encoding of a record's identifying fields, used to perform record linkage without exchanging raw data.
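
As a rough illustration of the encoding, the sketch below hashes character bigrams of each field into a Bloom filter with a keyed hash and compares two filters with the Dice coefficient. The filter size, number of hashes, HMAC-SHA256 construction, and function names are assumptions chosen for the example; production schemes differ in the details.

```python
import hashlib
import hmac


def bigrams(value: str) -> set[str]:
    padded = f"_{value.lower()}_"  # pad so first/last letters form bigrams
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def clk_encode(fields: list[str], secret: bytes,
               size: int = 1024, num_hashes: int = 4) -> frozenset[int]:
    """Return the set of bit positions set in the Bloom filter."""
    positions = set()
    for field in fields:
        for gram in bigrams(field):
            for k in range(num_hashes):
                digest = hmac.new(secret, f"{k}:{gram}".encode(),
                                  hashlib.sha256).digest()
                positions.add(int.from_bytes(digest[:8], "big") % size)
    return frozenset(positions)


def dice(a: frozenset[int], b: frozenset[int]) -> float:
    """Dice coefficient between two encoded records."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # hypothetical key agreed between the two parties
smith = clk_encode(["smith"], key)
smyth = clk_encode(["smyth"], key)
garcia = clk_encode(["garcia"], key)
```

Both parties must use the same secret and parameters; only the encoded bit positions are exchanged, never the raw field values.
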
Clustering (in deduplication)
The final entity-resolution stage where high-confidence pairs are grouped into connected components — one cluster per real-world entity.
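
Grouping matched pairs into connected components can be sketched with a small union-find structure. The function name and the integer record ids are assumptions for illustration.

```python
def cluster(pairs: list[tuple[int, int]], n: int) -> list[list[int]]:
    """Group record ids 0..n-1 into connected components of matched pairs."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: dict[int, list[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


groups = cluster([(0, 1), (1, 2), (3, 4)], n=6)
# groups == [[0, 1, 2], [3, 4], [5]]
```

Note that connected components are transitive: records 0 and 2 end up together even though they were never matched directly.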

D

Data profiling
Generating a statistical summary of a dataset — value distributions, null rates, cardinality, format patterns — before doing anything with it.
Data quality
How fit a dataset is for its intended use — measured along completeness, validity, consistency, uniqueness, timeliness, and accuracy.
Data standardization
Cleaning and normalizing field values to a canonical form before matching: phone numbers to E.164, addresses to USPS format, names to title case.
Data stewardship
The human discipline of curating, approving, and correcting master data — the role that owns "is the customer record actually right?"
Deterministic matching
Matching based on exact equality of one or more shared identifiers — same SSN, same email, same tax ID.

E

Entity resolution (ER)
The process of identifying which records across one or more datasets refer to the same real-world entity.

F

F1 score
The harmonic mean of precision and recall — the standard single-number quality metric for entity resolution engines.
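
The harmonic mean works out to F1 = 2PR / (P + R). A minimal helper, with the zero-division convention (return 0.0) being an assumption of this sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


score = f1_score(0.5, 1.0)  # 2 * 0.5 * 1.0 / 1.5, roughly 0.667
```
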
Fuzzy matching
Comparing two values that are similar but not exactly equal — typos, capitalization, word order, missing punctuation.

G

Golden record
A synthesized canonical record produced by merging the contributing source rows for one real-world entity.

I

Incremental matching
Running entity resolution on only the records added to a source since the last run, without re-scoring the entire dataset.

J

Jaro-Winkler similarity
A string similarity score from 0 to 1 that boosts pairs sharing a common prefix, which makes it well suited to person and company names.
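
A plain-Python sketch of the usual formulation: the Jaro score, plus a bonus for up to four shared leading characters. The 0.1 prefix scale and 4-character cap are the commonly cited defaults, stated here as assumptions rather than this product's settings.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unmatched equal character within the window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


similarity = jaro_winkler("MARTHA", "MARHTA")  # about 0.961
```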

L

Levenshtein distance
An edit-distance metric that counts the minimum number of single-character edits (insert, delete, substitute) needed to transform one string into another. Lower values mean more similar strings.
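
The edit count can be computed with the standard dynamic-programming recurrence. This sketch keeps only two rows of the table at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")  # 3
```
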
Lineage (data lineage)
The chain that connects every value on a golden record back to its source rows, sources, and the pipeline event that produced the current state.

M

Master data management (MDM)
The discipline of producing one trusted version of each business entity (customer, vendor, product) from scattered sources.

P

Phonetic matching
Matching strings that sound alike but are spelled differently, such as "Cathryn" vs "Katherine" or "Schmidt" vs "Schmitt".
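
One classic phonetic encoding is American Soundex, sketched below as an example of the technique; it is not necessarily the algorithm any given engine uses. Soundex keeps the first letter, so it matches "Schmidt" with "Schmitt" but would not match names that start with different letters.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so a repeat after a vowel counts
    return (first.upper() + "".join(digits) + "000")[:4]


same_sound = soundex("Schmidt") == soundex("Schmitt")  # both encode to S530
```
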
PPRL (Privacy-Preserving Record Linkage)
Cross-organization entity resolution where neither side has to share raw records with the other or with a third party.
Precision and recall (in matching)
Two complementary metrics: precision measures the fraction of your merges that are correct; recall measures the fraction of real duplicates that you actually merged.
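
One common way to compute both is at the pair level, comparing the set of predicted duplicate pairs against a labeled set of true pairs. The function name and the "return 0.0 on an empty set" convention are assumptions of this sketch.

```python
def pairwise_precision_recall(predicted: set, actual: set) -> tuple[float, float]:
    """Pair-level precision and recall for a set of predicted merges."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


predicted = {(1, 2), (3, 4), (5, 6)}        # pairs the engine merged
actual = {(1, 2), (3, 4), (7, 8), (9, 10)}  # pairs that are truly duplicates
precision, recall = pairwise_precision_recall(predicted, actual)
# precision is 2/3 (one wrong merge), recall is 0.5 (two duplicates missed)
```
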
Probabilistic matching
Matching that assigns a numeric similarity score to each candidate pair, instead of a binary yes/no.
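
A deliberately simple sketch of pair scoring: a weighted average of per-field string similarities, using `difflib.SequenceMatcher` from the standard library. The field names and weights are hypothetical, and real engines often use statistically derived weights instead of hand-picked ones.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; higher means the field counts more.
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def field_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of field similarities, in the range 0 to 1."""
    total = sum(weights.values())
    weighted = sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())
    return weighted / total


a = {"name": "Ana Schmidt", "email": "ana@example.com", "city": "Berlin"}
b = {"name": "Anna Schmitt", "email": "ana@example.com", "city": "Berlin"}
c = {"name": "Luis Garcia", "email": "lg@example.org", "city": "Madrid"}
close_score = match_score(a, b)
far_score = match_score(a, c)
```
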
Provenance
A near-synonym of lineage, with extra emphasis on *who* and *when*: provenance is the audit chain of human and system actions, not just data flow.

R

Record linkage
A near-synonym of entity resolution, particularly in academic and healthcare contexts.
Reference data
The small, slow-changing lookup datasets that classify master data — country codes, currency codes, industry classifications, product categories.

S

Schema inference
Automatic proposal of a target-schema mapping for a new source — "this column looks like an email, this one looks like a date of birth."
Schema mapping
Translating each source's column names into the canonical target schema: "cust_name", "customer_name", and "FullName" all become "name".
Survival-rule bias
The systematic effect of choosing one survivorship rule over another — e.g., "most recent" biases the golden record toward whichever source updates most often.
Survivorship
The rule set that decides which source value wins for each field on the golden record.
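
A minimal sketch of per-field survivorship, assuming two hypothetical rules ("most recent" and "source priority") and made-up source names. Each field on the golden record is filled by whichever rule is assigned to it.

```python
from datetime import date

# Hypothetical ranking: lower number wins.
SOURCE_PRIORITY = {"crm": 0, "billing": 1, "web_form": 2}


def most_recent(candidates: list[dict]):
    return max(candidates, key=lambda c: c["updated"])["value"]


def by_source_priority(candidates: list[dict]):
    return min(candidates, key=lambda c: SOURCE_PRIORITY[c["source"]])["value"]


# Hypothetical rule set: one rule per golden-record field.
RULES = {"email": most_recent, "name": by_source_priority}


def build_golden_record(rows: list[dict]) -> dict:
    golden = {}
    for field, rule in RULES.items():
        candidates = [
            {"value": r[field], "source": r["source"], "updated": r["updated"]}
            for r in rows if r.get(field)  # skip missing or empty values
        ]
        if candidates:
            golden[field] = rule(candidates)
    return golden


rows = [
    {"source": "crm", "updated": date(2023, 1, 5),
     "name": "Ana Schmidt", "email": "ana@old.example"},
    {"source": "web_form", "updated": date(2024, 3, 1),
     "name": "ana schmidt", "email": "ana@new.example"},
]
golden = build_golden_record(rows)
# name comes from the CRM row, email from the more recent web form row
```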