Clustering (in deduplication)
The final entity-resolution stage where high-confidence pairs are grouped into connected components — one cluster per real-world entity.
Scoring produces pairs ("A and B are 0.92 similar"). Clustering produces groups (`{A, B, C}` all refer to the same customer).
The standard approach is: build a graph where every high-confidence pair is an edge, then find connected components. So if A matches B and B matches C, the cluster `{A, B, C}` forms even if A and C were never directly compared (because they didn't share a blocking signal).
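As a minimal sketch of that approach, the following uses a union-find structure to compute connected components. The function name `cluster_pairs`, the `(a, b, score)` tuple format, and the 0.9 threshold are assumptions made for illustration, not part of any particular tool.

```python
from collections import defaultdict


def cluster_pairs(pairs, threshold=0.9):
    """Group record IDs into connected components.

    pairs: iterable of (id_a, id_b, score) tuples (hypothetical format).
    threshold: illustrative cutoff for a "high-confidence" pair.
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)  # registers both records
        # Only high-confidence pairs become edges in the graph.
        if score >= threshold and root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


pairs = [("A", "B", 0.92), ("B", "C", 0.95), ("C", "D", 0.40)]
print(cluster_pairs(pairs))  # [['A', 'B', 'C'], ['D']]
```

A and C end up in the same cluster although no `("A", "C", ...)` pair was ever scored, and D stays a singleton because its only pair fell below the cutoff.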
Clusters can be:
- Confident: every pair inside scored above the auto-merge threshold. Merged automatically.
- Ambiguous: some pairs scored in the middle band. Flagged in the review queue.
- Rejected: at least one pair scored below the merge threshold, the lower bound of the middle band. The cluster is split.
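The three buckets can be sketched as a small classifier over the scored pairs inside a cluster. Everything here is an assumption for illustration: the function name `classify_cluster`, the `scores` dictionary keyed by ID pairs, the two threshold values (0.9 for auto-merge, 0.7 for the lower edge of the middle band), and the choice to ignore pairs that were never compared.

```python
def classify_cluster(members, scores, auto_merge=0.9, review=0.7):
    """Return 'confident', 'ambiguous', or 'rejected' for one cluster.

    members: iterable of record IDs in the cluster.
    scores: dict mapping (id_a, id_b) -> similarity score (hypothetical format).
    auto_merge, review: illustrative thresholds; the middle band is
    review <= score < auto_merge.
    """
    members = set(members)
    # Only pairs that were actually scored are considered; pairs that were
    # never compared (no shared blocking signal) are simply absent.
    inside = [
        score
        for (a, b), score in scores.items()
        if a in members and b in members
    ]
    if any(score < review for score in inside):
        return "rejected"   # at least one pair below the lower threshold
    if any(score < auto_merge for score in inside):
        return "ambiguous"  # at least one pair in the middle band
    return "confident"      # every scored pair at or above auto-merge


scores = {("A", "B"): 0.95, ("B", "C"): 0.93, ("A", "C"): 0.80}
print(classify_cluster({"A", "B", "C"}, scores))  # ambiguous
```

Under these assumptions a cluster with no scored pairs, such as a singleton, counts as confident. A real pipeline might also decide how to split a rejected cluster, which this sketch leaves out.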
The ambiguous bucket is what makes MDM stewardship a real job.