Threshold tuning for entity resolution
Precision and recall trade against each other through the match threshold. This guide covers how to tune that threshold for your data and what each direction costs.
Every entity-resolution pipeline has at least one threshold. It is the number that separates "merge these records" from "send to review" or "keep separate." Picking it is the single highest-leverage tuning decision in an ER project, and it is almost always done wrong on the first pass.
The precision/recall tradeoff
| Threshold direction | Precision | Recall | What you get |
|---|---|---|---|
| Lower (e.g. 0.70) | Drops | Rises | More merges, more false positives, more wrong-entity contamination |
| Higher (e.g. 0.95) | Rises | Drops | Cleaner golden records, more duplicates surviving in the output |
Precision is the fraction of the merges you make that are correct. Recall is the fraction of the true matches in your data that you actually merge. The two trade against each other: unless matches and non-matches are perfectly separable by score, no single threshold maximizes both.
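One way to pin these definitions down, assuming a pair is merged when its score is at or above the threshold $t$ (that convention is an assumption, not something stated here):

$$
\text{precision}(t) = \frac{TP(t)}{TP(t) + FP(t)}, \qquad \text{recall}(t) = \frac{TP(t)}{TP(t) + FN(t)}
$$

Here $TP(t)$ counts true matches scoring at or above $t$, $FP(t)$ counts non-matches scoring at or above $t$, and $FN(t)$ counts true matches scoring below $t$. The denominator of recall does not depend on $t$, and raising $t$ can only shrink $TP(t)$, so recall never rises as the threshold rises. Precision usually improves as $t$ rises, though on a small sample it can wobble.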
Tuning in practice
Start with the autoconfig threshold. Sample 500 pairs from the middle of the score distribution, where the ambiguous pairs sit, and have a reviewer label each one as a match or a non-match. The reviewer's judgment becomes your ground truth. Calculate precision and recall at three thresholds (0.80, 0.85, 0.90) against that ground truth and pick the one that fits the cost profile of your downstream use case.
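The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a product API: it assumes the reviewed sample is available as `(score, reviewer_says_match)` tuples and that a pair is merged when its score is at or above the threshold. The names `Metrics`, `evaluate`, and `sample` are hypothetical, and the ten pairs are made-up stand-ins for a real 500-pair sample.

```python
from typing import NamedTuple


class Metrics(NamedTuple):
    threshold: float
    precision: float
    recall: float


def evaluate(scored_pairs, threshold):
    """Precision and recall at one threshold.

    scored_pairs: iterable of (score, is_match), where is_match is the
    reviewer's verdict. Assumes a pair is merged when score >= threshold.
    """
    tp = sum(1 for score, is_match in scored_pairs if score >= threshold and is_match)
    fp = sum(1 for score, is_match in scored_pairs if score >= threshold and not is_match)
    fn = sum(1 for score, is_match in scored_pairs if score < threshold and is_match)
    # With nothing merged (or no true matches) the ratio is undefined;
    # report 1.0 by convention rather than dividing by zero.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return Metrics(threshold, precision, recall)


# Illustrative reviewed sample (made-up data).
sample = [
    (0.97, True), (0.93, True), (0.91, True), (0.88, False), (0.86, True),
    (0.84, False), (0.82, True), (0.78, False), (0.75, True), (0.71, False),
]

results = [evaluate(sample, t) for t in (0.80, 0.85, 0.90)]
for r in results:
    print(f"threshold={r.threshold:.2f}  precision={r.precision:.3f}  recall={r.recall:.3f}")
```

Note that recall here is measured only against the matches present in the reviewed sample, so it describes the sampled score band rather than the full dataset.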
Two questions tell you which direction to lean:
- What does a false positive cost? If wrong-entity contamination breaks billing, regulatory filings, or patient safety: lean high. The cost of an extra review is small compared to the cost of a bad merge.
- What does a false negative cost? If duplicate records cause downstream chaos (duplicate outreach, double-billing, broken analytics): lean low. The cost of reviewing an ambiguous merge is small compared to the cost of letting duplicates through.
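If you can put even rough numbers on those two costs, the choice can be made mechanical. The sketch below is one possible approach under stated assumptions: each false positive and false negative is given a flat, hypothetical cost (`cost_fp`, `cost_fn`), pairs at or above the threshold are merged, and the labeled sample is made-up data. The function names are illustrative only.

```python
def expected_cost(scored_pairs, threshold, cost_fp, cost_fn):
    """Total cost of the errors made at one threshold.

    scored_pairs: iterable of (score, is_match) with reviewer labels.
    cost_fp / cost_fn: assumed flat cost per bad merge / per missed match.
    """
    fp = sum(1 for score, is_match in scored_pairs if score >= threshold and not is_match)
    fn = sum(1 for score, is_match in scored_pairs if score < threshold and is_match)
    return fp * cost_fp + fn * cost_fn


def pick_threshold(scored_pairs, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total error cost."""
    return min(candidates, key=lambda t: expected_cost(scored_pairs, t, cost_fp, cost_fn))


# Illustrative reviewed sample (made-up data).
sample = [
    (0.97, True), (0.93, True), (0.91, True), (0.88, False), (0.86, True),
    (0.84, False), (0.82, True), (0.78, False), (0.75, True), (0.71, False),
]
candidates = (0.80, 0.85, 0.90)

# Bad merges are expensive (billing, safety): the choice leans high.
lean_high = pick_threshold(sample, candidates, cost_fp=10, cost_fn=1)

# Surviving duplicates are expensive (double outreach): the choice leans low.
lean_low = pick_threshold(sample, candidates, cost_fp=1, cost_fn=10)

print(lean_high, lean_low)
```

Real costs are rarely flat per error, so treat the result as a starting point for discussion rather than a final answer.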
Most real systems run two thresholds: a high one for auto-merge and a lower one for "send to review queue." Pairs between them surface for human judgment.
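A two-threshold setup amounts to a three-way routing decision per pair. The following is a minimal sketch; the function name `route`, the outcome labels, and the default values 0.92 and 0.80 are all assumptions chosen for illustration, not recommended settings.

```python
def route(score, auto_merge=0.92, review=0.80):
    """Route a scored pair to one of three outcomes.

    auto_merge and review are hypothetical defaults; tune both on your data.
    """
    if review > auto_merge:
        raise ValueError("review threshold must not exceed auto_merge threshold")
    if score >= auto_merge:
        return "merge"      # confident enough to merge without a human
    if score >= review:
        return "review"     # ambiguous band: send to the review queue
    return "separate"       # keep the records apart


for score in (0.95, 0.85, 0.60):
    print(score, route(score))
```

Widening the gap between the two values sends more pairs to reviewers, so the review band is effectively a budget for human attention.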
In Golden Suite
The workbench exposes the threshold as a slider with live precision/recall feedback against your sample. The postflight report shows the score distribution and where each cluster landed against the threshold. Audit history records every threshold change so you can correlate a quality regression with the day someone moved the slider.