GoldenMatch
Probabilistic record linkage with configurable field weights and blocking strategies.
GoldenMatch compares records across multiple fields, scores their similarity, and groups matching records into clusters.
Basic Usage
import goldenmatch
result = goldenmatch.dedupe("data.csv", threshold=0.85)
print(result.clusters)
Configuration
Field Weights
Control how much each field contributes to the overall match score:
result = goldenmatch.dedupe(
    "data.csv",
    threshold=0.85,
    weights={"name": 0.4, "address": 0.3, "email": 0.2, "phone": 0.1}
)
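To build intuition for what field weights do, here is a minimal sketch of a weighted match score. It is not GoldenMatch's implementation: the helper names `field_similarity` and `weighted_score` are hypothetical, the per-field similarity uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever comparator the library applies, and normalizing by the total weight is an assumption.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (case-insensitive).

    Assumption: a missing or empty value contributes no evidence (0.0).
    """
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-field similarities.

    Dividing by the total weight keeps the score in [0, 1] even when
    the weights do not sum to exactly 1.0.
    """
    total = sum(weights.values())
    weighted_sum = sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in weights.items()
    )
    return weighted_sum / total
```

With the weights above, a small difference in `name` lowers the score four times as much as the same difference in `phone`, which is why heavily weighted fields dominate whether a pair clears the threshold.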
Blocking Strategy
Blocking reduces the number of comparisons by only comparing records that share a blocking key:
result = goldenmatch.dedupe(
    "data.csv",
    threshold=0.85,
    blocking_keys=["zip_code", "first_letter_last_name"]
)
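The idea behind blocking can be sketched in a few lines of plain Python. This is an illustration only, not GoldenMatch's internals: the function name `candidate_pairs` is hypothetical, each blocking key is assumed to already exist as a column on the record, and the keys are assumed to be applied as independent passes, so a pair becomes a candidate if it shares a value on any one key.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records: list, blocking_keys: list) -> set:
    """Return index pairs (i, j), i < j, of records sharing any blocking key value."""
    pairs = set()
    for key in blocking_keys:
        # Group record indices by their value for this key.
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            value = rec.get(key)
            if value:  # records with a missing or empty key join no block
                blocks[value].append(idx)
        # Only records inside the same block are compared with each other.
        for members in blocks.values():
            pairs.update(combinations(members, 2))
    return pairs
```

Without blocking, n records require n(n-1)/2 comparisons. With blocking, only pairs inside the same block are scored, at the cost of missing true matches whose blocking values differ (for example, a mistyped zip code).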
Interactive Playground
Tune thresholds and field weights visually: upload a CSV, adjust the match threshold and field weights, and watch clusters form in real time.
Output Format
| Field | Type | Description |
|---|---|---|
| cluster_id | string | Unique cluster identifier |
| record_id | string | Original record identifier |
| score | float | Match confidence (0.0 to 1.0) |
| cluster_size | int | Number of records in the cluster |
Note: Records whose match scores all fall below the threshold are assigned to singleton clusters (cluster_size of 1).
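To show how thresholded scores can turn into clusters, including singletons, here is a minimal union-find sketch. It does not describe GoldenMatch's actual clustering: the function name `cluster_records` is hypothetical, a score equal to the threshold is assumed to count as a match, and matches are assumed to be transitive (connected components).

```python
def cluster_records(n_records: int, scored_pairs: list, threshold: float) -> list:
    """Group record indices into clusters from (i, j, score) triples.

    Pairs scoring at or above the threshold are merged; records that are
    never merged end up alone in a singleton cluster.
    """
    parent = list(range(n_records))

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for idx in range(n_records):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())
```

For example, with four records, a pair (0, 1) scoring 0.9, a pair (1, 2) scoring 0.5, and a threshold of 0.85, records 0 and 1 form one cluster while records 2 and 3 each form a singleton.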