GoldenMatch

Probabilistic record linkage with configurable field weights and blocking strategies.

GoldenMatch compares records across multiple fields, scores their similarity, and groups matching records into clusters.

Basic Usage

import goldenmatch

result = goldenmatch.dedupe("data.csv", threshold=0.85)
print(result.clusters)

Configuration

Field Weights

Control how much each field contributes to the overall match score:

result = goldenmatch.dedupe(
    "data.csv",
    threshold=0.85,
    weights={"name": 0.4, "address": 0.3, "email": 0.2, "phone": 0.1}
)
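
To illustrate how field weights might combine into a single match score, here is a minimal sketch that takes a weighted average of per-field similarities. It is not GoldenMatch's internal implementation: the `field_similarity` and `weighted_score` helpers are hypothetical, and `difflib.SequenceMatcher` is only a stand-in similarity measure.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings (stand-in measure, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-field similarities (hypothetical helper)."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("weights must not sum to zero")
    score = sum(
        w * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, w in weights.items()
    )
    # Dividing by the total keeps the result in [0, 1] even if weights do not sum to 1.
    return score / total


weights = {"name": 0.4, "address": 0.3, "email": 0.2, "phone": 0.1}
a = {"name": "Jon Smith", "address": "12 Main St",
     "email": "jon@example.com", "phone": "555-0100"}
b = {"name": "John Smith", "address": "12 Main Street",
     "email": "jon@example.com", "phone": "555-0100"}

print(weighted_score(a, b, weights))
```

With weights like these, a mismatch in name lowers the score four times as much as an equally large mismatch in phone.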

Blocking Strategy

Blocking reduces the number of comparisons by only comparing records that share a blocking key:

result = goldenmatch.dedupe(
    "data.csv",
    threshold=0.85,
    blocking_keys=["zip_code", "first_letter_last_name"]
)
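
The idea behind blocking can be sketched as follows: group records by each blocking key, then generate candidate pairs only within each group. This sketch assumes that two records become candidates if they share any one of the keys; whether GoldenMatch combines keys this way is not stated here, and the `candidate_pairs` helper and the key functions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key_funcs):
    """Yield index pairs (i, j), i < j, of records sharing at least one blocking key."""
    seen = set()
    for key_func in key_funcs:
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            key = key_func(rec)
            if key:  # records with a missing or empty key are not blocked on it
                blocks[key].append(idx)
        for members in blocks.values():
            for pair in combinations(members, 2):
                if pair not in seen:  # avoid emitting a pair twice
                    seen.add(pair)
                    yield pair


records = [
    {"last_name": "Smith", "zip_code": "10001"},
    {"last_name": "Smyth", "zip_code": "10001"},
    {"last_name": "Stone", "zip_code": "94105"},
    {"last_name": "Jones", "zip_code": "60601"},
]
key_funcs = [
    lambda r: r.get("zip_code"),
    lambda r: r.get("last_name", "")[:1].upper(),
]

pairs = sorted(candidate_pairs(records, key_funcs))
print(pairs)  # [(0, 1), (0, 2), (1, 2)]
```

Four records would need six comparisons without blocking; here only three pairs are compared, and the saving grows quickly with dataset size.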

Interactive Playground

Tune thresholds and field weights visually. Upload a CSV, adjust the match threshold and field weights, and watch clusters form in real time.

Output Format

Field         Type    Description
cluster_id    string  Unique cluster identifier
record_id     string  Original record identifier
score         float   Match confidence (0.0 - 1.0)
cluster_size  int     Number of records in the cluster

Note: Records whose match scores all fall below the threshold are assigned to singleton clusters, each with a cluster_size of 1.
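
One common way to turn scored pairs into clusters, including the singleton case, is union-find over the pairs that pass the threshold. The sketch below assumes that the threshold is inclusive and that matches are transitive (if A matches B and B matches C, all three share a cluster). Neither assumption is confirmed for GoldenMatch, and the `cluster` function is hypothetical.

```python
def cluster(n_records, scored_pairs, threshold):
    """Group record indices into clusters with union-find.

    scored_pairs is an iterable of (i, j, score) tuples.
    """
    parent = list(range(n_records))

    def find(x):
        # Follow parent links to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:  # inclusive threshold is an assumption
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for idx in range(n_records):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())


scored = [(0, 1, 0.92), (1, 2, 0.88), (2, 3, 0.40)]
print(cluster(5, scored, threshold=0.85))  # [[0, 1, 2], [3], [4]]
```

Records 3 and 4 have no pair at or above the threshold, so each ends up alone in its own cluster.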