Dirty CSV to Golden Records

Profile, map, clean, and deduplicate a raw CSV file, then review the results to produce golden records, using GoldenCheck, infermap, GoldenPipe, GoldenMatch, and the review queue.

This guide takes a messy CSV file and walks through the full pipeline: profiling the data, mapping its schema, cleaning it up, deduplicating records, and reviewing the results.

1. Profile Your Data

Start by understanding what you're working with. GoldenCheck scans every column for completeness, format consistency, and anomalies.

import goldencheck

# Scan every column of the raw CSV for completeness, format consistency, and anomalies
report = goldencheck.profile("customers_raw.csv")
report.summary()

Look at the output for null rates, duplicate counts, and format violations. These tell you where to focus your cleaning effort.
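To make those metrics concrete, here is a minimal standard-library sketch of how null rates and exact duplicate rows could be computed by hand. It is not GoldenCheck's implementation; the sample data and the profile_rows helper are assumptions made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for customers_raw.csv
SAMPLE = """name,email,phone
Ann Lee,ann@example.com,555-0100
Ann Lee,ann@example.com,555-0100
Bob Ray,,5550101
Cy Dunn,cy@example,
"""


def profile_rows(rows):
    """Return per-column null rates and the number of exact duplicate rows."""
    if not rows:
        return {}, 0
    total = len(rows)
    null_rates = {}
    for column in rows[0].keys():
        # Treat missing, empty, and whitespace-only cells as nulls
        empty = sum(1 for row in rows if not (row[column] or "").strip())
        null_rates[column] = empty / total
    # Every repeat of an already-seen row counts as one duplicate
    counts = Counter(tuple(row.items()) for row in rows)
    duplicates = sum(n - 1 for n in counts.values())
    return null_rates, duplicates


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
null_rates, duplicate_rows = profile_rows(rows)
print(null_rates)
print(duplicate_rows)
```

A column with a high null rate is a candidate for gap filling in the cleaning step, while a nonzero duplicate count confirms that deduplication is worth running.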

2. Map the Schema

Use infermap to automatically map your CSV columns to a canonical schema. This handles misnamed columns, different casing, and abbreviations.

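import infermap

# Map the raw CSV's columns onto the canonical customer schema
mapping = infermap.map_schema(
    source="customers_raw.csv",
    target_schema="canonical_customer"
)
# Display the suggested mappings so they can be reviewed
mapping.show()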

Review the suggested mappings and adjust any that look wrong before proceeding.
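For intuition about what a column mapper has to cope with, the sketch below normalizes header names and matches them against a small alias table using difflib from the standard library. This is an illustrative stand-in, not how infermap works internally; the CANONICAL table, the normalize and suggest_mapping helpers, and the 0.8 cutoff are all assumptions.

```python
import difflib
import re

# Hypothetical canonical fields and known aliases (not infermap's actual schema)
CANONICAL = {
    "name": ["full_name", "customer_name"],
    "email": ["email_address", "e_mail"],
    "phone": ["phone_number", "tel", "telephone"],
    "address": ["addr", "street_address"],
}


def normalize(column):
    """Lowercase a header and collapse punctuation and spaces to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", column.strip().lower()).strip("_")


def suggest_mapping(source_columns, cutoff=0.8):
    """Map each source column to a canonical field, or None if nothing is close."""
    lookup = {}
    for field, aliases in CANONICAL.items():
        lookup[field] = field
        for alias in aliases:
            lookup[alias] = field
    mapping = {}
    for column in source_columns:
        key = normalize(column)
        # Take the single closest known name at or above the similarity cutoff
        match = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
        mapping[column] = lookup[match[0]] if match else None
    return mapping


suggested = suggest_mapping(
    ["Full Name", "E-Mail", "Phone Number", "Addr", "Loyalty Tier"]
)
print(suggested)
```

Columns that come back as None are the ones to inspect manually, which mirrors the review step above: an automatic mapper is a starting point, not a final answer.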

3. Clean the Data

Run GoldenPipe to standardize formats, fill gaps where possible, and apply the schema mapping from the previous step.

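import goldenpipe

# Apply the schema mapping from step 2 together with the listed cleaning rules
cleaned = goldenpipe.transform(
    source="customers_raw.csv",
    mapping=mapping,
    rules=["standardize_phones", "normalize_addresses", "trim_whitespace"]
)
# Report how many records were processed and how many fixes were applied
print(f"Cleaned {cleaned.record_count} records, {cleaned.fixes_applied} fixes applied")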
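The rule names above suggest the kind of normalization involved. As a rough illustration only, here is what whitespace trimming and phone standardization might look like in plain Python. These helpers are hypothetical and are not GoldenPipe's rules; in particular, prefixing 10-digit numbers with a default country code of "1" is an assumption chosen for the example.

```python
import re


def trim_whitespace(value):
    """Strip the ends and collapse internal runs of whitespace to one space."""
    return " ".join(value.split())


def standardize_phone(value, default_country_code="1"):
    """Reduce a phone number to '+' followed by digits only.

    Assumption: a bare 10-digit number is missing its country code.
    """
    digits = re.sub(r"\D", "", value)
    if len(digits) == 10:
        digits = default_country_code + digits
    return "+" + digits if digits else ""


def clean_record(record):
    """Return the cleaned record and the number of fields that changed."""
    cleaned = {key: trim_whitespace(val) for key, val in record.items()}
    if "phone" in cleaned:
        cleaned["phone"] = standardize_phone(cleaned["phone"])
    fixes = sum(1 for key in record if cleaned[key] != record[key])
    return cleaned, fixes


raw = {"name": "  Ann   Lee ", "phone": "(555) 010-0199", "email": " ann@example.com"}
cleaned_record, fixes_applied = clean_record(raw)
print(cleaned_record, fixes_applied)
```

Normalizing values this way before matching means that two spellings of the same phone number compare as equal, rather than relying on the matcher to see through formatting differences.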

4. Deduplicate

With clean, consistently formatted data, GoldenMatch can find duplicates more accurately. The example below assumes the cleaned output from step 3 has been saved as customers_cleaned.csv.

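import goldenmatch

# Compare records on the listed fields; start with a strict threshold
result = goldenmatch.dedupe(
    "customers_cleaned.csv",
    threshold=0.90,
    fields=["name", "email", "address", "phone"]
)
# Report how many clusters were formed from the input records
print(f"Found {result.cluster_count} clusters from {result.record_count} records")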

Tip: Start with a higher threshold (0.90+) on your first run. You can always lower it to catch more fuzzy matches, but starting high helps you verify the obvious duplicates first and build confidence in the results before broadening the net.
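To see why the threshold matters, the sketch below scores name pairs with difflib.SequenceMatcher from the standard library and compares a strict cutoff with a looser one. This is a toy similarity measure, not GoldenMatch's scoring; the sample names and the similar_pairs helper are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample values
names = ["Jon Smith", "John Smith", "Jon Smyth", "Mary Jones"]


def similar_pairs(values, threshold):
    """Return every pair whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(values, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


strict = similar_pairs(names, 0.90)
loose = similar_pairs(names, 0.80)
print("threshold 0.90:", strict)
print("threshold 0.80:", loose)
```

On this sample the strict pass keeps only the closest pair, while the looser pass also links the "Smyth" variant. Every pair found at 0.90 is still found at 0.80, so lowering the threshold only ever adds candidates to review.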

5. Review the Results

Use the interactive playground to inspect match clusters, compare field values side by side, and accept or reject proposed merges.

Once you've reviewed the clusters, export the final golden records with one record per entity.
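As an illustration of what "one record per entity" can mean in practice, the sketch below collapses each reviewed cluster into a single record and writes the result as CSV. The survivorship rule used here (the most common non-empty value per field, with ties going to the value seen first) is an assumption for the example and not necessarily the rule the tools apply; the clusters data and the golden_record helper are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical reviewed clusters: cluster id -> records accepted as one entity
clusters = {
    1: [
        {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
        {"name": "Ann Lee", "email": "", "phone": "+15550100199"},
        {"name": "A. Lee", "email": "ann@example.com", "phone": "+15550100199"},
    ],
    2: [{"name": "Bob Ray", "email": "bob@example.com", "phone": ""}],
}


def golden_record(records):
    """Pick the most common non-empty value for each field in a cluster."""
    golden = {}
    for field in records[0].keys():
        values = [record[field] for record in records if record[field]]
        golden[field] = Counter(values).most_common(1)[0][0] if values else ""
    return golden


golden = [golden_record(records) for records in clusters.values()]

# Write the golden records to an in-memory CSV; swap in a real file to export
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "email", "phone"])
writer.writeheader()
writer.writerows(golden)
print(buffer.getvalue())
```

Note how the first cluster's golden record ends up with both an email and a phone number even though no single source record had every field in its preferred form.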
