Dirty CSV to Golden Records

Profile, clean, match, and produce golden records from raw CSV data.

This guide takes a messy CSV file and walks through the full pipeline: profiling the data, mapping its schema, cleaning it up, deduplicating records, and reviewing the results.

1. Profile Your Data

Start by understanding what you're working with. GoldenCheck scans every column for completeness, format consistency, and anomalies.

import goldencheck
# Profile every column for completeness, format consistency, and anomalies.
report = goldencheck.profile("customers_raw.csv")
report.summary()

Look at the output for null rates, duplicate counts, and format violations. These tell you where to focus your cleaning effort.
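To make those metrics concrete, here is a minimal standard-library sketch of what a null rate and a duplicate count mean. It is an illustration only, not GoldenCheck's implementation: the `profile_rows` helper and the inline sample data are hypothetical stand-ins for `customers_raw.csv`.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for customers_raw.csv.
SAMPLE = """name,email,phone
Ann Lee,ann@example.com,555-0100
Ann Lee,ann@example.com,555-0100
Bob Ray,,5550101
,carl@example.com,
"""

def profile_rows(rows):
    """Return per-column null rates and the number of exact duplicate rows."""
    rows = list(rows)
    total = len(rows)
    columns = list(rows[0].keys()) if rows else []
    # A value counts as null when it is missing or only whitespace.
    null_rates = {
        col: sum(1 for r in rows if not (r[col] or "").strip()) / total
        for col in columns
    }
    # Every repeat of an identical row beyond the first is one duplicate.
    counts = Counter(tuple(r[col] for col in columns) for r in rows)
    duplicate_rows = sum(n - 1 for n in counts.values() if n > 1)
    return null_rates, duplicate_rows

null_rates, duplicate_rows = profile_rows(csv.DictReader(io.StringIO(SAMPLE)))
print(null_rates, duplicate_rows)
```

Exact-row duplicates are only the easy case; near-duplicates with differing spellings or formats are what the later matching step is for.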

2. Map the Schema

Use infermap to automatically map your CSV columns to a canonical schema. This handles misnamed columns, different casing, and abbreviations.

import infermap
# Suggest a mapping from the CSV's columns to the canonical schema.
mapping = infermap.map_schema(
    source="customers_raw.csv",
    target_schema="canonical_customer"
)
mapping.show()

Review the suggested mappings and adjust any that look wrong before proceeding.
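For intuition about how misnamed columns, casing, and abbreviations can be reconciled, the sketch below maps headers to canonical field names with the standard library's `difflib`. It does not reflect how infermap works internally; the `CANONICAL` list, the `ABBREVIATIONS` table, and the `suggest_mapping` helper are all hypothetical.

```python
import difflib
import re

# Hypothetical canonical fields and abbreviation table, for illustration only.
CANONICAL = ["first_name", "last_name", "email", "phone"]
ABBREVIATIONS = {"fname": "first_name", "lname": "last_name", "tel": "phone"}

def normalize(name):
    """Lowercase a header and collapse spaces and punctuation to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")

def suggest_mapping(headers, cutoff=0.6):
    """Map each source header to a canonical field, or None if nothing is close."""
    mapping = {}
    for header in headers:
        key = normalize(header)
        key = ABBREVIATIONS.get(key, key)  # expand known abbreviations first
        match = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=cutoff)
        mapping[header] = match[0] if match else None
    return mapping

suggested = suggest_mapping(["First Name", "LNAME", "E-mail", "Tel", "Notes"])
for source, target in suggested.items():
    print(f"{source!r} -> {target!r}")
```

Headers that map to `None` are the ones most worth checking by hand, which is why the review step matters.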

3. Clean the Data

Run GoldenPipe to standardize formats, fill gaps where possible, and apply the schema mapping from the previous step.

import goldenpipe
# Apply the mapping from step 2 and the listed cleaning rules to the raw file.
cleaned = goldenpipe.transform(
    source="customers_raw.csv",
    mapping=mapping,
    rules=["standardize_phones", "normalize_addresses", "trim_whitespace"]
)
print(f"Cleaned {cleaned.record_count} records, {cleaned.fixes_applied} fixes applied")
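As a rough picture of what rules such as `trim_whitespace` and `standardize_phones` might do to a single record, here is a small standalone sketch. The function names, the 10-digit US-style phone format, and the way fixes are counted are assumptions made for illustration; GoldenPipe's actual rules may behave differently.

```python
import re

def trim_whitespace(value):
    """Collapse internal whitespace runs to one space and strip the ends."""
    return re.sub(r"\s+", " ", value).strip()

def standardize_phone(value):
    """Format 10-digit US-style numbers as XXX-XXX-XXXX; return others unchanged."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return value  # not recognizable, so leave it for manual review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def clean_record(record):
    """Apply both rules to one record and count how many fields changed."""
    cleaned = {key: trim_whitespace(value) for key, value in record.items()}
    if "phone" in cleaned:
        cleaned["phone"] = standardize_phone(cleaned["phone"])
    fixes = sum(1 for key in record if record[key] != cleaned[key])
    return cleaned, fixes

cleaned_record, fixes = clean_record(
    {"name": "  Ann   Lee ", "email": "ann@example.com", "phone": "(555) 010-0199"}
)
print(cleaned_record, fixes)
```

Leaving unrecognized values untouched rather than guessing keeps the cleaning step from silently corrupting data.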

4. Deduplicate

With clean, consistently formatted data, GoldenMatch can find duplicates more accurately. The example below assumes the cleaned output from step 3 has been saved as customers_cleaned.csv.

import goldenmatch
# Match on the listed fields using a 0.90 threshold.
result = goldenmatch.dedupe(
    "customers_cleaned.csv",
    threshold=0.90,
    fields=["name", "email", "address", "phone"]
)
print(f"Found {result.cluster_count} clusters from {result.record_count} records")

Tip: Start with a high threshold (0.90 or above) on your first run. This surfaces the obvious duplicates first, so you can verify them and build confidence in the results. You can then lower the threshold to catch fuzzier matches.
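The effect of the threshold can be seen in a toy example. The sketch below scores name pairs with `difflib.SequenceMatcher` from the standard library, which is only a stand-in: GoldenMatch's scoring is not described here and is likely more sophisticated. The sample names and the `candidate_pairs` helper are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample names, for illustration only.
NAMES = ["Jon Smith", "John Smith", "Jon Smyth", "Mary Jones"]

def candidate_pairs(names, threshold):
    """Return the pairs whose similarity score is at or above the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs

strict = candidate_pairs(NAMES, 0.90)  # obvious matches only
loose = candidate_pairs(NAMES, 0.80)   # also admits fuzzier spellings

print("threshold 0.90:", strict)
print("threshold 0.80:", loose)
```

Every pair found at the stricter threshold also appears at the looser one, so lowering the threshold only adds candidates to review.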

5. Review the Results

Use the interactive playground to inspect match clusters, compare field values side by side, and accept or reject proposed merges. You can upload a CSV, tweak match thresholds and field weights, and watch clusters form in real time.

Once you've reviewed the clusters, export the final golden records with one record per entity.
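To illustrate what "one record per entity" can mean, the sketch below merges a reviewed cluster by keeping the most common non-empty value for each field. This survivorship rule, the `build_golden_record` helper, and the sample cluster are assumptions for illustration; the playground's export may use a different merge strategy.

```python
from collections import Counter

def build_golden_record(cluster, fields):
    """Merge one cluster: per field, keep the most common non-empty value."""
    golden = {}
    for field in fields:
        values = [record.get(field, "").strip() for record in cluster]
        counts = Counter(value for value in values if value)
        # Ties go to the value seen first; an all-empty field stays empty.
        golden[field] = max(counts, key=counts.get) if counts else ""
    return golden

# Hypothetical cluster of three records judged to be the same customer.
cluster = [
    {"name": "Jon Smith", "email": "jon@example.com", "phone": ""},
    {"name": "John Smith", "email": "jon@example.com", "phone": "555-010-0199"},
    {"name": "John Smith", "email": "", "phone": ""},
]

golden = build_golden_record(cluster, ["name", "email", "phone"])
print(golden)
```

Other reasonable rules include preferring the most recently updated record or the longest value; the right choice depends on which sources you trust most.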