Dirty CSV to Golden Records
Profile, clean, match, and produce golden records from raw CSV data.
This guide takes a messy CSV file and walks through the full pipeline: profiling the data, mapping its schema, cleaning it up, deduplicating records, and reviewing the results.
1. Profile Your Data
Start by understanding what you're working with. GoldenCheck scans every column for completeness, format consistency, and anomalies.
import goldencheck
report = goldencheck.profile("customers_raw.csv")
report.summary()
Look at the output for null rates, duplicate counts, and format violations. These tell you where to focus your cleaning effort.
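To make the first two metrics concrete, here is a minimal standard-library sketch that computes per-column null rates and counts exact duplicate rows. It is not how GoldenCheck works internally; the `profile_rows` helper and the sample data are hypothetical and only illustrate what those numbers mean.

```python
import csv
import io
from collections import Counter


def profile_rows(rows):
    """Return per-column null rates plus the count of exact duplicate rows.

    Hypothetical helper for illustration; not part of GoldenCheck.
    """
    rows = list(rows)
    total = len(rows)
    columns = list(rows[0].keys()) if rows else []
    null_rates = {}
    for col in columns:
        # Treat missing, empty, and whitespace-only values as nulls.
        nulls = sum(1 for r in rows if not (r[col] or "").strip())
        null_rates[col] = nulls / total
    # A row counts as a duplicate when an identical row appeared earlier.
    counts = Counter(tuple(r[c] for c in columns) for r in rows)
    duplicate_rows = sum(n - 1 for n in counts.values())
    return {"rows": total, "null_rates": null_rates, "duplicate_rows": duplicate_rows}


# Made-up sample data standing in for a raw customer file.
SAMPLE = """name,email,phone
Ada Lovelace,ada@example.com,555-0100
Ada Lovelace,ada@example.com,555-0100
Grace Hopper,,555-0101
Alan Turing,alan@example.com,
"""

stats = profile_rows(csv.DictReader(io.StringIO(SAMPLE)))
print(stats)
```

In this sample, a quarter of the email and phone values are empty and one row is an exact repeat, which is the kind of signal that tells you which columns need attention first.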
2. Map the Schema
Use infermap to automatically map your CSV columns to a canonical schema. This handles misnamed columns, different casing, and abbreviations.
import infermap
mapping = infermap.map_schema(
    source="customers_raw.csv",
    target_schema="canonical_customer"
)
mapping.show()
Review the suggested mappings and adjust any that look wrong before proceeding.
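For a sense of what column mapping involves, the sketch below normalizes header casing and punctuation, checks a small alias table, and falls back to fuzzy matching with `difflib`. This is an assumption about the general technique, not a description of infermap; `CANONICAL_FIELDS`, `ALIASES`, and `suggest_mapping` are hypothetical names.

```python
import difflib
import re

# Hypothetical canonical schema and alias table, for illustration only.
CANONICAL_FIELDS = ["name", "email", "address", "phone"]
ALIASES = {"tel": "phone", "e_mail": "email", "addr": "address", "full_name": "name"}


def normalize(header):
    """Lowercase a header and collapse punctuation and spaces to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")


def suggest_mapping(headers, cutoff=0.6):
    """Map each source header to a canonical field, or None if nothing fits."""
    mapping = {}
    for header in headers:
        key = normalize(header)
        if key in CANONICAL_FIELDS:
            mapping[header] = key  # exact match after normalization
        elif key in ALIASES:
            mapping[header] = ALIASES[key]  # known abbreviation
        else:
            # Fall back to fuzzy matching to catch misspellings.
            close = difflib.get_close_matches(key, CANONICAL_FIELDS, n=1, cutoff=cutoff)
            mapping[header] = close[0] if close else None
    return mapping


suggested = suggest_mapping(["Full Name", "E-Mail", "Adress", "Tel.", "Loyalty Tier"])
for source, target in suggested.items():
    print(f"{source!r} -> {target!r}")
```

Headers that map to `None` are the ones to look at by hand; a fuzzy cutoff that is too low will happily map unrelated columns, which is why reviewing suggestions matters.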
3. Clean the Data
Run GoldenPipe to standardize formats, fill gaps where possible, and apply the schema mapping from the previous step.
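To illustrate what rules with names like `standardize_phones` and `trim_whitespace` might do to a single record, here is a standard-library sketch. The functions are hypothetical and are not GoldenPipe's implementations; the phone format assumes 10-digit North American numbers, and address normalization is left out because it depends heavily on locale.

```python
import re


def trim_whitespace(value):
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    return re.sub(r"\s+", " ", value).strip()


def standardize_phone(value):
    """Keep digits only; format 10-digit numbers as NNN-NNN-NNNN, else return the digits."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code (assumes North American numbers)
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return digits


def clean_record(record):
    """Apply the rules to every field and count how many fields changed."""
    cleaned = {}
    fixes = 0
    for field, value in record.items():
        new_value = trim_whitespace(value)
        if field == "phone":
            new_value = standardize_phone(new_value)
        if new_value != value:
            fixes += 1
        cleaned[field] = new_value
    return cleaned, fixes


# Made-up record for illustration.
raw = {"name": "  Ada   Lovelace ", "email": "ada@example.com", "phone": "(555) 010-0199"}
cleaned_record, fixes = clean_record(raw)
print(cleaned_record, fixes)
```

The GoldenPipe call for this step follows.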
import goldenpipe
cleaned = goldenpipe.transform(
    source="customers_raw.csv",
    mapping=mapping,
    rules=["standardize_phones", "normalize_addresses", "trim_whitespace"]
)
print(f"Cleaned {cleaned.record_count} records, {cleaned.fixes_applied} fixes applied")
4. Deduplicate
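With clean, consistently formatted data, GoldenMatch can find duplicates accurately. The call below reads customers_cleaned.csv, so it assumes the cleaned output from step 3 has been saved under that name.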
import goldenmatch
result = goldenmatch.dedupe(
    "customers_cleaned.csv",
    threshold=0.90,
    fields=["name", "email", "address", "phone"]
)
print(f"Found {result.cluster_count} clusters from {result.record_count} records")
Tip: Start with a higher threshold (0.90+) on your first run. You can always lower it to catch more fuzzy matches, but starting high helps you verify the obvious duplicates first and build confidence in the results before broadening the net.
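The effect of the threshold can be seen with a small standard-library sketch. It scores name pairs with `difflib.SequenceMatcher`, which is an assumption made for illustration; GoldenMatch's actual scoring is not described here, and `similarity` and `matching_pairs` are hypothetical helpers.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_pairs(names, threshold):
    """Return every pair of names whose similarity meets the threshold."""
    return [(a, b) for a, b in combinations(names, 2) if similarity(a, b) >= threshold]


# Made-up names for illustration.
names = ["Jon Smith", "John Smith", "Jon Smyth", "Mary Jones"]

strict = matching_pairs(names, 0.90)
loose = matching_pairs(names, 0.80)

print("threshold 0.90:", strict)
print("threshold 0.80:", loose)
```

At 0.90 only the closest pair ("Jon Smith" and "John Smith") passes; dropping to 0.80 also pulls in the "Smyth" variants. Each step down widens the net, so it also raises the odds of a false match that needs manual review.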
5. Review the Results
Use the interactive playground to inspect match clusters, compare field values side-by-side, and accept or reject proposed merges.
In the playground, you can upload a CSV, tweak match thresholds and field weights, and watch clusters form in real time.
Once you've reviewed the clusters, export the final golden records with one record per entity.
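How a cluster collapses into one record depends on the merge rules you choose. As one possible approach, the sketch below picks the most common non-empty value for each field and breaks ties by preferring the longest value. The `build_golden_record` helper and this survivorship rule are assumptions for illustration, not the playground's documented behavior.

```python
from collections import Counter


def build_golden_record(cluster):
    """Merge one cluster of duplicate records into a single record.

    Hypothetical rule: the most common non-empty value wins;
    ties go to the longest value, then to the one seen first.
    """
    golden = {}
    fields = {field for record in cluster for field in record}
    for field in sorted(fields):
        values = [r[field] for r in cluster if r.get(field)]
        if not values:
            golden[field] = ""  # no record had a value for this field
            continue
        counts = Counter(values)
        top = max(counts.values())
        candidates = [v for v in counts if counts[v] == top]
        golden[field] = max(candidates, key=len)
    return golden


# Made-up cluster of three records judged to be the same person.
cluster = [
    {"name": "Jon Smith", "email": "jon@example.com", "phone": ""},
    {"name": "John Smith", "email": "jon@example.com", "phone": "555-010-0199"},
    {"name": "John Smith", "email": "", "phone": "555-010-0199"},
]

golden = build_golden_record(cluster)
print(golden)
```

Majority voting is simple but not always right; other common choices are to prefer the most recently updated record or a trusted source system.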