Migrating from open-source dedupe to Golden Suite
For teams who've outgrown dedupe.io, Splink, or a homegrown Python pipeline — what stays the same, what gets easier.
If you're running an open-source dedup pipeline (dedupe.io, Splink, recordlinkage, or a homegrown pandas + rapidfuzz script), you've already proven the technical case for entity resolution. What you're missing is the operational surface: stewardship UI, audit trail, scheduled re-runs that survive infra changes, lineage for the audit team, etc.
Read our open-source comparison for the honest "where they win" — open-source tooling is genuinely excellent at the algorithm layer. Golden Suite's value-add is everything around it.
This is the easiest migration in this guide series, because the underlying engine is open-source goldenmatch — the MIT-licensed matcher you'd be using directly otherwise. Migrating means keeping the engine and adding the operational layer on top.
What carries over
- Your matching logic. Pretty much directly. If you've tuned blocking keys + scorer weights in Splink, the same intuitions translate to goldenmatch's config (and goldenmatch is built on similar primitives — blocking, fuzzy scorers, cluster threshold; a toy sketch follows this list).
- Your training pairs (if any). Dedupe.io and Splink both produce labeled-pair artifacts. As in the Tamr migration, use those as your F1 regression suite.
- Your data plane. No re-extract needed — connect the same upstreams.
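To make "similar primitives" concrete, here's a toy sketch of blocking, weighted fuzzy scoring, and a cluster threshold using rapidfuzz, the library many homegrown pipelines already lean on. This is not goldenmatch's API; the fields, weights, and threshold are invented for illustration.

```python
# Toy illustration of the shared primitives (blocking, fuzzy scorers, cluster
# threshold) using rapidfuzz. NOT goldenmatch's API; field names, weights, and
# the threshold below are made up for illustration.
from collections import defaultdict
from itertools import combinations

from rapidfuzz import fuzz, utils

records = [
    {"id": 1, "name": "Acme Corp",        "city": "Berlin"},
    {"id": 2, "name": "ACME Corporation", "city": "Berlin"},
    {"id": 3, "name": "Umbrella GmbH",    "city": "Berlin"},
]

# Blocking: only compare records that share a blocking key (here: city).
blocks = defaultdict(list)
for rec in records:
    blocks[rec["city"].lower()].append(rec)

# Weighted fuzzy scorers per field, then a threshold on the combined pair score.
THRESHOLD = 0.7
matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        name_sim = fuzz.token_sort_ratio(a["name"], b["name"], processor=utils.default_process) / 100
        city_sim = 1.0 if a["city"].lower() == b["city"].lower() else 0.0
        score = 0.9 * name_sim + 0.1 * city_sim
        if score >= THRESHOLD:
            matches.append((a["id"], b["id"], round(score, 2)))

print(matches)  # records 1 and 2 clear the threshold; record 3 stays separate
```

The knobs are the same ones you'll end up tuning in the autoconfig editor during Week 1: blocking rules, scorer weights, threshold.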
What gets easier
- No more "where did this match come from?" investigations. Lineage UI walks any golden record back to source rows with full per-scorer breakdown.
- No more "the cron failed silently three weeks ago". Arq worker + Railway monitoring +
/admin/health+ nightly F1 cron all surface failures immediately. - No more "the auditor wants the audit log".
audit_logtable is cryptographically chained, exportable, append-only. - No more "this pair is ambiguous, let me email Sarah". Review queues land ambiguous merges in a UI any non-engineer can decide on.
What changes
- You're now running a hosted SaaS instead of python pipeline.py from cron. The trade-off: less infra to babysit, more dependency on us shipping reliably (see our observability page for what we publish).
- Your scorer-weight tuning moves into the workbench UI. You can still drop into Python for one-off scripts (pip install goldenmatch), but day-to-day the workbench is where the work happens.
- Pricing. From $0 (open-source) to $0 (Free tier, 3 sources / 1 concurrent job) or $99/mo (Pro) for the platform. The engine itself stays free.
Migration sequence
A realistic timeline: 1–2 weeks. This is the fastest migration in the guide series because there's no impedance mismatch with the underlying engine.
Week 1 — Lift-and-shift
- Sign up at bensevern.dev. Create a project, add your 1–3 most-trafficked sources.
- Run dedup. Golden Suite uses goldenmatch under the hood — same primitives as Splink/dedupe.io/recordlinkage.
- Compare output to your existing pipeline. Pick 100 records, diff the clusters (a diff sketch follows this list). Differences usually trace to:
- Different default scorer weights (tune in the autoconfig editor)
- Different default blocking rules (tune in the same editor)
- Your pipeline had post-processing logic (deduplication-of-deduplication, alias-table joining, etc.) that needs to move into a notebook event in Golden Suite
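A minimal sketch of that 100-record diff, assuming you can get a {record_id: cluster_id} mapping out of both pipelines; the export mechanics are yours, only the comparison logic is the point.

```python
# Compare cluster assignments from the old pipeline and Golden Suite on a random
# sample of shared records. Assumes both sides give you a {record_id: cluster_id}
# dict; how you export those mappings is up to you.
import random
from collections import defaultdict

def peers(cluster_of):
    """Map each record to the set of other records sharing its cluster."""
    members = defaultdict(set)
    for rec, cluster in cluster_of.items():
        members[cluster].add(rec)
    return {rec: members[cluster] - {rec} for rec, cluster in cluster_of.items()}

def diff_sample(old_clusters, new_clusters, sample_size=100, seed=0):
    """Return sampled records whose co-cluster peer set changed between pipelines."""
    shared = sorted(set(old_clusters) & set(new_clusters))
    sample = random.Random(seed).sample(shared, min(sample_size, len(shared)))
    old_peers, new_peers = peers(old_clusters), peers(new_clusters)
    return [(rec, old_peers[rec], new_peers[rec])
            for rec in sample if old_peers[rec] != new_peers[rec]]

# for rec, before, after in diff_sample(old_clusters, new_clusters):
#     print(f"{rec}: was clustered with {before}, now {after}")
```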
Week 2 — Operational layer
- Migrate your scheduled re-run from cron-and-pray to Arq + the workbench. Or just trigger it via the API if you have an external scheduler (a sketch follows this list).
- Wire downstream consumers to the export endpoint instead of reading from your homegrown output CSV.
- Set up the audit + lineage features. New muscle to develop; spend a week using them on real records so you know where to look when an auditor asks.
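If your re-runs stay on an external scheduler, the integration looks roughly like the sketch below. The endpoint paths, auth header, and response format are hypothetical placeholders, not Golden Suite's documented API; substitute the real routes from the API docs.

```python
# Hypothetical sketch: route names, auth scheme, and payloads are placeholders,
# not Golden Suite's documented API. The point is the shape: your scheduler
# POSTs to start a run, and downstream consumers GET the export instead of
# reading the old output CSV.
import os
import requests

BASE = "https://bensevern.dev/api"  # placeholder base URL
HEADERS = {"Authorization": f"Bearer {os.environ['GOLDEN_SUITE_TOKEN']}"}

# 1. Trigger a dedup run from your existing scheduler (Airflow, cron, etc.).
run = requests.post(f"{BASE}/projects/my-project/runs", headers=HEADERS, timeout=30)
run.raise_for_status()

# 2. Point downstream consumers at the export endpoint instead of the old CSV.
export = requests.get(f"{BASE}/projects/my-project/export", headers=HEADERS, timeout=60)
export.raise_for_status()
golden_records = export.json()
```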
That's it. There's no "Week 3+" — the migration is fundamentally about adding operational surface, not replacing the engine.
Common pitfalls
- Don't migrate "to test" — migrate one source first, run parallel, get conviction. Migrating everything at once means no rollback.
- Don't expect bit-for-bit parity with your old pipeline's output. Even though both use goldenmatch under the hood, the configuration will differ slightly (different defaults, different blocking choices). Compute F1 on your labels; if it's within 0.02 you're fine.
- Don't drop your Python pipeline yet. Keep it running on the same data for 2–3 weeks as a sanity check.
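For the F1 check, a minimal sketch, assuming your labels export as (id_a, id_b, is_match) rows and you can pull a record-to-cluster mapping from the run you're scoring; the column names and CSV layout are illustrative.

```python
# Pairwise F1 over your existing labeled pairs: a predicted match means the two
# records landed in the same cluster. Column names and the CSV layout are
# illustrative; adapt to however dedupe.io / Splink stored your labels.
import csv

def load_labels(path):
    """Yield (id_a, id_b, is_match) triples from a labeled-pairs CSV."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row["id_a"], row["id_b"], row["is_match"] == "1"

def pairwise_f1(labels, cluster_of):
    """Precision, recall, and F1 over labeled pairs for a {record_id: cluster_id} mapping."""
    tp = fp = fn = 0
    for id_a, id_b, is_match in labels:
        same_cluster = cluster_of.get(id_a) is not None and cluster_of.get(id_a) == cluster_of.get(id_b)
        if same_cluster and is_match:
            tp += 1
        elif same_cluster and not is_match:
            fp += 1
        elif not same_cluster and is_match:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Run it against both the old pipeline's clusters and Golden Suite's clusters;
# an F1 gap under 0.02 is within the tolerance suggested above.
# p, r, f1 = pairwise_f1(load_labels("labels.csv"), cluster_of)
```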
When NOT to migrate
- You're a research/data-science team where the value is in the open-source-ness of the pipeline — you fork it, run experiments, contribute back. Golden Suite is for operationalizing, not researching.
- Your data plane is fundamentally tied to a notebook environment (Jupyter, Databricks) and the rest of your team doesn't use a workbench.
- You don't have an operational pain — your cron has been running fine for 3 years and nobody asks for lineage.
For teams whose homegrown pipeline is no longer the best use of an engineer's time — and where the audit / stewardship / scheduling layer would actually unlock value — Golden Suite is the path of least resistance. You keep the engine; you delegate everything else.
Questions? Email ben@bensevern.dev or visit /enterprise.