Migrating from open-source dedupe to Golden Suite

For teams who've outgrown dedupe.io, Splink, or a homegrown Python pipeline — what stays the same, what gets easier.

If you're running an open-source dedup pipeline (dedupe.io, Splink, recordlinkage, or a homegrown pandas + rapidfuzz script), you've already proven the technical case for entity resolution. What you're missing is the operational surface: stewardship UI, audit trail, scheduled re-runs that survive infra changes, lineage for the audit team, etc.

Read our open-source comparison for the honest "where they win" — open-source tooling is genuinely excellent at the algorithm layer. Golden Suite's value-add is everything around it.

This is the easiest migration in this guide series, because the underlying engine is open-source goldenmatch — the MIT-licensed matcher you'd be using directly otherwise. Migrating means keeping the engine and adding the operational layer on top.

What carries over

  • Your matching logic, pretty much directly. If you've tuned blocking keys and scorer weights in Splink, the same intuitions translate to goldenmatch's config: goldenmatch is built on the same primitives (blocking, fuzzy scorers, cluster threshold). See the sketch after this list.
  • Your training pairs (if any). Dedupe.io and Splink both produce labeled-pair artifacts. As with the Tamr migration, use those as your F1 regression suite.
  • Your data plane. No re-extract needed — connect the same upstreams.
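
To make the "same primitives" point concrete, here is a minimal sketch of blocking, a weighted fuzzy scorer, and a match threshold, written with pandas + rapidfuzz (the same stack as a typical homegrown script) rather than goldenmatch. It illustrates the shared intuition only; it is not goldenmatch's config schema, and every field name in it is a placeholder.

    # Illustration only: the shared primitives (blocking, fuzzy scoring, threshold),
    # written with pandas + rapidfuzz. Not goldenmatch's real API or config schema.
    import itertools
    import pandas as pd
    from rapidfuzz import fuzz

    records = pd.DataFrame([
        {"id": 1, "name": "Acme Corp",  "zip": "94107"},
        {"id": 2, "name": "ACME Corp.", "zip": "94107"},
        {"id": 3, "name": "Globex Inc", "zip": "10001"},
    ])

    THRESHOLD = 0.85            # cluster/match threshold
    WEIGHTS = {"name": 1.0}     # the scorer weights you have already tuned

    matches = []
    for _, block in records.groupby("zip"):                  # blocking key: zip
        for a, b in itertools.combinations(block.to_dict("records"), 2):
            name_sim = fuzz.token_sort_ratio(a["name"].lower(), b["name"].lower()) / 100
            score = WEIGHTS["name"] * name_sim                # weighted fuzzy scorer
            if score >= THRESHOLD:                            # threshold decides the pair
                matches.append((a["id"], b["id"], round(score, 3)))

    # The two Acme rows clear the threshold; Globex sits alone in its block and
    # is never compared. The tuning intuitions carry over as-is.
    print(matches)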

What gets easier

  • No more "where did this match come from?" investigations. Lineage UI walks any golden record back to source rows with full per-scorer breakdown.
  • No more "the cron failed silently three weeks ago". Arq worker + Railway monitoring + /admin/health + nightly F1 cron all surface failures immediately.
  • No more "the auditor wants the audit log". audit_log table is cryptographically chained, exportable, append-only.
  • No more "this pair is ambiguous, let me email Sarah". Review queues land ambiguous merges in a UI any non-engineer can decide on.

What changes

  • You're now running a hosted SaaS instead of python pipeline.py from cron. The trade-off: less infra to babysit, more dependency on us shipping reliably (see our observability page for what we publish).
  • Your scorer-weight tuning moves into the workbench UI. You can still drop into Python for one-off scripts (pip install goldenmatch), but day-to-day the workbench is where the work happens.
  • Pricing. From $0 (open-source) to $0 (Free tier, 3 sources / 1 concurrent job) or $99/mo (Pro) for the platform. The engine itself stays free.

Migration sequence

A realistic timeline: 1–2 weeks. This is the fastest migration in the guide series because there's no impedance mismatch with the underlying engine.

Week 1 — Lift-and-shift

  1. Sign up at bensevern.dev. Create a project, add your 1–3 most-trafficked sources.
  2. Run dedup. Golden Suite uses goldenmatch under the hood — same primitives as Splink/dedupe.io/recordlinkage.
  3. Compare output to your existing pipeline. Pick 100 records and diff the clusters (a sketch of the diff follows this list). Differences usually trace to:
    • Different default scorer weights (tune in the autoconfig editor)
    • Different default blocking rules (tune in the same editor)
    • Your pipeline had post-processing logic (deduplication-of-deduplication, alias-table joining, etc.) that needs to move into a notebook event in Golden Suite
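
One way to run that 100-record diff is to expand each side's cluster assignment into within-cluster pairs and compare the sets. The file and column names below are placeholders for whatever your pipeline and the Golden Suite export actually produce.

    # Sketch of the cluster diff: sample 100 record ids, expand each side's
    # clusters into within-cluster pairs, and look at pairs only one side found.
    # File and column names (record_id, cluster_id) are placeholders.
    import itertools
    import pandas as pd

    old = pd.read_csv("old_pipeline_clusters.csv")    # record_id, cluster_id
    new = pd.read_csv("golden_suite_export.csv")      # record_id, cluster_id

    sample = set(old["record_id"].sample(100, random_state=0))

    def cluster_pairs(df, ids):
        sub = df[df["record_id"].isin(ids)]
        pairs = set()
        for _, grp in sub.groupby("cluster_id"):
            pairs |= set(itertools.combinations(sorted(grp["record_id"]), 2))
        return pairs

    only_old = cluster_pairs(old, sample) - cluster_pairs(new, sample)
    only_new = cluster_pairs(new, sample) - cluster_pairs(old, sample)
    print(f"pairs only in the old pipeline: {len(only_old)}")
    print(f"pairs only in Golden Suite:     {len(only_new)}")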

Week 2 — Operational layer

  1. Migrate your scheduled re-run from cron-and-pray to Arq + the workbench. Or just trigger it via the API if you have an external scheduler (see the sketch after this list).
  2. Wire downstream consumers to the export endpoint instead of reading from your homegrown output CSV.
  3. Set up the audit + lineage features. New muscle to develop; spend a week using them on real records so you know where to look when an auditor asks.
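
If you keep an external scheduler, the shape of the integration is: trigger a run, then pull the export for downstream consumers. A rough sketch follows; the base URL, endpoint paths, and auth header are placeholders, so check the API docs for the real ones.

    # Sketch of an external-scheduler integration. The base URL, endpoint paths,
    # and auth header are placeholders, not the documented API.
    import requests

    BASE = "https://bensevern.dev/api"                 # placeholder
    HEADERS = {"Authorization": "Bearer <API_KEY>"}    # placeholder

    # 1. kick off a dedup run from Airflow / cron / whatever you already use
    run = requests.post(f"{BASE}/projects/<project_id>/runs", headers=HEADERS)
    run.raise_for_status()

    # 2. once the run completes, pull golden records for downstream consumers
    export = requests.get(f"{BASE}/projects/<project_id>/export", headers=HEADERS)
    export.raise_for_status()
    with open("golden_records.csv", "wb") as fh:
        fh.write(export.content)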

That's it. There's no "Week 3+" — the migration is fundamentally about adding operational surface, not replacing the engine.

Common pitfalls

  • Don't migrate "to test" — migrate one source first, run parallel, get conviction. Migrating everything at once means no rollback.
  • Don't expect bit-for-bit parity with your old pipeline's output. Even if your old pipeline used goldenmatch directly, the configuration will differ slightly (different defaults, different blocking choices). Compute F1 on your labels (see the sketch after this list); if it's within 0.02 you're fine.
  • Don't drop your Python pipeline yet. Keep it running on the same data for 2–3 weeks as a sanity check.
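
Here is a minimal sketch of that F1 check, assuming your labels reduce to (id_a, id_b, is_match) tuples (both dedupe.io and Splink labeled pairs do) and your predicted matches to a set of id pairs.

    # Minimal pairwise-F1 regression check. Assumes labels are (id_a, id_b, bool)
    # tuples and predictions are a set of frozenset id pairs; adapt to your export.
    def pairwise_f1(labels, predicted_pairs):
        tp = fp = fn = 0
        for id_a, id_b, is_match in labels:
            predicted = frozenset((id_a, id_b)) in predicted_pairs
            if is_match and predicted:
                tp += 1
            elif is_match and not predicted:
                fn += 1
            elif not is_match and predicted:
                fp += 1
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # Run the same labels against both pipelines' match sets; a gap under ~0.02
    # is normal configuration drift rather than a regression.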

When NOT to migrate

  • You're a research/data-science team where the value is in the open-source-ness of the pipeline — you fork it, run experiments, contribute back. Golden Suite is for operationalizing, not researching.
  • Your data plane is fundamentally tied to a notebook environment (Jupyter, Databricks) and the rest of your team doesn't use a workbench.
  • You don't have an operational pain — your cron has been running fine for 3 years and nobody asks for lineage.

For teams whose homegrown pipeline is no longer the best use of an engineer's time — and where the audit / stewardship / scheduling layer would actually unlock value — Golden Suite is the path of least resistance. You keep the engine; you delegate everything else.

Questions? Email ben@bensevern.dev or visit /enterprise.