2026-05-23/Ben Severn

Pipe Your SaaS Data to Your Warehouse: A Funnel That Doesn't Own It

Most MDM tools want to be your source of truth: you query their store. bensevern.dev inverts that. It's a matching funnel that sits between your SaaS sources and your warehouse and hands the data back. Here's what that looks like end-to-end.

python mdm data-pipeline entity-resolution warehouse etl

If you've ever sat in a Reltio sales call, you've heard some version of the pitch: "we'll hold your customer record, and your downstream systems will sync from us." That's the model. They sit in the middle, they own the master record, you trust their lineage.

It works at Fortune-500 scale. It also costs $300k+/yr, takes 6-18 months to implement, and leaves your customer data inside a vendor's database. For the 99% of companies that aren't a Fortune 500 — RevOps teams reconciling HubSpot + Salesforce + Stripe, event-ops teams merging Cvent + Bizzabo attendees, PE platforms integrating post-acquisition CRM files — that pitch is overkill and the lock-in is real.

bensevern.dev inverts the shape: instead of being your customer-data source of truth, it's a funnel between your SaaS sources and the warehouse you already own. We collect, match, surface ambiguous merges in a review queue, and push the resulting golden records back. The warehouse stays the source of truth; we're the filter. If we go away tomorrow, the last good push is still in your warehouse table.

This post walks through the full five-stage funnel end-to-end.

The funnel, in five stages

  ┌─────────┐    ┌──────────┐    ┌───────────┐    ┌────────┐    ┌──────────┐
  │ Sources │ -> │Autoconfig│ -> │   Match   │ -> │ Review │ -> │   Push   │
  │  (in)   │    │          │    │ + cluster │    │ queue  │    │  (out)   │
  └─────────┘    └──────────┘    └───────────┘    └────────┘    └──────────┘

Each stage is a separate concern with a separate interface. You can stop at any stage and inspect what happened, replay it, or fork it. The matching engine (goldenmatch) is MIT-licensed on PyPI, so if the SaaS goes sideways you still have the engine.

Stage 1 — Ingest from wherever your data lives

Sources supports 22 connectors today: HubSpot, Salesforce, Stripe, Pipedrive, Klaviyo, Shopify, Intercom, Zendesk, Airtable, Mailchimp, and the rest of the modern SaaS surface, plus SQL (Postgres, MySQL, Snowflake, BigQuery), cloud files (S3, GCS, Azure Blob, SFTP), Google Sheets, and OAuth-backed sources (Microsoft Graph, Google Contacts, Salesforce).

Each source is a row in a golden_sources table with an entity_type — typically person or account. Sources sharing an entity_type get pooled, so HubSpot + Salesforce contacts deduplicate against each other automatically.
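To make the pooling rule concrete, here's a minimal sketch in plain Python. The row shape mirrors the fields named above; the `salesforce_contacts` and `stripe_customers` values and the `pool_by_entity_type` helper are hypothetical, not part of the product's API.

```python
from collections import defaultdict

# Hypothetical rows shaped like the golden_sources table described above.
sources = [
    {"name": "HubSpot CRM", "source_type": "hubspot_contacts", "entity_type": "person"},
    {"name": "Salesforce", "source_type": "salesforce_contacts", "entity_type": "person"},
    {"name": "Stripe", "source_type": "stripe_customers", "entity_type": "account"},
]


def pool_by_entity_type(rows):
    """Group source names by entity_type; each pool is deduplicated together."""
    pools = defaultdict(list)
    for row in rows:
        pools[row["entity_type"]].append(row["name"])
    return dict(pools)


pools = pool_by_entity_type(sources)
print(pools)
```

Both CRM sources land in the `person` pool, so their contacts are matched against each other; the Stripe source sits alone in `account`.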

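# API sketch (most users do this via the /golden/sources UI).
# $BASE_URL is a placeholder for your deployment; add whatever auth header it requires.
$ curl -X POST "$BASE_URL/api/golden/sources" \
    -H "Content-Type: application/json" \
    -d '{
      "name": "HubSpot CRM",
      "source_type": "hubspot_contacts",
      "entity_type": "person"
    }'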

CDC happens via a cursor_column (e.g. updated_at) — each ingest only pulls rows updated since the last successful run.
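Here's a rough sketch of what cursor-based incremental ingest means in practice. It works on an in-memory list rather than a real connector, and `pull_incremental` is a hypothetical helper written for this post, not the service's code.

```python
from datetime import datetime, timezone


def pull_incremental(rows, cursor_column, last_cursor):
    """Return rows changed since last_cursor, plus the cursor to store next."""
    fresh = [r for r in rows if last_cursor is None or r[cursor_column] > last_cursor]
    # If nothing changed, keep the old cursor instead of resetting it.
    new_cursor = max((r[cursor_column] for r in fresh), default=last_cursor)
    return fresh, new_cursor


def ts(day):
    return datetime(2026, 5, day, tzinfo=timezone.utc)


rows = [
    {"id": 1, "updated_at": ts(1)},
    {"id": 2, "updated_at": ts(10)},
    {"id": 3, "updated_at": ts(20)},
]

# First run: no stored cursor, so everything is pulled.
first_batch, cursor = pull_incremental(rows, "updated_at", None)

# A row changes upstream; the second run only sees that row.
rows.append({"id": 4, "updated_at": ts(22)})
second_batch, cursor = pull_incremental(rows, "updated_at", cursor)
print([r["id"] for r in second_batch])
```

The important design point is that the stored cursor only advances after a successful run, so a failed ingest is simply retried from the same position.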

Stage 2 — Autoconfig: no DSL to learn first

The biggest difference between us and the older MDM tools is multi-wave autoconfig in goldenmatch 1.18. On the first dispatch against a new source, the engine inspects your schema and proposes:

Which columns to block on (high-cardinality identifiers — email, phone, federal_tax_id, etc.)
Per-field scorers (name → token Jaccard, email → exact, address → token-sort, phone → normalized E.164)
A clustering threshold tuned to the data shape
Survivorship rules — which source wins on each field
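If the scorer names are unfamiliar, here's a minimal standard-library sketch of three of them. These are illustrative stand-ins, not goldenmatch's implementations: the token-sort scorer uses `difflib.SequenceMatcher` as the string comparison, and the case-insensitive `exact` is an assumption about normalization.

```python
from difflib import SequenceMatcher


def tokens(value):
    return value.lower().split()


def token_jaccard(a, b):
    """Shared tokens divided by all distinct tokens (0.0 to 1.0)."""
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def exact(a, b):
    """1.0 when the values match after trimming and lowercasing, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def token_sort_ratio(a, b):
    """Sort tokens first so word order does not matter, then compare strings."""
    sa, sb = " ".join(sorted(tokens(a))), " ".join(sorted(tokens(b)))
    return SequenceMatcher(None, sa, sb).ratio()


print(token_jaccard("Ann Marie Lee", "Ann Lee"))
print(exact("Ann@Example.com ", "ann@example.com"))
print(token_sort_ratio("12 Main St", "Main St 12"))
```

Token Jaccard tolerates a missing middle name, exact is all-or-nothing (which is what you want for identifiers), and token-sort forgives reordered address parts.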

You don't need to write a config file before your first useful run. The autoconfig is right for most cases; you override what you don't like.

import goldenmatch as gm
import pandas as pd

df = pd.read_csv("contacts.csv")  # hypothetical export of your pooled contacts
result = gm.dedupe_df(df)  # no config: autoconfig handles it
print(f"{result.cluster_count} clusters from {result.record_count} records")
print(f"Ambiguous (needs review): {len(result.ambiguous_merges)}")

For a typical 100k-row CRM contact pool, the first autoconfig pass takes ~30 seconds and produces workable match quality without any hand-tuning. You iterate from there.

Stage 3 — Match + cluster

goldenmatch runs the full pipeline:

Block — group records that share a high-confidence key
Score — for each candidate pair, run the configured scorers
Cluster — Union-Find on pairs above the threshold
Survive — pick the surviving value per field per cluster

The output is a set of golden records, one per cluster, with full lineage back to every source record that contributed.
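The four steps above can be sketched in a few dozen lines of plain Python. This is a toy version under stated assumptions, not goldenmatch internals: it blocks on exact email, scores names with token Jaccard, and uses a made-up threshold (`0.6`) and source priority order.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": "hs_1", "source": "hubspot", "email": "ann@example.com", "name": "Ann Lee", "phone": ""},
    {"id": "sf_7", "source": "salesforce", "email": "ann@example.com", "name": "Ann M Lee", "phone": "+15550100"},
    {"id": "sf_9", "source": "salesforce", "email": "bob@example.com", "name": "Bob Ray", "phone": ""},
]


def name_score(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Block: only records sharing an email are ever compared.
blocks = defaultdict(list)
for rec in records:
    blocks[rec["email"]].append(rec)

# Score: keep candidate pairs at or above the (hypothetical) threshold.
THRESHOLD = 0.6
pairs = []
for members in blocks.values():
    for a, b in combinations(members, 2):
        if name_score(a["name"], b["name"]) >= THRESHOLD:
            pairs.append((a["id"], b["id"]))

# Cluster: Union-Find over the surviving pairs.
parent = {rec["id"]: rec["id"] for rec in records}


def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x


def union(a, b):
    parent[find(a)] = find(b)


for a, b in pairs:
    union(a, b)

clusters = defaultdict(list)
for rec in records:
    clusters[find(rec["id"])].append(rec)

# Survive: first non-empty value per field, in a hypothetical priority order.
PRIORITY = {"salesforce": 0, "hubspot": 1}


def survive(members):
    ordered = sorted(members, key=lambda r: PRIORITY[r["source"]])
    golden = {f: next((r[f] for r in ordered if r[f]), "") for f in ("email", "name", "phone")}
    golden["lineage"] = sorted(r["id"] for r in members)  # every contributing record
    return golden


golden_records = [survive(m) for m in clusters.values()]
for g in golden_records:
    print(g)
```

Blocking is what keeps this tractable at scale: pairs are only generated inside a block, so the quadratic comparison cost applies to small groups rather than the whole pool.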

A typical postflight report:

{
  "clusters": 18420,
  "input_records": 24130,
  "auto_merged": 17891,
  "ambiguous_merges": [
    {
      "cluster_id": "cl_4f8a",
      "confidence": 0.71,
      "members": [...],
      "demoted_scorers": ["address_token_sort"]
    },
    ...
  ]
}

auto_merged clusters are the ones the engine is confident about — they go straight to the golden table. ambiguous_merges are the ones it isn't — those land in the review queue.
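As a mental model, the split looks something like the sketch below. The single `confidence` cutoff, its value (`0.85`), and the `cl_0001` cluster are assumptions made for illustration; the engine's real routing logic may weigh more than one number.

```python
AUTO_MERGE_THRESHOLD = 0.85  # hypothetical cutoff, not a documented default


def route(clusters, threshold=AUTO_MERGE_THRESHOLD):
    """Split clusters into (auto_merged, review_queue) by confidence."""
    auto_merged, review_queue = [], []
    for cluster in clusters:
        target = auto_merged if cluster["confidence"] >= threshold else review_queue
        target.append(cluster)
    return auto_merged, review_queue


clusters = [
    {"cluster_id": "cl_0001", "confidence": 0.97},  # made-up confident cluster
    {"cluster_id": "cl_4f8a", "confidence": 0.71},  # the ambiguous one from the report
]
auto_merged, review_queue = route(clusters)
print([c["cluster_id"] for c in review_queue])
```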

Stage 4 — Review queue: a human decides the ambiguous ones

The Review queue shows every cluster the matcher punted on, with the full member list and the reason it was demoted (which scorer fired below threshold, what fields disagreed). A steward — usually a non-engineer, sales-ops or customer-success type — approves, splits, or merges.

Each decision is recorded in the audit trail and propagates to the next destination push. Roadmap: decisions will additionally feed back into the scorer's field-rules layer so future runs auto-merge similar clusters without re-deciding (issue #135). To be honest about what's shipped: right now, decisions stop at the audit log and the next export.

Stage 5 — Push to where your data already lives

This is where we differ most sharply from the older MDM tools. Destinations supports 7 outbound targets in v1:

Family       Targets
Warehouses   Postgres · MySQL · Snowflake · BigQuery
Cloud file   S3 · GCS · Azure Blob (CSV or Parquet, inferred from URL suffix)
Browser      One-shot CSV · Excel · Parquet download
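For the cloud-file targets, "inferred from URL suffix" could work roughly like this. It's a sketch with a hypothetical `infer_format` helper and example URLs; the real rules (compressed files, uppercase suffixes, defaults) may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

FORMATS = {".csv": "csv", ".parquet": "parquet"}


def infer_format(url):
    """Pick the output format from the object key's file suffix."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"cannot infer format from suffix {suffix!r}")
    return FORMATS[suffix]


print(infer_format("s3://my-bucket/golden/people.parquet"))
print(infer_format("gs://my-bucket/golden/people.csv"))
```

Failing loudly on an unknown suffix is a deliberate choice in this sketch: silently defaulting to CSV would hide a typo in the destination URL.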

Pick overwrite (TRUNCATE then bulk load) or append. Schedule it for every 5m / 15m / 1h / 6h / 1d via the per-destination schedule panel, or trigger manually. Every push lands in the audit_log chain and creates a destination_runs history row.
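The difference between the two modes is easy to see with an in-memory SQLite table. This is a sketch only: the `golden_people` table and `push` helper are hypothetical, and SQLite has no TRUNCATE, so `DELETE FROM` stands in for it here.

```python
import sqlite3


def push(conn, rows, mode):
    """Write golden rows in 'overwrite' or 'append' mode inside one transaction."""
    if mode not in ("overwrite", "append"):
        raise ValueError(f"unknown mode: {mode}")
    with conn:  # commits on success, rolls back if anything raises
        if mode == "overwrite":
            conn.execute("DELETE FROM golden_people")  # stand-in for TRUNCATE
        conn.executemany("INSERT INTO golden_people (email, name) VALUES (?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE golden_people (email TEXT, name TEXT)")

push(conn, [("ann@example.com", "Ann Lee")], "append")
push(conn, [("bob@example.com", "Bob Ray")], "append")
after_append = conn.execute("SELECT COUNT(*) FROM golden_people").fetchone()[0]

push(conn, [("cy@example.com", "Cy Poe")], "overwrite")
after_overwrite = conn.execute("SELECT COUNT(*) FROM golden_people").fetchone()[0]

print(after_append, after_overwrite)
```

Wrapping the clear-and-load in one transaction means a failed overwrite leaves the previous contents in place rather than an empty table.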

The push goes to your Postgres / Snowflake / BigQuery / S3 — not ours. We never become a copy of your source-of-truth store. The encrypted connection string + credentials live on the destination row; only the runtime decrypts during the actual push, and even the API responses mask the password (postgres://user:***@host).
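Masking a connection string before it leaves the server can be done with the standard library alone. A minimal sketch, assuming URL-style connection strings; `mask_password` is a hypothetical helper, and the example host is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_password(dsn):
    """Replace the password in a URL-style connection string with ***."""
    parts = urlsplit(dsn)
    if parts.password is None:
        return dsn  # nothing to hide
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(mask_password("postgres://user:s3cret@db.example.com:5432/golden"))
```

Parsing the URL instead of using a regex avoids surprises when the password itself contains characters like `:`. Note that `hostname` comes back lowercased and without IPv6 brackets, so a production version would need to handle those cases.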

// from the typed client at lib/destinations.ts
import { runDestination } from '@/lib/destinations'

const result = await runDestination(token, warehouseDestId)
console.log(`wrote ${result.rows_written} rows to ${result.target}`)

Why "we don't host your data" is enforceable, not just marketing

We back up the claim with per-org retention policies (issue #151). An admin sets retention_days per org via POST /api/admin/retention/{org_id}; the scheduler purges golden_store.raw_records older than that threshold once per 24h. Entities + lineage + audit log stay — those are derived state, useful for trend analysis after the raw PII expires.

Opt-in, not opt-out — existing orgs see no behavior change. Customers who specifically want the raw-data-doesn't-linger guarantee turn it on; everyone else can keep the data as long as they want.
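Here's a small sketch of what a per-org purge with opt-in semantics could look like, using an in-memory SQLite table. The `ingested_at` column and the `purge_raw_records` helper are assumptions for illustration; only `retention_days` and the raw-records table come from the description above.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_raw_records(conn, org_id, retention_days, now):
    """Delete an org's raw records older than retention_days; None keeps everything."""
    if retention_days is None:  # opt-in: no policy set, nothing is purged
        return 0
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    with conn:
        cur = conn.execute(
            "DELETE FROM raw_records WHERE org_id = ? AND ingested_at < ?",
            (org_id, cutoff),
        )
    return cur.rowcount


now = datetime(2026, 5, 23, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_records (org_id TEXT, ingested_at TEXT)")
conn.executemany(
    "INSERT INTO raw_records VALUES (?, ?)",
    [
        ("org_a", (now - timedelta(days=90)).isoformat()),
        ("org_a", (now - timedelta(days=5)).isoformat()),
        ("org_b", (now - timedelta(days=90)).isoformat()),
    ],
)

purged = purge_raw_records(conn, "org_a", 30, now)      # org_a opted in at 30 days
untouched = purge_raw_records(conn, "org_b", None, now)  # org_b never set a policy
remaining = conn.execute("SELECT COUNT(*) FROM raw_records").fetchone()[0]
print(purged, untouched, remaining)
```

Only the raw table is touched; derived tables such as entities, lineage, and the audit log are deliberately left out of the delete.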

What this isn't

To be clear about scope:

Not a Fortune-500 MDM replacement. If you need 30-year-old mainframe connectors, deep ESB integration, and a SOC2 Type 2 report today, Reltio and Informatica are still the answer. We're SOC2-aligned with attestation in progress; mainframe connectors are out of scope.
Not a homegrown-pipeline killer. If your matching is one CSV cleaned monthly by one engineer, build the pipeline yourself. Don't over-engineer.
Not real-time streaming match. Batch only. The scheduler can drain every 5m, which is close-to-real-time for most use cases, but it's not Kafka-stream-shaped.

But for the cases in between — RevOps teams with two CRMs, event ops with three attendee systems, PE platforms with five acquired CRMs to merge — the funnel is the right shape. Start free on 3 sources + 2 destinations, or read the getting-started docs.

Repo

github.com/benseverndev-oss/golden-showcase — open the issues tab to see what's shipped this month and what's queued.
