2026-05-15 · Ben Severn

The honest math on building your own MDM

A realistic person-month estimate for building an MDM platform in-house: engine, pipeline, workbench, audit, survivorship, connectors. Plus the year-2 maintenance cost nobody plans for.

Tags: mdm, build-vs-buy, engineering-management, data-engineering

Every six months or so, somebody asks me whether they should just build their own MDM platform.

The conversation usually starts something like this: "We got a quote from Reltio for $80k a year. That's two engineers for six months. We have two engineers. Why don't we just build it?"

The math is tempting. The math is also wrong, and it's wrong in ways that are systematically underestimated. This post is the honest version, drawn from having actually built the thing twice (once as a contractor, once as the founder of an MDM SaaS) and watched maybe a dozen other teams try it.

I'll walk through the actual components, give realistic person-month estimates, and then talk about the year-2 cost that nobody plans for. By the end you should be able to do the build-vs-buy math for your own situation without flattering yourself in either direction.

The components you don't think about

If you've never built an MDM platform, you probably think of it as "the matching algorithm." That's the most visible piece, so it's the piece you imagine when you imagine the build.

The matching algorithm is roughly 15% of the work.

Here's what's actually in an MDM platform:

  1. The matching engine. Blocking, scoring, clustering. The visible part.
  2. The data pipeline. Ingestion from N source systems, scheduling, retries, partial loads.
  3. The workbench / UI. Where stewards review ambiguous matches, approve/split/merge, configure rules.
  4. Survivorship. Field-by-field rules for which value wins when records merge.
  5. The audit trail. Tamper-evident logs of every match, merge, correction, override.
  6. Lineage. Tracing a golden record back to the source rows that contributed.
  7. Connector library. Salesforce, HubSpot, Stripe, Postgres, Snowflake, BigQuery, S3, the dozen-plus systems your customer data lives in.
  8. Multi-tenancy / scoping. Permissions, org structure, environments.
  9. Quality monitoring. Drift detection, F1 tracking over time, alerting when match quality regresses.
  10. Deployment + ops. Containers, IaC, secrets, monitoring, on-call.

Each of those is a system. Some are bigger than others. None are free.
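To make one of these concrete: survivorship (item 4) is, at its core, a per-field rule engine. Here's a toy sketch with three illustrative strategies; the field names, strategy names, and `_updated_at` / `_source` metadata keys are all made up for the example:

```python
def survive(records, rules):
    """Pick the winning value per field when records merge.

    `rules` maps field -> strategy. The strategies here are illustrative:
    'most_recent' trusts the freshest source row, 'longest' is a crude
    completeness proxy, and ('source_priority', [...]) trusts a ranked
    list of source systems.
    """
    golden = {}
    for field, rule in rules.items():
        # Only consider records that actually have a value for this field.
        candidates = [r for r in records if r.get(field)]
        if not candidates:
            continue
        if rule == "most_recent":
            golden[field] = max(candidates, key=lambda r: r["_updated_at"])[field]
        elif rule == "longest":
            golden[field] = max(candidates, key=lambda r: len(str(r[field])))[field]
        elif rule[0] == "source_priority":
            ranked = sorted(candidates, key=lambda r: rule[1].index(r["_source"]))
            golden[field] = ranked[0][field]
    return golden
```

The real work is not this function; it's the UI that lets a steward edit these rules per field, preview the effect on live clusters, and audit why a given value won.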

Realistic person-month estimates

I'll give ranges. The low end assumes a senior engineer with prior experience in the relevant domain. The high end assumes a competent generalist who has to learn as they go. Most teams land in the middle.

| Component | Person-months (range) |
| --- | --- |
| Matching engine | 3-6 |
| Data pipeline | 2-4 |
| Workbench UI | 4-8 |
| Survivorship | 1-2 |
| Audit trail | 2-3 |
| Lineage | 1-2 |
| Connector library (per source × 8 sources) | 8-16 |
| Multi-tenancy / scoping | 1-2 |
| Quality monitoring | 1-2 |
| Deployment + ops | 1-2 |
| Total (v1, minimum viable) | 24-47 person-months |

Let me defend each number briefly, because these always provoke pushback from teams who've never tried.

Matching engine, 3-6 months. The naive version of blocking + scoring + Union-Find is a one-week exercise. The version that actually performs well on real data — with adaptive blocking, scorer ensembles, threshold tuning, postflight reporting on ambiguous merges — is months. The matching engine is where I see the most "I'll just build the algorithm in an afternoon" optimism. The afternoon version produces an F1 around 0.5. The version that lands at 0.8+ on a real benchmark takes months.
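To make the gap concrete, here's roughly what the one-week version looks like: naive blocking, a toy exact-match scorer, and union-find clustering. All field names, the blocking key, and the threshold are illustrative; this is the F1-around-0.5 sketch, not the months-long version:

```python
from collections import defaultdict

class UnionFind:
    """Minimal union-find for clustering matched record pairs."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def block_key(record):
    # Naive blocking: first 3 chars of surname + zip prefix.
    # Real systems learn or tune blocking keys; this one misses typos.
    return (record["last"][:3].lower(), record["zip"][:3])

def score(a, b):
    # Toy scorer: fraction of fields that agree exactly.
    # No fuzzy comparison, no field weights, no trained model.
    fields = ["first", "last", "email"]
    return sum(a[f].lower() == b[f].lower() for f in fields) / len(fields)

def cluster(records, threshold=0.66):
    """Block, score all pairs within each block, union pairs above threshold."""
    uf = UnionFind()
    blocks = defaultdict(list)
    for i, r in enumerate(records):
        blocks[block_key(r)].append(i)
    for ids in blocks.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                if score(records[ids[i]], records[ids[j]]) >= threshold:
                    uf.union(ids[i], ids[j])
    return {i: uf.find(i) for i in range(len(records))}
```

Everything hard — adaptive blocking, fuzzy scorers, threshold tuning against a labeled benchmark — lives in the gap between this sketch and production.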

Pipeline, 2-4 months. Source-by-source ingestion is fine for week 1. Then someone asks "can we incrementally load instead of full-refresh?" and you spend a month on watermark logic. Then someone asks "what happens when an upstream API rate-limits us?" and you spend three weeks on retry and backoff. Then someone asks "can we replay yesterday's pipeline?" and you spend a month on event-sourcing your pipeline state.
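The watermark logic alone is subtler than it looks. A minimal sketch, assuming a hypothetical `source_rows(since=...)` client and a SQLite staging table (both stand-ins for your real source system and warehouse):

```python
import sqlite3

def incremental_load(conn, source_rows, watermark_store, source="crm"):
    """Pull only rows updated since the last successful run.

    `source_rows(since=...)` is a hypothetical callable standing in for a
    source-system client; it yields (row_id, payload, updated_at) tuples
    with ISO-8601 timestamps.
    """
    last = watermark_store.get(source, "1970-01-01T00:00:00+00:00")
    high = last
    for row_id, payload, updated_at in source_rows(since=last):
        conn.execute(
            "INSERT OR REPLACE INTO staging (id, payload, updated_at) "
            "VALUES (?, ?, ?)",
            (row_id, payload, updated_at),
        )
        high = max(high, updated_at)
    conn.commit()
    # Advance the watermark only after the batch commits, so a crash
    # mid-load replays the same window instead of silently skipping rows.
    watermark_store[source] = high
```

Even this sketch dodges the hard parts: late-arriving rows with old timestamps, clock skew between sources, and replaying a historical window, which is where the month goes.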

Workbench UI, 4-8 months. This is the biggest single under-estimate. A "good enough" workbench has: a review queue, a cluster-detail view with side-by-side field comparison, a survivorship-rule editor that non-engineers can use, an audit-log viewer with filtering, a connector-management UI, a permissions/scoping UI. Each of those is a 2-4 week piece on its own. Six months is a realistic midpoint even with a competent front-end engineer.

Connector library, 8-16 months total. I keep meeting teams who say "we'll just write one connector at a time as we need them." That's true, and it doesn't change the math: a production connector (auth refresh, pagination, rate limits, schema drift) takes 1-2 person-months, and you'll write 8 of them in year 1 because every new department wants their source ingested. Two months × 8 sources = 16 person-months at the high end. The hosted MDM vendors all ship with 30-100 connectors out of the box because they've been writing them for a decade.
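A connector is more than an API call. Here's a hypothetical minimal contract, plus a trivial CSV implementation; everything the real per-source weeks go into is named in the docstring rather than implemented:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class Connector(ABC):
    """Hypothetical minimal connector contract.

    Real connectors also need auth refresh, rate-limit handling,
    pagination, error classification, and schema discovery — which is
    where the 1-2 person-months per source actually go.
    """

    @abstractmethod
    def extract(self, since: str) -> Iterator[dict]:
        """Yield raw records updated after the `since` watermark."""

    @abstractmethod
    def normalize(self, raw: dict) -> dict:
        """Map source fields onto the canonical customer schema."""

class CsvConnector(Connector):
    def __init__(self, path):
        self.path = path

    def extract(self, since):
        # `since` is ignored in this toy: flat files have no change feed,
        # so a real CSV connector diffs against the previous load instead.
        import csv
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f)

    def normalize(self, raw):
        return {"first": raw.get("First Name", ""),
                "last": raw.get("Last Name", ""),
                "email": raw.get("Email", "").lower()}
```

The interface is the easy part; each concrete subclass for Salesforce, Stripe, or Snowflake is its own project.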

Audit + lineage, 3-5 months combined. If you're in any regulated environment, "an audit chain" doesn't mean "a log table." It means tamper-evident logs with cryptographic hashing, per-row provenance walkable from golden record back to source, immutable storage, compliance-ready export. Your auditors will ask. You will spend 3-5 months getting this right.
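"Tamper-evident" has a concrete minimal form: a hash chain, where each entry commits to the hash of the entry before it. A sketch of the core idea — not a substitute for immutable storage, key management, or compliance-ready export:

```python
import hashlib
import json

GENESIS = "0" * 64  # sentinel "previous hash" for the first entry

def append_event(log, event):
    """Append an event to a hash-chained audit log.

    Each entry commits to the previous entry's hash, so rewriting any
    historical event invalidates every hash after it.
    """
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = json.dumps(event, sort_keys=True)  # deterministic serialization
    entry_hash = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    log.append({"prev": prev_hash, "event": event, "hash": entry_hash})
    return log

def verify(log):
    """Walk the chain and recompute every hash; any edit breaks the walk."""
    prev = GENESIS
    for entry in log:
        body = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

The 3-5 months are everything around this: writing the chain to append-only storage, anchoring it so an admin with database access can't rewrite it wholesale, and making the export something an auditor will accept.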

Total at the low end: 24 person-months. At the high end: 47. With two engineers, that's a 12-24 month project to ship v1.

The year-2 cost nobody plans for

The build estimate is honest. The year-2 estimate is where the wheels actually come off.

In year 2 you have:

  1. Connector maintenance. Source APIs change, auth schemes rotate, upstream schemas drift. Every connector you wrote is now a thing you patch.
  2. Match-quality regressions. New data distributions erode your thresholds; someone has to retune and re-benchmark.
  3. Workbench feature requests. Stewards who use the tool daily generate a steady backlog.
  4. Ops. Upgrades, security patches, on-call for the pipeline.

Year-2 cost: 0.5-1 full-time engineer dedicated to maintenance. That's $150k-$300k/year in loaded cost. Forever.

The true cost over 3 years

Conservative version, two-engineer team, 24-month build, 0.75 FTE maintenance ongoing:

  1. Build: 2 engineers × 2 years × ~$300k loaded cost = $1.2M.
  2. Maintenance: 0.75 FTE × ~$300k for the remaining year = $225k.
  3. Three-year total: roughly $1.4M, before counting opportunity cost.

For comparison: the $80k/year quote that started this conversation comes to $240k over the same three years.

The math isn't even close. Building MDM in-house is one of the worst ROI engineering decisions a mid-market company can make, and it's also one of the most common.
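That arithmetic is easy to re-run with your own numbers. A sketch using the post's stated figures (~$300k loaded cost per engineer-year, two-engineer 24-month build, 0.75 FTE maintenance, $80k/year license); every default below is an assumption you should replace:

```python
def three_year_cost_build(engineers=2, build_months=24,
                          loaded_cost=300_000, maintenance_fte=0.75):
    """Back-of-envelope 3-year cost of building in-house.

    loaded_cost is per engineer per year; maintenance starts once the
    build ships and runs for the remainder of the 3-year window.
    """
    build = engineers * (build_months / 12) * loaded_cost
    maintenance_years = max(0, 3 - build_months / 12)
    return build + maintenance_fte * loaded_cost * maintenance_years

def three_year_cost_buy(annual_license=80_000):
    """Back-of-envelope 3-year cost of a hosted vendor."""
    return 3 * annual_license
```

With the defaults, build lands around $1.4M and buy around $240k, which is the shape of the gap regardless of how you tune the inputs.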

When building actually wins

I want to be fair about this because the post that pretends building is never right is a sales pitch. Building is genuinely the right call when:

  1. Matching is your product, not your plumbing, and the engine itself is IP you sell.
  2. Your data genuinely can't leave your environment and no vendor will deploy inside it.
  3. Your scale or data shape breaks every vendor you pilot, and you've actually run the pilots.

Even when building wins, the right shape is usually "build on top of an open-source engine, don't build the engine." goldenmatch, dedupe, splink — these are all MIT-licensed engines you can drop in. Build the operating surface around them; don't rebuild the matching algorithm.

When buying actually wins

Buying wins for the boring middle case. You have 50k-2M customer records. You have 3-6 source systems. You have one data engineer and an analytics person. You have a real business problem (attribution / pipeline reporting / billing reconciliation) that the duplicates are blocking. You don't have 24 months to wait for a custom build.

For this case, the buy-vs-build math is basically settled: buy. The only question is which tier (see the Reltio alternatives post for the tier breakdown).

The argument the "build" side usually makes against this is "but vendors will lock us in." That's true, and it's also true of any database, ORM, cloud provider, CI system, or analytics tool you currently use. Vendor lock-in is a cost, not a disqualifier. The way to mitigate it is to pick a vendor whose data model is exportable and whose pricing is predictable — not to refuse to use any vendor at all.

What to do this week

If you're currently in a build-vs-buy debate:

  1. Do the honest math. Use the table above. Multiply by your actual loaded cost per engineer, not the discounted number. Don't subtract "but they're already employed" — opportunity cost is a real cost.
  2. Pilot the buy side. Pick one or two vendors at the tier that fits your size. Most have free trials or free tiers. Run a one-week PoC on your actual data. See if the match quality is good enough.
  3. Only build if the pilot visibly fails. "Fails" means the vendor can't handle your specific edge cases, not that the UI isn't perfect. Mid-tier MDM tools have the same match quality as enterprise tools at this point — the gap is operating surface, not engine.
  4. If you build, build on top of an OSS engine. goldenmatch, dedupe, splink. Whichever you pick, you save 3-6 months on the engine itself and you stay close to a community that's solving the same problems.

The honest answer to "should we build our own MDM" is almost always no. And even when "build" wins, the honest answer to "should we build all of it ourselves" is still no: the engine is open source already, and building it again is the wrong place to spend the bandwidth.

Golden Suite is the open-source MDM workbench I'm building, exactly because I think the boring middle case (50k-2M records, 3-6 sources, one data engineer) is the worst-served buyer in the MDM market. Try the demo or read the build-vs-buy comparison.