Blog

Writing

Articles on data quality, schema mapping, and Python data engineering.

2026-06-20 New

How to match two lists without either side seeing the other (privacy-preserving record linkage)

A technical look at how the /cleanroom tool finds the overlap between two lists using Bloom-filter fingerprints (CLKs), why the server can't reverse them, and how the browser tier keeps raw data on your device.

privacy-preserving-record-linkage, pprl, bloom-filter, record-linkage, clean-room, goldenmatch

2026-06-19 New

Why Knowledge Graphs Live or Die on Entity Resolution

A knowledge graph is only as good as its entities. Why bad entity resolution wrecks KG quality and cost, and how GoldenMatch solves the node layer.

knowledge-graph, entity-resolution, graphrag, data-quality, goldenmatch

2026-05-23 New

Pipe Your SaaS Data to Your Warehouse: A Funnel That Doesn't Own It

Most MDM tools want to be your source of truth — you query their store. bensevern.dev inverts that: it's a matching funnel between your SaaS sources and your warehouse, then hands the data back. Here's what that looks like end-to-end.

python, mdm, data-pipeline, entity-resolution, warehouse, etl

2026-05-21 New

SBOM scanning with three-state verdicts beats AFFECTED/NOT_AFFECTED

check_affected.py takes a CycloneDX SBOM and answers 'am I affected at version X?' with AFFECTED / NOT_AFFECTED / UNKNOWN — and shows you the interval that decided each verdict.

sbom, vulnerability, cyclonedx, package-scanner, univers, goldenmatch

2026-05-19 New

Only 7.5% of CVEs are expressible in package coordinates

Package scanners aren't missing KEV by accident. KEV-with-ransomware is structurally less package-representable than the baseline corpus.

entity-resolution, goldenmatch, security, vulnerabilities, cisa-kev, epss

2026-05-18 New

Joining EPA ECHO to HIFLD prison data when the spatial join fails

A fuzzy-join walkthrough for the open issue UCLA's Carceral Ecologies lab is sitting on: matching ~7,000 carceral facilities across federal datasets that share no clean key, no consistent industry code, and a lat/long column full of zipcode centroids.

entity-resolution, goldenmatch, public-records, epa-echo, carceral-data, python

2026-05-17 New

The OSS vuln-DB 'consensus' is a redistribution artifact

Cross-source remediation agreement looks near-perfect at 99-100%. De-duplicate the mirrors and it collapses to 70%. The lift is 1,898x.

entity-resolution, goldenmatch, security, vulnerabilities, methodology

2026-05-15 New

28 seeds, one corroborated lead: an Epstein-network investigation in public data

What an entity-resolution pipeline finds (and misses) when pointed at 28 publicly-sourced seeds from the Epstein corporate-network reporting.

goldenmatch, entity-resolution, icij, investigation, case-study

2026-05-15 New

Phoenix Spree Deutschland: one cluster from raw leak to GLEIF anchor

A 9-member ICIJ Offshore Leaks cluster, 100% GLEIF-anchored, walked end to end from source rows through GoldenMatch dedupe to a finished report.

goldenmatch, entity-resolution, icij, gleif, case-study

2026-05-15 New

Reconciling 4.1M shell-company records on a single Railway box

Ingesting ICIJ + GLEIF + OpenSanctions + UK PSC into one unified company table, then deduping it with GoldenMatch on a 24-vCPU Railway service.

goldenmatch, entity-resolution, icij, gleif, railway

2026-05-15 New

88% of actively-exploited CVEs aren't in any package ecosystem

Re-running the OSS vuln reconciliation at 6.1M records and 40 sources surfaces a structural blind spot in every package-level scanner.

entity-resolution, goldenmatch, security, vulnerabilities, cisa-kev

2026-05-15 New

The honest math on building your own MDM

A realistic person-month estimate for building an MDM platform in-house: engine, pipeline, workbench, audit, survivorship, connectors. Plus the year-2 maintenance cost nobody plans for.

mdm, build-vs-buy, engineering-management, data-engineering

2026-05-14 New

How to dedupe Salesforce accounts when you can't afford DemandTools

A RevOps tactical guide: find duplicate accounts in Salesforce, decide which one wins, and merge them. Covers native dedup, DemandTools, Cloudingo, and the open-source path.

salesforce, revops, deduplication, data-quality, crm

2026-05-13 New

Reltio alternatives that don't cost $5,000 a month

An honest field guide to MDM tools when your company can't justify a Reltio license. Covers DIY, the open-source middle, and the SaaS landscape — with realistic price ranges.

mdm, reltio, entity-resolution, buyers-guide, pricing

2026-04-20 New

From 51M Orders to Golden Customers: Full-Pipeline ER at Retail Scale

Running the full Golden Suite — GoldenCheck, GoldenFlow, GoldenMatch — on a Turkish retail CRM with 10.2M orders and 100K customers across 161 branches. 67 quality findings, 67K names normalized, 11,708 duplicate clusters discovered. European decimals, Turkish diacritics, and the false-positive pressure of common names on the same street.

entity-resolution, goldencheck, goldenflow, goldenmatch, retail-data, python, data-pipeline

2026-04-16 New

Wagner Was on OFAC in 2018: What 10 Years of Sanctions Data Reveals

Reconciled 85 sanctions lists + 10 years of OFAC history + a 13M-wallet attribution graph. Wagner was listed in 2018; 18% of designations get reversed.

entity-resolution, sanctions, compliance, opensanctions, ofac

2026-04-13 New

infermap Now Runs in TypeScript: Schema Mapping on the Edge

F1 0.840 on 162 benchmark cases — infermap's seven-scorer schema mapping engine ships on npm with zero runtime dependencies. Runs in Edge Functions, Workers, and the browser.

infermap, typescript, schema-mapping, open-source, npm, edge-functions, data-engineering, etl

2026-04-13 New

GoldenCheck Now Runs in TypeScript: Zero-Config Data Validation at the Edge

GoldenCheck's 10 profilers, drift detection, and confidence scoring ship on npm with an edge-safe core. DQBench 88.40 — now in your browser, Workers, and Node.js.

goldencheck, typescript, data-quality, open-source, npm, edge-functions

2026-04-11 New

Product Catalog Dedup on a Real 1M-Row Dataset: F1 0.05 → 0.36 in Three Steps

Running the full Golden Suite (GoldenCheck → GoldenFlow → GoldenMatch) on the UCI Online Retail II catalog. Real, unsynthetic duplicates. Honest numbers — and how fixing the eval, switching to Vertex AI embeddings, and tuning the threshold lifted F1 7× from a hopeless lexical baseline.

entity-resolution, goldencheck, goldenflow, goldenmatch, ecommerce, vertex-ai, python

2026-04-10 New

Reconciling 15 OSS Vulnerability Databases

Cross-database entity resolution across OSV, GHSA, PyPA, RustSec, and Go vulndb. 869k records, 608k canonical vulns, and one structural blind spot.

entity-resolution, goldenmatch, security, vulnerabilities, python

2026-04-09 New

Wallet Attribution at Scale: ER on 13M Blockchain Records

Running entity resolution across 10 public blockchain attribution datasets surfaces cross-jurisdictional sanctions and universal infrastructure patterns.

entity-resolution, goldenmatch, blockchain, sanctions, python

2026-04-08 New

The OSS ER Bargain: What Entity Resolution Costs

Benchmarking dedupe vs GoldenMatch on 500k CMS NPPES provider records. Real numbers on runtime, memory, and the decisions OSS hands back to you.

entity-resolution, goldenmatch, dedupe, benchmark, python, healthcare-data

2026-04-07 New

Golden Suite + MCP: Giving AI Agents a Data Cleaning Toolkit

How the Model Context Protocol turns GoldenMatch, infermap, and GoldenPipe into tools any AI agent can call — and where we're taking it next.

mcp, ai-agents, goldenmatch, infermap, goldenpipe

2026-04-06 New

From Dirty CSV to Golden Records: A Python Walkthrough

Take 5,400 messy CMS hospital records from raw CSV to deduplicated golden records. Three approaches compared: zero-config, explicit tuning, LLM boost.

python, data-cleaning, deduplication, goldenpipe, goldenmatch

2026-04-04 New

10 Data Problems Every Pipeline Hits and Their Fixes

The same 10 data quality issues show up in every dataset: phone formats, broken dates, null variants. Here's what each looks like and how to fix it.

data-quality, data-cleaning, python, goldenflow, data-engineering, etl, data-transformation

2026-04-03 New

GoldenMatch vs Splink vs Dedupe vs RecordLinkage

We ran four Python entity resolution libraries on the same three datasets — Febrl, DBLP-ACM, and 10K real voter records. Here's where each shines.

entity-resolution, comparison, goldenmatch, splink, dedupe, recordlinkage, python, benchmark

2026-04-02 New

GoldenMatch vs. BPID: Testing Against an EMNLP Benchmark

We benchmarked GoldenMatch on Amazon's BPID dataset — 10,000 adversarial PII pairs. With DOB parsing and Vertex AI embeddings, we hit 0.750 F1 — matching Ditto with zero training data.

entity-resolution, benchmark, goldenmatch, pii-deduplication, python

2026-04-01 New

Deduplicating 401K Equipment Records with LLM Calibration

We ran GoldenMatch on 401,125 bulldozer auction records from Kaggle. Iterative LLM calibration learned the optimal match threshold from just 200 pairs (~$0.01). ANN hybrid blocking recovered 949 records that string blocking missed.

entity-resolution, equipment-data, llm, goldenmatch, python, ann-blocking

2026-03-31 New

AI-Powered Deduplication: LLMs Supercharge Golden Suite

Enable LLM boost across GoldenCheck, GoldenFlow, and GoldenMatch to catch the borderline pairs fuzzy matching misses. Real costs under $0.10 per run.

llm, deduplication, data-quality, golden-suite, python

2026-03-30 New

Getting Started with GoldenPipe: Clean Data in Python

Add a production-ready data quality pipeline to your Python backend in 5 minutes. One pip install, one function call, zero config. CSV-friendly.

python, data-pipeline, goldenpipe, getting-started

2026-03-30 New

Entity Resolution on 208K Real Records with Golden Suite

We ran the full Golden Suite pipeline on 208,505 real NC voter registration records. 61 quality findings, 197K addresses cleaned, 10,718 duplicate clusters found — all in 34 seconds with zero config.

entity-resolution, real-data, goldenmatch, goldenpipe, python

Coming soon

How to Deduplicate CSV Files in Python

5 methods compared — from naive loops to production-grade entity resolution with GoldenMatch.

python, deduplication, entity-resolution

Coming soon

Schema Mapping: What It Is and Why Your ETL Needs It

How infermap uses a weighted scorer pipeline to automatically align messy columns to your target schema.

schema-mapping, etl, infermap

Coming soon

Building a Data Quality Scanner in Python

From regex checks to statistical profiling — how GoldenCheck finds problems you didn't know you had.

data-quality, python, goldencheck