28 seeds, one corroborated lead: an Epstein-network investigation in public data
What an entity-resolution pipeline finds (and misses) when pointed at 28 publicly-sourced seeds from the Epstein corporate-network reporting.
Blog
Articles on data quality, schema mapping, and Python data engineering.
A 9-member ICIJ cluster, 100% GLEIF-anchored, walked from source rows through GoldenMatch dedupe to a finished provenance report.
Ingesting ICIJ + GLEIF + OpenSanctions + UK PSC into one unified company table, then deduping it with GoldenMatch on a 24-vCPU Railway service.
Re-running the OSS vuln reconciliation at 6.1M records and 40 sources surfaces a structural blind spot in every package-level scanner.
A realistic person-month estimate for building an MDM platform in-house: engine, pipeline, workbench, audit, survivorship, connectors. Plus the year-2 maintenance cost nobody plans for.
A RevOps tactical guide: find duplicate accounts in Salesforce, decide which one wins, and merge them. Covers native dedup, DemandTools, Cloudingo, and the open-source path.
An honest field guide to MDM tools when your company can't justify a Reltio license. Covers DIY, the open-source middle, and the SaaS landscape — with realistic price ranges.
Running the full Golden Suite — GoldenCheck, GoldenFlow, GoldenMatch — on a Turkish retail CRM with 10.2M orders and 100K customers across 161 branches. 67 quality findings, 67K names normalized, 11,708 duplicate clusters discovered. European decimals, Turkish diacritics, and the false-positive pressure of common names on the same street.
Reconciled 85 sanctions lists + 10 years of OFAC history + a 13M-wallet attribution graph. Wagner was listed in 2018; 18% of designations get reversed.
F1 0.840 on 162 benchmark cases — infermap's seven-scorer schema mapping engine ships on npm with zero runtime dependencies. Runs in Edge Functions, Workers, and the browser.
GoldenCheck's 10 profilers, drift detection, and confidence scoring ship on npm with an edge-safe core. DQBench 88.40 — now in your browser, Workers, and Node.js.
Running the full Golden Suite (GoldenCheck → GoldenFlow → GoldenMatch) on the UCI Online Retail II catalog. Real, non-synthetic duplicates. Honest numbers, and how fixing the eval, switching to Vertex AI embeddings, and tuning the threshold lifted F1 7× from a hopeless lexical baseline.
Cross-database ER across OSV, GHSA, PyPA, RustSec, Go vulndb — 869k records, 608k canonical vulns, and one structural blind spot.
Running entity resolution across 10 public blockchain attribution datasets surfaces cross-jurisdictional sanctions and universal infrastructure patterns.
Benchmarking dedupe vs GoldenMatch on 500k CMS NPPES provider records. Real numbers on runtime, memory, and decisions OSS hands back to you.
How the Model Context Protocol turns GoldenMatch, infermap, and GoldenPipe into tools any AI agent can call — and where we're taking it next.
Take 5,400 messy hospital records from raw CSV to deduplicated golden records — zero-config, then explicit tuning, then LLM boost.
The same 10 data quality issues show up in every dataset. Here's what they look like and how to fix each in one line.
We ran four Python entity resolution libraries on the same three datasets — Febrl, DBLP-ACM, and 10K real voter records. Here's where each shines.
We benchmarked GoldenMatch on Amazon's BPID dataset — 10,000 adversarial PII pairs. With DOB parsing and Vertex AI embeddings, we hit 0.750 F1 — matching Ditto with zero training data.
We ran GoldenMatch on 401,125 bulldozer auction records from Kaggle. Iterative LLM calibration learned the optimal match threshold from just 200 pairs (~$0.01). ANN hybrid blocking recovered 949 records that string blocking missed.
Enable LLM boost across GoldenCheck, GoldenFlow, and GoldenMatch to catch what fuzzy matching misses — with real costs under $0.10.
Add a production-ready data quality pipeline to your Python backend in 5 minutes. One pip install, one function call, zero config.
We ran the full Golden Suite pipeline on 208,505 real NC voter registration records. 61 quality findings, 197K addresses cleaned, 10,718 duplicate clusters found — all in 34 seconds with zero config.
5 methods compared — from naive loops to production-grade entity resolution with GoldenMatch.
How infermap uses a weighted scorer pipeline to automatically align messy columns to your target schema.
From regex checks to statistical profiling — how GoldenCheck finds problems you didn't know you had.