Tag · python

Posts on python

14 posts tagged python.

2026-05-18

Joining EPA ECHO to HIFLD prison data when the spatial join fails

A fuzzy-join walkthrough for the open issue UCLA's Carceral Ecologies lab is sitting on: matching ~7,000 carceral facilities across federal datasets that share no clean key, no consistent industry code, and a lat/long column full of zipcode centroids.

entity-resolution · goldenmatch · public-records · epa-echo · carceral-data · python
2026-04-20

From 51M Orders to Golden Customers: Full-Pipeline ER at Retail Scale

Running the full Golden Suite — GoldenCheck, GoldenFlow, GoldenMatch — on a Turkish retail CRM with 10.2M orders and 100K customers across 161 branches. 67 quality findings, 67K names normalized, 11,708 duplicate clusters discovered. European decimals, Turkish diacritics, and the false-positive pressure of common names on the same street.

entity-resolution · goldencheck · goldenflow · goldenmatch · retail-data · python · data-pipeline
2026-04-11

Product Catalog Dedup on a Real 1M-Row Dataset: F1 0.05 → 0.36 in Three Steps

Running the full Golden Suite (GoldenCheck → GoldenFlow → GoldenMatch) on the UCI Online Retail II catalog. Real, non-synthetic duplicates and honest numbers: fixing the eval, switching to Vertex AI embeddings, and tuning the threshold lifted F1 7× from a hopeless lexical baseline.

entity-resolution · goldencheck · goldenflow · goldenmatch · ecommerce · vertex-ai · python
2026-04-10

Reconciling 15 OSS Vulnerability Databases

Cross-database ER across OSV, GHSA, PyPA, RustSec, Go vulndb — 869k records, 608k canonical vulns, and one structural blind spot.

entity-resolution · goldenmatch · security · vulnerabilities · python
2026-04-09

Wallet Attribution at Scale: ER on 13M Blockchain Records

Running entity resolution across 10 public blockchain attribution datasets surfaces cross-jurisdictional sanctions and universal infrastructure patterns.

entity-resolution · goldenmatch · blockchain · sanctions · python
2026-04-08

The OSS ER Bargain: What Entity Resolution Costs

Benchmarking dedupe vs GoldenMatch on 500k CMS NPPES provider records. Real numbers on runtime, memory, and decisions OSS hands back to you.

entity-resolution · goldenmatch · dedupe · benchmark · python · healthcare-data
2026-04-06

From Dirty CSV to Golden Records: A Python Walkthrough

Take 5,400 messy hospital records from raw CSV to deduplicated golden records — zero-config, then explicit tuning, then LLM boost.

python · data-cleaning · deduplication · goldenpipe · goldenmatch
2026-04-04

10 Data Problems Every Pipeline Hits and Their Fixes

The same 10 data quality issues show up in every dataset. Here's what they look like and how to fix each in one line.

data-quality · data-cleaning · python · goldenflow · data-engineering · etl · data-transformation
2026-04-03

GoldenMatch vs Splink vs Dedupe vs RecordLinkage

We ran four Python entity resolution libraries on the same three datasets — Febrl, DBLP-ACM, and 10K real voter records. Here's where each shines.

entity-resolution · comparison · goldenmatch · splink · dedupe · recordlinkage · python · benchmark
2026-04-02

GoldenMatch vs. BPID: Testing Against an EMNLP Benchmark

We benchmarked GoldenMatch on Amazon's BPID dataset — 10,000 adversarial PII pairs. With DOB parsing and Vertex AI embeddings, we hit 0.750 F1 — matching Ditto with zero training data.

entity-resolution · benchmark · goldenmatch · pii-deduplication · python
2026-04-01

Deduplicating 401K Equipment Records with LLM Calibration

We ran GoldenMatch on 401,125 bulldozer auction records from Kaggle. Iterative LLM calibration learned the optimal match threshold from just 200 pairs (~$0.01). ANN hybrid blocking recovered 949 records that string blocking missed.

entity-resolution · equipment-data · llm · goldenmatch · python · ann-blocking
2026-03-31

AI-Powered Deduplication: LLMs Supercharge Golden Suite

Enable LLM boost across GoldenCheck, GoldenFlow, and GoldenMatch to catch what fuzzy matching misses — with real costs under $0.10.

llm · deduplication · data-quality · golden-suite · python
2026-03-30

Getting Started with GoldenPipe: Clean Data in Python

Add a production-ready data quality pipeline to your Python backend in 5 minutes. One pip install, one function call, zero config.

python · data-pipeline · goldenpipe · getting-started
2026-03-30

Entity Resolution on 208K Real Records with Golden Suite

We ran the full Golden Suite pipeline on 208,505 real NC voter registration records. 61 quality findings, 197K addresses cleaned, 10,718 duplicate clusters found — all in 34 seconds with zero config.

entity-resolution · real-data · goldenmatch · goldenpipe · python