2026-04-20/Ben Severn

From 51M Line Items to Golden Customers: Full-Pipeline ER at Retail Scale

Running the full Golden Suite — GoldenCheck, GoldenFlow, GoldenMatch — on a Turkish retail CRM with 10.2M orders and 100K customers across 161 branches. 67 quality findings, 67K names normalized, 11,708 duplicate clusters discovered. European decimals, Turkish diacritics, and the false-positive pressure of common names on the same street.

Tags: entity-resolution, goldencheck, goldenflow, goldenmatch, retail-data, python, data-pipeline

Every retail CRM accumulates ghosts. The same customer registers twice — once from the mobile app, once in-store. A name gets entered as "Hakan Koçak" in one system and "Hakan Koç" in another. A family of four shares an apartment building and three of them share a surname. At 100K accounts, these collisions aren't edge cases — they're structural.

This post takes the 3A Superstore dataset from Kaggle (a Turkish retail CRM with 161 branches, 99,998 customers, 10.2 million orders, and 51 million line items) and runs the full Golden Suite pipeline (GoldenCheck → GoldenFlow → GoldenMatch) across it. No injected dirty data: whatever's in the data, we report.

The dataset is synthetic but realistic, modeled after large Turkish supermarket chains. That means the names are real Turkish names, the addresses follow real Turkish postal conventions, and the data quality issues — European decimal formats, diacritical encoding, inconsistent address structures — are exactly what you'd hit in production.

The Dataset

The 3A Superstore dataset ships as five interconnected CSV files:

| Table | Rows | Key Columns | Delimiter |
|---|---|---|---|
| Customers | 99,998 | USERID, NAMESURNAME, USERBIRTHDATE, REGION, CITY, TOWN, DISTRICT, ADDRESSTEXT | ; |
| Orders | 10,235,193 | ORDERID, BRANCH_ID, USERID, NAMESURNAME, TOTALBASKET | , |
| Order Details | 51,185,032 | ORDERID, AMOUNT, UNITPRICE, TOTALPRICE, ITEMID | , |
| Categories | 27,000 | ITEMID, CATEGORY1–4, BRAND, ITEMNAME | ; |
| Branches | 957 | BRANCH_ID, REGION, CITY, TOWN, LAT, LON | ; |

Two things jump out before you even open the files: the reference tables use semicolons while the transactional tables use commas, and the money columns use European decimal format (2637,5499999999997 instead of 2637.55).

import polars as pl

customers = pl.read_csv("Customers_ENG.csv", separator=";")
orders = pl.read_csv("Orders.csv")

print(f"Customers: {customers.shape}")
print(f"Orders: {orders.shape}")
# Customers: (99998, 11)
# Orders: (10235193, 6)

NAMESURNAME appears in both tables — denormalized from Customers into every one of the 10.2M order rows. In a real system, this is where drift accumulates: the customer updates their name, but the old orders keep the original spelling. Here, the synthetic data is perfectly consistent — zero mismatches across all 10.2M rows. That's our first signal that this dataset is cleaner than production, but the structural patterns (name collisions, address overlap) are still real.
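To make the drift check concrete, here's a minimal sketch with toy stand-in data; the real pipeline joins the two CSVs on USERID and compares the two NAMESURNAME columns:

```python
# Toy stand-ins for Customers and Orders (hypothetical IDs and rows)
customers = {101: "Hakan Koçak", 102: "Suna Karaoğlu"}
orders = [
    (1, 101, "Hakan Koçak"),
    (2, 101, "Hakan Koç"),      # drifted spelling
    (3, 102, "Suna Karaoğlu"),
]

# An order "drifts" when its denormalized name no longer matches the master record
drift = [(order_id, name) for order_id, user_id, name in orders
         if customers[user_id] != name]
print(drift)  # [(2, 'Hakan Koç')] here; empty across all 10.2M real rows
```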

Stage 1: GoldenCheck — Profiling

GoldenCheck scans a dataset and reports quality findings without any configuration. Let's profile the two main tables.

Customers (99,998 rows)

import time
import goldencheck

start = time.perf_counter()
findings, profile = goldencheck.scan_file("customers_std.csv", sample_size=100_000)
print(f"Findings: {len(findings)} in {time.perf_counter() - start:.2f}s")
# Findings: 67 in 1.74s

67 findings in under 2 seconds. The severity breakdown:

| Severity | Count | Examples |
|---|---|---|
| Medium | 14 | USERID outliers, sequence gaps, NAMESURNAME categorical drift, TOWN/DISTRICT pattern inconsistencies |
| Low | 53 | Non-ASCII detection, email pattern, name pattern variants, birthdate range, enum candidates |

The most interesting findings:

Non-ASCII names. 76,682 of 99,998 NAMESURNAME values contain non-ASCII characters — Turkish diacritics like ş, ö, ç, ğ, ı, İ, ü. GoldenCheck flags this not as an error but as a verification prompt: is this intentional international text, or is it mojibake?

[severity 1] NAMESURNAME: 76,682 value(s) contain non-ASCII / unicode characters
    — verify encoding is intentional (international text vs. mojibake)
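The underlying check is simple to reproduce; str.isascii() (Python 3.7+) is enough for a sketch:

```python
# A value trips the finding if any character falls outside ASCII
names = ["Atilla Keleş", "Zeynep Şeyma Altun", "John Smith"]
non_ascii = [n for n in names if not n.isascii()]
print(non_ascii)  # ['Atilla Keleş', 'Zeynep Şeyma Altun']
```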

Name collision pressure. Only 42,748 distinct names cover 99,998 accounts, which means 57% of accounts (57,250 of them) carry a name that some other account already uses.

| Names appearing N times | Count |
|---|---|
| Exactly 1× (unique) | 29,534 |
| 2× | 1,452 |
| 3× | 1,687 |
| 4× | 2,152 |
| 5+× | 7,923 |

The most common name — "Suna Karaoğlu" — appears 21 times. This is realistic for Turkish data: a relatively small pool of given names (Mehmet, Ayşe, Ahmet, Fatma) combined with common surnames (Yılmaz, Kaya, Demir, Şahin) creates enormous collision pressure.
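The multiplicity table above is a two-level count, sketched here with a toy roster (the real input is the 99,998-row NAMESURNAME column):

```python
from collections import Counter

# Hypothetical mini-roster standing in for the NAMESURNAME column
names = [
    "Suna Karaoğlu", "Suna Karaoğlu", "Mehmet Yılmaz",
    "Ayşe Demir", "Ayşe Demir", "Ayşe Demir", "Hakan Koç",
]

per_name = Counter(names)                  # name -> number of accounts using it
multiplicity = Counter(per_name.values())  # N -> number of names appearing exactly N times
sharing = sum(c for c in per_name.values() if c > 1)  # accounts that share a name

print(multiplicity[1], multiplicity[2], multiplicity[3], sharing)  # 2 1 1 5
```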

TOWN pattern inconsistencies. Most TOWNs follow a simple LLLLLLL pattern, but 31 entries contain slashes (LLLLLLLLLL/LLLLLL) — compound town names that weren't normalized on ingestion.

STATUS_ is constant. Every single row has STATUS_ = 1. A one-value column occupying 100K rows of storage. GoldenCheck correctly flags it as an enum candidate.

Orders (10.2M rows)

import time

start = time.perf_counter()
findings, profile = goldencheck.scan_file("Orders.csv", sample_size=100_000)
print(f"Findings: {len(findings)} in {time.perf_counter() - start:.2f}s")
# Findings: 29 in 4.30s

29 findings in 4.3 seconds (sampling 100K from 10.2M). The headline:

TOTALBASKET is a string. The European decimal format — "2637,5499999999997" — means Polars reads the column as String instead of Float64. GoldenCheck detects the mismatch:

[severity 1] TOTALBASKET: Column is string but 1.2% of values appear numeric
    (1,189 values) — possible data entry error

And in the pattern analysis:

[severity 1] TOTALBASKET: Inconsistent pattern detected:
    'DDDDD,DDDDDDDDDDD' appears in 7 row(s) vs dominant pattern 'DDD,DD' (26,620)

That DDDDD,DDDDDDDDDDD pattern is the floating-point precision artifact — 2637,5499999999997 instead of 2637,55. It affects 3.4 million rows (33.3% of orders).
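The artifact detection reduces to a pattern test: a currency column should never carry more than two fractional digits. A minimal version of that check:

```python
import re

# More than 2 digits after the European decimal comma = precision artifact
ARTIFACT = re.compile(r"^\d+,\d{3,}$")

values = ["2637,5499999999997", "2637,55", "935,43", "120,30000000000001"]
artifacts = [v for v in values if ARTIFACT.match(v)]
print(artifacts)  # ['2637,5499999999997', '120,30000000000001']
```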

Stage 2: GoldenFlow — Cleaning

GoldenFlow auto-learns a cleaning config from the data and applies transforms. Let's clean the Customers table.

from pathlib import Path
import goldenflow

config = goldenflow.learn_config(Path("customers_std.csv"))
result = goldenflow.transform_file(Path("customers_std.csv"))
# 44 transforms applied, 0 errors, 4.56s

GoldenFlow selected 44 transforms across 8 columns. Most are no-ops on this already-clean synthetic data, but two transforms make real changes:

normalize_unicode: 67,406 rows changed

NAMESURNAME: normalize_unicode — 67,406 rows changed
  Before: Atilla Keleş → After: Atilla Keles
  Before: Zeynep Şeyma Altun → After: Zeynep Seyma Altun

This is the most important decision in the pipeline — and it's the wrong one for Turkish data.

In English, Unicode normalization (stripping diacritics) is almost always safe: "café" → "cafe" loses nothing meaningful. In Turkish, it destroys information. "Keleş" and "Keles" are different surnames. The Turkish alphabet has six letters that don't exist in ASCII — ç, ğ, ı, İ, ö, ü, ş — and stripping them collapses distinct names into the same string.

This is GoldenFlow working as designed: it applies normalize_unicode because 76% of names contain non-ASCII characters, and the transform's purpose is to standardize encoding. But the right call here is to keep the diacritics and rely on GoldenMatch's fuzzy scoring to handle encoding-variant matches. Lesson: auto-learned configs need human review, especially on international data.
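What the transform does under the hood is standard ASCII folding: decompose to NFKD, then drop combining marks. A sketch showing exactly how two distinct surnames collide, with one Turkish wrinkle: dotless ı is a base letter with no decomposition, so naive folding leaves it untouched:

```python
import unicodedata

def strip_diacritics(s: str) -> str:
    # NFKD-decompose, then drop combining marks (the usual ASCII-folding trick)
    return "".join(
        ch for ch in unicodedata.normalize("NFKD", s)
        if not unicodedata.combining(ch)
    )

print(strip_diacritics("Keleş"))   # Keles — now indistinguishable from the surname Keles
print(strip_diacritics("Çağla"))   # Cagla
print(strip_diacritics("ı"))       # ı — dotless i survives: it is not "i" plus a mark
```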

collapse_whitespace: 99,997 rows changed

ADDRESSTEXT: collapse_whitespace — 99,997 rows changed
  Before: FATIH MAH. SEHIT AHMET KOC SOKAK  26220  TEPEBASI/ESKISEHIR
  After:  FATIH MAH. SEHIT AHMET KOC SOKAK 26220 TEPEBASI/ESKISEHIR

Almost every address has double spaces separating the postal code from the street and town. Collapsing to single spaces is unambiguously correct and makes downstream fuzzy matching more accurate.
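The transform is equivalent to a single regex substitution:

```python
import re

addr = "FATIH MAH. SEHIT AHMET KOC SOKAK  26220  TEPEBASI/ESKISEHIR"
cleaned = re.sub(r"\s+", " ", addr).strip()  # any run of whitespace -> one space
print(cleaned)
```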

The Orders TOTALBASKET Fix

The European decimal fix on 10.2M rows is a one-liner:

orders = orders.with_columns(
    pl.col("TOTALBASKET").str.replace(",", ".").cast(pl.Float64)
)
# 10.2M rows fixed in 0.11s

With the fix applied:

| Stat | Value |
|---|---|
| Min | 0.00 TRY |
| Max | 307,683.45 TRY |
| Mean | 1,280.45 TRY |
| Median | 935.43 TRY |
| Total revenue | 13.1 billion TRY |
| Float-precision artifacts | 3,403,728 rows (33.3%) |

One-third of all orders had floating-point precision artifacts (2637,5499999999997 instead of 2637,55). In a reporting pipeline, that's the difference between ₺2,637.55 and ₺2,637.5499999999997 showing up in a customer invoice.
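If the artifacts should not survive into reports at all, one option is to parse through Decimal and quantize to kuruş (2 places) instead of casting straight to float; a sketch, assuming round-half-up is the desired policy:

```python
from decimal import Decimal, ROUND_HALF_UP

def parse_try(value: str) -> Decimal:
    # European comma decimal -> exact Decimal, rounded to 2 places (kuruş)
    return Decimal(value.replace(",", ".")).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )

print(parse_try("2637,5499999999997"))  # 2637.55
print(parse_try("935,43"))              # 935.43
```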

Stage 3: GoldenMatch — Deduplication

Now the main event: running GoldenMatch on 99,998 customer accounts to find duplicates.

Auto-Configuration

GoldenMatch's auto_configure analyzes the schema and selects blocking strategies, field weights, and scorers:

import goldenmatch

config = goldenmatch.auto_configure([("customers_std.csv", "customers")])
result = goldenmatch.dedupe("customers_std.csv", config=config)

The auto-config chose:

Blocking — three passes with union mode:

  1. CITY + NAMESURNAME (exact) — catches same-name, same-city pairs
  2. CITY + NAMESURNAME[:5] (substring) — catches truncated/abbreviated names
  3. NAMESURNAME (soundex) — catches phonetic variants across cities
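For intuition on the phonetic pass, here is a textbook American Soundex (GoldenMatch's internal variant may differ). Non-ASCII letters drop out before coding, which is exactly what lets diacritic variants of a name land in the same block:

```python
def soundex(name: str) -> str:
    # Textbook American Soundex; letters outside ASCII (ş, ç, ğ...) are
    # skipped after uppercasing, so diacritic variants share a code
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    s = "".join(ch for ch in name.upper() if ch.isalpha() and ch.isascii())
    if not s:
        return "0000"
    out, prev = s[0], codes.get(s[0], "")
    for ch in s[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":  # H and W are transparent: they don't reset prev
            prev = code
    return (out + "000")[:4]

print(soundex("Yılmaz"), soundex("Yilmaz"))  # Y452 Y452 — same block
```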

Scoring — weighted fuzzy fields:

| Field | Weight | Scorer |
|---|---|---|
| ADDRESSTEXT | 1.5× | token_sort |
| NAMESURNAME | 1.0× | ensemble |
| TOWN | 1.0× | ensemble |
| DISTRICT | 0.3× | token_sort |

Address gets the highest weight — in a CRM, the address is the strongest signal that two accounts belong to the same household.
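Assuming a simple weighted-mean combiner (the actual GoldenMatch formula isn't documented here), the weights explain how a near-perfect address can outvote a weak name:

```python
def weighted_score(field_scores: dict, weights: dict) -> float:
    # Weighted mean of per-field similarities (assumed combiner, for intuition)
    total = sum(weights[f] for f in field_scores)
    return sum(field_scores[f] * weights[f] for f in field_scores) / total

weights = {"ADDRESSTEXT": 1.5, "NAMESURNAME": 1.0, "TOWN": 1.0, "DISTRICT": 0.3}

# Strong address + weak name still lands near the review zone
score = weighted_score(
    {"ADDRESSTEXT": 0.98, "NAMESURNAME": 0.60, "TOWN": 1.0, "DISTRICT": 1.0},
    weights,
)
print(round(score, 3))  # 0.887
```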

Results

Total records:   99,998
Total clusters:  11,708
Scored pairs:    89,022
Match rate:      51.1%
Runtime:         111.52s

11,708 clusters containing 87,646 records: GoldenMatch pulled almost nine in ten accounts into some potential duplicate cluster. That's a lot. Let's look at the score distribution to understand why.

Score Distribution

| Score Range | Pairs |
|---|---|
| ≥ 0.95 | 199 |
| 0.90 – 0.95 | 1,390 |
| 0.85 – 0.90 | 2,670 |
| 0.80 – 0.85 | 14,877 |
| 0.75 – 0.80 | 25,954 |
| 0.70 – 0.75 | 43,932 |

The distribution is bottom-heavy: nearly 80% of pairs score below 0.80. This is the name collision pressure we saw in GoldenCheck: when more than half of all accounts share a name with someone else, and Turkish neighborhoods reuse the same names (FATIH MAH., YENI MAH., CUMHURIYET MAH. exist in every city), fuzzy matching generates enormous volumes of low-confidence pairs.

What the Top Pairs Look Like

The 199 pairs scoring above 0.95 are where the real signal lives:

Same name, same neighborhood, neighboring house numbers (0.992):

"Melike Korkmaz" — BAHCELIEVLER MAH. 1300. SOKAK 01285 YUREGIR/ADANA
"Melike Korkmaz" — BAHCELIEVLER MAH. 1400. SOKAK 01285 YUREGIR/ADANA

Same name, same mahalle, same postal code, same city. House numbers differ by 100 — could be the same apartment complex.

Middle-name variant, exact same address (0.985):

"Yasemin Dilan Bozkurt" — NARLIKUYU-KIZILISALI MAH. ATATURK CADDESI SILIFKE/MERSIN
"Yasemin Bozkurt" — NARLIKUYU-KIZILISALI MAH. ATATURK CADDESI SILIFKE/MERSIN

Identical address. One record includes the middle name "Dilan", the other doesn't. In a real CRM, this is a textbook duplicate — the same person registered twice with slightly different name entries.

Surname truncation (0.981):

"Hakan Koçak" — YAYLA MAH. 1546. SOKAK 06220 KECIOREN/ANKARA
"Hakan Koç" — YAYLA MAH. 1468. SOKAK 06220 KECIOREN/ANKARA

"Koçak" vs "Koç" — the longer surname contains the shorter one. Same mahalle, same postal code.

And the false positives start at ~0.97:

"Kerem Kıvanç" — KARAKOPRU-SENEVLER MAH. 6280. SOKAK SANLIURFA
"Kerem Akın" — KARAKOPRU-SENEVLER MAH. 6100. SOKAK SANLIURFA

Same first name, same neighborhood, neighboring house numbers — but completely different surnames. The address similarity (weight 1.5×) is strong enough to push the score above 0.97 even though the surnames don't match. This is a false positive.

The Honest Assessment

This dataset is synthetic. The creator assigned each customer a unique USERID with no intentional duplicates. The "duplicates" GoldenMatch finds are coincidental collisions — real Turkish names that happen to share neighborhoods. In production data, you'd see the same patterns but with actual duplicate accounts mixed in.

The takeaway isn't "GoldenMatch found 11,708 duplicates" — it's that at 100K accounts with Turkish name distribution, the matching pipeline surfaces 199 high-confidence pairs that warrant human review, with a clear score cliff between likely duplicates (≥ 0.95) and coincidental neighbors (< 0.95).

A production workflow would:

  1. Auto-merge pairs above 0.98 (same name + same address = clearly the same person)
  2. Queue for review pairs from 0.90–0.98 (name variants, address overlap — needs a human look)
  3. Discard below 0.90 (too much noise from common names)
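That triage policy is a few lines; the thresholds below are the ones from the workflow above, not anything GoldenMatch enforces:

```python
def triage(pairs, merge_at=0.98, review_at=0.90):
    # pairs: iterable of (id_a, id_b, score) tuples
    auto_merge, review, discard = [], [], []
    for a, b, score in pairs:
        if score >= merge_at:
            auto_merge.append((a, b))
        elif score >= review_at:
            review.append((a, b))
        else:
            discard.append((a, b))
    return auto_merge, review, discard

# Hypothetical scored pairs
merged, queued, dropped = triage([
    ("u1", "u2", 0.992),   # same name + same address
    ("u3", "u4", 0.95),    # middle-name variant
    ("u5", "u6", 0.78),    # common-name noise
])
print(len(merged), len(queued), len(dropped))  # 1 1 1
```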

Stage 4: Golden Records — Surfacing Hidden Revenue

Finding duplicates is half the story. The payoff is what happens when you merge them: GoldenMatch builds a golden record for each cluster and you reassign all orders to the canonical customer. Suddenly, fragmented purchase histories become a single, complete view.

# result.golden contains one canonical record per cluster
# result.clusters maps cluster_id → member row indices
print(result.golden.shape)
# (11708, 13) — one golden record per cluster

# Filter to high-confidence, small clusters (real signal)
good_clusters = {
    cid: info for cid, info in result.clusters.items()
    if 2 <= info["size"] <= 4 and info["confidence"] > 0.9
}
# 436 clusters, 881 accounts

Filtering to 436 high-confidence clusters (size 2–4, confidence > 0.9) and joining their orders back to the golden record:

Before vs After: Three Examples

Cluster 1291 — "Melike Korkmaz" (score: 0.992)

| | Account A | Account B | Golden Record |
|---|---|---|---|
| Name | Melike Korkmaz | Melike Korkmaz | Melike Korkmaz |
| Address | BAHCELIEVLER MAH. 1300. SOKAK | BAHCELIEVLER MAH. 1400. SOKAK | BAHCELIEVLER MAH. |
| Orders | 108 | 104 | 212 |
| Spend | 135,277 TRY | 129,099 TRY | 264,376 TRY |

Before: two mid-tier accounts. After: a top-5% customer with 264K TRY in lifetime spend — invisible to any segmentation model that saw them as separate entities.

Cluster 59754 — "Ferhat Uçar" household (score: 0.966)

Three accounts at the same postal code (54400 AKYAZI/SAKARYA), two named "Ferhat":

| | Ferhat Kürşat Uçar | Feride Esila Kılıç | Ferhat Uçar | Golden Record |
|---|---|---|---|---|
| Orders | 95 | 101 | 102 | 298 |
| Spend | 123,995 TRY | 134,928 TRY | 117,044 TRY | 375,967 TRY |

A family unit shopping under three accounts. The golden record reveals +179% spend lift — from a 135K individual to a 376K household.
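The lift number is measured against the largest single account, not the sum of fragments; a quick check of the arithmetic:

```python
def spend_lift(member_spends):
    # Golden-record spend vs. the best view any single account gave you
    merged = sum(member_spends)
    return merged / max(member_spends) - 1.0

# Cluster 59754: the three household accounts above
lift = spend_lift([123_995, 134_928, 117_044])
print(f"+{lift:.0%}")  # +179%
```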

Cluster 3207 — "Yasemin" (score: 0.969)

| | Yasemin Keskin | Yasemin Akın | Golden Record |
|---|---|---|---|
| Address | ALTINBASAK-PINARBASI MAH. 71. SOKAK | ALTINBASAK-PINARBASI MAH. 75. SOKAK | ALTINBASAK-PINARBASI MAH. |
| Orders | 106 | 123 | 229 |
| Spend | 160,620 TRY | 160,602 TRY | 321,222 TRY |

Same mahalle, neighboring houses (71 vs 75), different surnames. Without ER, each account sits exactly at the 160K mark — neither qualifies as a VIP. Merged, they cross 320K and enter a completely different tier.

Aggregate Impact

Across all 436 high-confidence clusters:

| Metric | Value |
|---|---|
| Accounts merged | 881 |
| Hidden orders surfaced | 42,999 |
| Previously visible revenue | 62.3M TRY |
| Newly surfaced revenue | 53.9M TRY |
| Average spend lift per cluster | +87% |

53.9 million TRY in customer spend was invisible before the merge — fragmented across accounts that no single-account query would ever connect. That's not a data quality exercise. That's a revenue discovery exercise.

What This Means in Practice

A loyalty program sees each fragment as a separate, mediocre customer. Neither gets the VIP treatment, neither gets the retention outreach, neither gets the volume discount. After ER, the consolidated account clears thresholds none of its fragments could reach alone.

The golden record isn't just cleaner data. It's a different understanding of who your customers are.

The Numbers

| Stage | Input | Output | Time |
|---|---|---|---|
| GoldenCheck (Customers) | 99,998 rows × 11 cols | 67 findings (14 medium) | 1.74s |
| GoldenCheck (Orders) | 10.2M rows × 6 cols | 29 findings (3 medium) | 4.30s |
| GoldenFlow (Customers) | 99,998 rows | 67,406 names normalized, 99,997 addresses cleaned | 4.56s |
| GoldenFlow (Orders) | 10.2M decimal fixes | 3.4M precision artifacts fixed | 0.11s |
| GoldenMatch (Customers) | 99,998 accounts | 11,708 clusters, 199 high-confidence pairs | 111.52s |
| Golden Records | 436 high-conf clusters | 881 accounts → 436 golden records, +53.9M TRY surfaced | <1s |

Total wall time for the full pipeline: ~2 minutes on a laptop. No GPUs, no cluster, no Spark. The scale funnel — 51M line items → 10.2M orders → 99,998 customers → 436 golden records → 53.9M TRY in previously hidden revenue — is the story of progressive data reduction, from raw transactional noise to actionable customer intelligence.

What We Learned

Turkish diacritics are not noise. GoldenFlow's normalize_unicode stripped ş→s, ö→o, ç→c across 67K names. For English data, that's a safe default. For Turkish (and any language with meaningful diacritics — German, Vietnamese, Polish), you should override it. The diacritics carry information that fuzzy matching can handle on its own.

European decimals hide at scale. One-third of 10.2M orders had floating-point precision artifacts from the comma-to-period conversion. GoldenCheck caught the symptom (string column that looks numeric, inconsistent digit patterns). The fix is trivial, but the detection matters — in a pipeline that auto-casts types, these artifacts silently corrupt aggregations.

Name collision pressure defines your threshold. In a US-style CRM with Anglo names, a 0.80 threshold might be precise enough. In Turkish data with a concentrated name distribution, the same threshold produces 40K+ false positive pairs. The score distribution tells you where to draw the line — here, 0.95 separates signal from noise.

Blocking strategy matters more than scoring. The first attempt used single-field blocking on CITY. Istanbul alone produced an 18,437-record block that exhausted memory (O(n²) comparison matrix). Auto-config's multi-pass strategy — CITY+NAME, CITY+substring, soundex — kept blocks small while maintaining recall.

The full Golden Suite pipeline — profile, clean, deduplicate — processed 10.3 million records in two minutes. The honest result isn't a triumphant F1 score against ground truth (there is none). It's a ranked list of 89,022 candidate pairs, stratified by confidence, with the 199 highest-scoring pairs representing genuinely suspicious account overlaps that any CRM administrator would want to review.

That's what entity resolution looks like on real-shaped data: not a binary "duplicate / not duplicate" answer, but a confidence-ranked queue that lets you decide where to draw the line.