Product catalog deduplication
Deduplicate SKUs across e-commerce + ERP + supplier feeds. The retail/marketplace playbook.
Note: Golden Suite is MDM, not PIM. This guide is about the deduplication side of product master data — collapsing duplicate SKUs across systems — not full PIM features like multi-language descriptions, marketing assets, or retailer-specific overrides. For full PIM, see Stibo or Akeneo.
Product catalogs accumulate duplicates faster than customer or vendor records. New supplier feeds arrive monthly. The same product gets entered with slightly different titles. Bundles, variants, and parent-child relationships make matching harder. This guide walks through the deduplication side.
What product dedup actually solves
- Same product, different titles. "Nike Air Max 270, Size 10, Black" vs "Nike AirMax 270 Black US10" vs "Air Max 270 Mens Blk Size 10" are all the same product in three records. They need to collapse to one.
- Variant explosion. A "shirt in 5 colors x 3 sizes" creates 15 SKU rows that should roll up to one parent product with 15 variants. Matching needs to understand the parent-variant hierarchy.
- Cross-supplier matching. Two suppliers ship the same product under different supplier SKUs. You want one master product record with both supplier SKUs as members.
- GTIN / UPC / EAN matching. When you have the barcode, dedup is deterministic and easy. When you don't (~40% of catalogs), you fall back to fuzzy matching on title + brand + attributes.
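Barcode matching is only deterministic once the barcodes are in one comparable form. The sketch below pads UPC-12, EAN-13, and GTIN-8 values to a 14-digit GTIN and rejects values whose GS1 mod-10 check digit does not validate. The function names are illustrative and are not part of Golden Suite.

```python
from typing import Optional


def gtin_check_digit(body: str) -> int:
    """Compute the GS1 mod-10 check digit for the digits that precede it."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10


def normalize_gtin(raw: str) -> Optional[str]:
    """Return a zero-padded GTIN-14, or None if the value is not a valid GTIN."""
    # Keep ASCII digits only, so "0-36000-29145-2" and "036000291452" agree.
    digits = "".join(ch for ch in str(raw) if "0" <= ch <= "9")
    if len(digits) not in (8, 12, 13, 14):
        return None
    if gtin_check_digit(digits[:-1]) != int(digits[-1]):
        return None  # likely a typo or a non-GTIN internal code
    return digits.zfill(14)
```

With this in place, a UPC-12 from one feed and the same code stored as an EAN-13 in another compare equal, and mistyped barcodes drop out to the fuzzy path instead of producing false exact matches.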
Setup sequence
Week 1 — Source the catalog
Common product data sources:
- E-commerce platform (Shopify, WooCommerce, BigCommerce) — product table, variants, inventory
- ERP (NetSuite, SAP) — internal SKU, cost, supplier mapping
- Supplier feeds (CSV / EDI / API) — supplier SKU, MSRP, lead time, GTIN
- PIM (if you have one — Akeneo, Salsify, Stibo) — marketing-shaped descriptions, images
- Marketplaces (Amazon, Walmart, eBay) — marketplace-specific listings tied back to internal SKUs
Pick the 2-3 most authoritative sources first. Often: ERP + e-commerce + supplier feed. Save marketplace listings + PIM for downstream consumers (write golden records to PIM, don't dedupe from it).
Week 2 — Identifier-driven matching
When available, use these as exact-match blocking signals in priority order:
- GTIN / UPC / EAN — global identifier; if it matches, products are the same
- Supplier SKU + supplier — same supplier's same SKU = same product
- Internal SKU — your own canonical identifier
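One way to apply that priority order is to give each record the strongest identifier key it carries and group records by that key. This is a minimal sketch; the field names (gtin, supplier, supplier_sku, internal_sku) are assumptions about your staging schema, not a Golden Suite API.

```python
from collections import defaultdict
from typing import Optional


def exact_match_key(record: dict) -> Optional[tuple]:
    """Return the highest-priority identifier key present on a record, or None."""
    gtin = str(record.get("gtin") or "").strip()
    if gtin:
        return ("gtin", gtin)
    supplier = str(record.get("supplier") or "").strip().lower()
    supplier_sku = str(record.get("supplier_sku") or "").strip().upper()
    if supplier and supplier_sku:
        # A supplier SKU only identifies a product within that supplier.
        return ("supplier_sku", supplier, supplier_sku)
    internal_sku = str(record.get("internal_sku") or "").strip().upper()
    if internal_sku:
        return ("internal_sku", internal_sku)
    return None


def block_by_exact_key(records: list) -> tuple:
    """Group records by exact key; records with no identifier are returned separately."""
    blocks = defaultdict(list)
    unkeyed = []
    for record in records:
        key = exact_match_key(record)
        if key is None:
            unkeyed.append(record)  # candidates for the fuzzy fallback
        else:
            blocks[key].append(record)
    return dict(blocks), unkeyed
```

A limitation of this simple version: a record keyed by GTIN and a record keyed only by the same supplier SKU land in different blocks. Linking those requires comparing on every available key and merging the resulting groups, which is left out here for brevity.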
When GTIN/UPC isn't present (it often isn't for private-label or smaller suppliers), fall back to:
- Brand + model number — fuzzy match on brand, exact on model
- Title + brand — fuzzy title match scoped to same brand
- Title + attributes — color, size, dimensions extracted from title
The trap: free-text title matching alone is noisy. "iPhone 15 Pro Max 256GB Titanium Black" scores high on fuzzy similarity against "iPhone 15 Pro 256GB Black", but they are different products. Always combine the title score with at least one constraining attribute, such as model, size, or color.
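The sketch below illustrates the trap and one way around it, using Python's standard difflib for the title score. The attribute names (brand, model, size, color) and the 0.75 threshold are assumptions chosen for illustration; tune both against your own labeled pairs.

```python
from difflib import SequenceMatcher

# Hypothetical attribute names; use whatever your extraction step produces.
CONSTRAINTS = ("model", "size", "color")


def norm(value) -> str:
    """Lowercase and collapse whitespace so formatting noise does not affect equality."""
    return " ".join(str(value or "").lower().split())


def title_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def is_candidate_match(a: dict, b: dict, threshold: float = 0.75) -> bool:
    """Accept a fuzzy title match only when brand and constraining attributes agree."""
    if not norm(a.get("brand")) or norm(a.get("brand")) != norm(b.get("brand")):
        return False
    shared = [k for k in CONSTRAINTS if norm(a.get(k)) and norm(b.get(k))]
    if not shared:
        return False  # nothing constrains the fuzzy score, so do not trust it
    if any(norm(a[k]) != norm(b[k]) for k in shared):
        return False  # a conflicting attribute overrides a high title score
    return title_similarity(a["title"], b["title"]) >= threshold


pro_max = {"brand": "Apple", "model": "15 Pro Max",
           "title": "iPhone 15 Pro Max 256GB Titanium Black"}
pro = {"brand": "Apple", "model": "15 Pro",
       "title": "iPhone 15 Pro 256GB Black"}
```

Here the two iPhone titles clear the title threshold on their own, but is_candidate_match(pro_max, pro) rejects the pair because the model values differ.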
Week 3 — Variants + parent-child
Product MDM has a structure customer MDM doesn't: parent products and variants. A "shirt" has color + size variants. The dedup pipeline needs to handle this.
Two patterns work:
- Flat matching, post-hoc rollup. Run dedup on all SKUs flat. After clustering, post-process to identify variant groups (same brand + base title + differing attributes) and create parent records.
- Hierarchical matching. Pre-process to extract (parent_title, variant_attrs) per SKU. Match on parent_title; treat variants as members of the parent cluster.
Hierarchical matching is cleaner; flat matching with post-hoc rollup is simpler to implement. Start with flat matching.
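The post-hoc rollup step of flat matching can be sketched as follows: strip each SKU's known variant attribute values out of its title, then group the SKUs that share a brand and the remaining base title. The record fields (sku, brand, title, color, size) are assumed for illustration, and the input is expected to be the already-deduplicated SKU list.

```python
import re
from collections import defaultdict

# Hypothetical variant attributes; extend with material, pack size, and so on.
VARIANT_ATTRS = ("color", "size")


def base_title(sku: dict) -> str:
    """Remove the SKU's own variant values and punctuation from its title."""
    title = sku["title"].lower()
    for attr in VARIANT_ATTRS:
        value = str(sku.get(attr) or "").lower()
        if value:
            # Whole-token removal, so size "l" does not eat the "l" in "blue".
            title = re.sub(rf"\b{re.escape(value)}\b", " ", title)
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title).split())


def roll_up_variants(skus: list) -> list:
    """Group SKUs into parent records while keeping every variant as a member."""
    groups = defaultdict(list)
    for sku in skus:
        groups[(sku["brand"].lower(), base_title(sku))].append(sku)
    return [
        {
            "brand": brand,
            "parent_title": title,
            "variants": [
                {"sku": m["sku"], **{a: m.get(a) for a in VARIANT_ATTRS}}
                for m in members
            ],
        }
        for (brand, title), members in groups.items()
    ]
```

Each variant stays addressable inside its parent, so inventory can still be tracked per color and size. The approach depends on the variant values having been extracted correctly; a title that spells the color differently from the attribute ("Blk" vs "Black") will not roll up without an extra normalization step.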
Week 4 — Surface to downstream
Once you have deduplicated golden products:
- Push to PIM (if applicable) as the canonical product list. PIM owns the marketing-shape; you own the dedup.
- Update e-commerce with merged SKUs (use redirects for old URLs!).
- Reconcile inventory across systems — once "the same product" is one record, inventory aggregation gets easy.
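As a rough illustration of the last two steps, the sketch below derives a redirect map from merge clusters and uses the same map to aggregate inventory onto surviving SKUs. The cluster shape (surviving SKU mapped to its member SKUs) is an assumption about how you export merge results, not a documented output format.

```python
from collections import defaultdict


def build_redirects(clusters: dict) -> dict:
    """Map each retired SKU to the surviving SKU of its cluster."""
    redirects = {}
    for survivor, members in clusters.items():
        for sku in members:
            if sku != survivor:
                redirects[sku] = survivor
    return redirects


def aggregate_inventory(rows: list, redirects: dict) -> dict:
    """Sum (sku, quantity) rows from any system onto surviving SKUs."""
    totals = defaultdict(int)
    for sku, quantity in rows:
        totals[redirects.get(sku, sku)] += quantity
    return dict(totals)
```

The clusters here are duplicates of the same sellable item. Variants of one parent are not merged this way, so their inventory stays separate.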
Survivorship for product fields
- brand: source priority (PIM > ERP > supplier feed)
- title: source priority (PIM > ecommerce); these are marketing-curated
- description: most complete from PIM
- gtin: exact-match across sources; if disagreement, escalate (probably a data-entry error)
- cost: most recent from ERP (this changes; you want latest)
- msrp: most recent from supplier feed
- weight + dimensions: most complete (sources without measurements default to NULL)
- category: source priority (PIM > ecommerce > supplier feed)
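These rules can be written as a small set of per-field resolvers. The sketch below is one possible shape, not Golden Suite's configuration format: the source labels (pim, erp, ecommerce, supplier), the updated_at field, and the GtinConflict exception are all assumptions made for the example. The description rule is omitted for brevity.

```python
from datetime import date

SOURCE_PRIORITY = {
    "brand": ["pim", "erp", "supplier"],
    "title": ["pim", "ecommerce"],
    "category": ["pim", "ecommerce", "supplier"],
}
MOST_RECENT_FROM = {"cost": "erp", "msrp": "supplier"}
MEASUREMENTS = ("weight", "length", "width", "height")


class GtinConflict(Exception):
    """Raised when sources disagree on GTIN, so a person can review the cluster."""


def by_priority(records, field):
    """First non-empty value, walking sources in priority order."""
    for source in SOURCE_PRIORITY[field]:
        for record in records:
            if record["source"] == source and record.get(field):
                return record[field]
    return None


def most_recent(records, field, source):
    """Latest non-null value of a field from one designated source."""
    candidates = [r for r in records
                  if r["source"] == source and r.get(field) is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["updated_at"])[field]


def most_complete_measurements(records):
    """Take measurements from the record with the most of them filled in."""
    best = max(records, key=lambda r: sum(r.get(m) is not None for m in MEASUREMENTS))
    return {m: best.get(m) for m in MEASUREMENTS}  # missing values stay None


def resolve_gtin(records):
    values = {r["gtin"] for r in records if r.get("gtin")}
    if len(values) > 1:
        raise GtinConflict(f"sources disagree on GTIN: {sorted(values)}")
    return next(iter(values), None)


def build_golden(records):
    """Apply the survivorship rules to one non-empty cluster of source records."""
    golden = {field: by_priority(records, field) for field in SOURCE_PRIORITY}
    for field, source in MOST_RECENT_FROM.items():
        golden[field] = most_recent(records, field, source)
    golden.update(most_complete_measurements(records))
    golden["gtin"] = resolve_gtin(records)
    return golden


cluster = [
    {"source": "erp", "brand": "Nike", "cost": 50.0, "updated_at": date(2024, 1, 1)},
    {"source": "erp", "brand": "Nike", "cost": 55.0, "weight": 0.4,
     "updated_at": date(2024, 3, 1)},
    {"source": "pim", "brand": "Nike Inc.", "title": "Air Max 270",
     "category": "Shoes", "updated_at": date(2024, 2, 1)},
    {"source": "supplier", "brand": "NIKE", "msrp": 150.0,
     "gtin": "00036000291452", "updated_at": date(2024, 2, 15)},
]
golden = build_golden(cluster)
```

Raising on a GTIN disagreement, rather than picking a winner, keeps the "escalate" rule explicit: the cluster is held back until someone decides which barcode is wrong.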
Common pitfalls
- Auto-merging on fuzzy title alone. Almost always wrong. Require at least one constraining attribute (brand + model, brand + size, brand + color).
- Ignoring variant relationships. If you collapse two variants into one record, you lose inventory tracking. Always preserve variant identity within the parent cluster.
- Treating supplier feeds as ground truth. Suppliers re-use SKUs, change formats, and sometimes ship dirty data. Use supplier SKU as a strong signal but not gospel.
- One-time dedup with no ongoing process. Catalogs grow weekly; rerun the pipeline at least weekly, ideally daily.
When NOT to use Golden Suite for product MDM
- You need full PIM features. Multi-language descriptions, retailer-specific overrides, digital asset management — those are PIM territory. Stibo, Akeneo, Salsify.
- You need real-time deduplication. Today's pipeline is batch (hourly to daily). If you need "merge this SKU as it's being added in real time," you need a different shape.
- You only have 500 SKUs. Spreadsheet + manual review beats any tool.
Next steps
- Compare: Stibo Systems — the leading PIM + MDM combo
- Concept: entity resolution
- Glossary: schema inference (/glossary/schema-inference), for auto-mapping columns from a new supplier feed