Streaming vs batch entity resolution
Batch ER processes a snapshot; streaming ER handles records as they arrive. Tradeoffs, when each fits, and how late-arriving data complicates both.
Entity resolution runs in two operational modes. Batch mode processes a snapshot of records as one job. Streaming mode handles records as they arrive, incrementally matching each new record against the existing golden source.
Batch vs streaming at a glance
| | Batch ER | Streaming ER |
|---|---|---|
| Trigger | Scheduled or manual | Per-record event |
| Throughput | High (millions per run) | Lower per-record but continuous |
| Latency | Hours to days | Seconds |
| Re-clustering | Full recompute per run | Incremental, with periodic rebuilds |
| Best for | Historical loads, periodic refresh, MDM bootstrap | Real-time onboarding, fraud screening, live deduplication |
Late-arriving data complicates both
The hard case is a record that arrives after a cluster has already been built. In batch mode you rerun the job and the cluster re-forms around the new record. In streaming mode you have to choose between (a) creating a new entity and merging later, (b) probabilistically attaching to an existing entity immediately, or (c) holding the record in a quarantine queue until enough confidence accrues. Each choice trades latency for correctness. There is no universally right answer: the right trade-off depends on how expensive a wrong merge is versus how long the downstream system can tolerate an unresolved record.
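The three streaming options can be read as a routing decision on the best match score for the incoming record. The sketch below is illustrative only: `Action`, `Policy`, `route`, and the threshold values are hypothetical names and numbers, not part of any product API, and it assumes the matcher produces a score in [0, 1].

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ATTACH = "attach"          # option (b): attach to an existing entity now
    NEW_ENTITY = "new_entity"  # option (a): create an entity, merge later if needed
    QUARANTINE = "quarantine"  # option (c): hold until more evidence arrives


@dataclass(frozen=True)
class Policy:
    # Hypothetical thresholds; in practice, tune them against labelled match data.
    attach_at: float = 0.90  # at or above: confident enough to attach
    new_below: float = 0.50  # below: confident enough that this is a new entity


def route(best_score: float, policy: Policy = Policy()) -> Action:
    """Decide what to do with a late record given its best match score in [0, 1]."""
    if not 0.0 <= best_score <= 1.0:
        raise ValueError("best_score must be in [0, 1]")
    if best_score >= policy.attach_at:
        return Action.ATTACH
    if best_score < policy.new_below:
        return Action.NEW_ENTITY
    return Action.QUARANTINE  # ambiguous middle band


# A system where wrong merges are costly can raise the attach threshold,
# which pushes more records into quarantine and increases latency.
strict = Policy(attach_at=0.98)
```

Widening the band between the two thresholds favors correctness at the cost of more unresolved records. Narrowing it favors latency at the cost of more wrong merges or duplicate entities.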
When each fits
Pick batch when the underlying data refreshes on a schedule (nightly customer feed, weekly product catalog) and downstream consumers can tolerate stale matches between runs. Pick streaming when matches need to settle within seconds (real-time fraud, patient registration, sanctions screening) or when correcting bad matches downstream is expensive. Most production systems run streaming on the live path with a periodic batch rebuild that fixes drift accumulated since the last full recompute.
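To see why the periodic rebuild matters, consider a toy model in which two records match when they share an email or phone. This is a minimal sketch under that assumption: `keys`, `resolve_batch`, and `StreamingResolver` are hypothetical helpers written for illustration, and the streaming path deliberately never merges existing entities so that the drift is visible.

```python
from __future__ import annotations

from collections import defaultdict


def keys(record: dict) -> set[str]:
    """Toy match keys: records match if they share an email or a phone."""
    return {f"{f}:{record[f].strip().lower()}" for f in ("email", "phone") if record.get(f)}


def resolve_batch(records: list[dict]) -> set[frozenset[str]]:
    """Full recompute: union-find over every pair of records sharing a key."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen: dict[str, str] = {}
    for r in records:
        for k in keys(r):
            if k in first_seen:
                parent[find(r["id"])] = find(first_seen[k])
            else:
                first_seen[k] = r["id"]

    clusters = defaultdict(set)
    for r in records:
        clusters[find(r["id"])].add(r["id"])
    return {frozenset(c) for c in clusters.values()}


class StreamingResolver:
    """Live path: attach each record to the first entity it matches; never merge."""

    def __init__(self) -> None:
        self.key_to_entity: dict[str, int] = {}
        self.members: dict[int, set[str]] = {}

    def add(self, record: dict) -> int:
        ks = keys(record)
        hits = sorted({self.key_to_entity[k] for k in ks if k in self.key_to_entity})
        entity = hits[0] if hits else len(self.members)  # new id when nothing matches
        self.members.setdefault(entity, set()).add(record["id"])
        for k in ks:
            self.key_to_entity.setdefault(k, entity)
        return entity

    def clusters(self) -> set[frozenset[str]]:
        return {frozenset(m) for m in self.members.values()}


records = [
    {"id": "r1", "email": "ann@example.com"},
    {"id": "r2", "phone": "555-0100"},
    {"id": "r3", "email": "ann@example.com", "phone": "555-0100"},  # bridges r1 and r2
]

live = StreamingResolver()
for r in records:
    live.add(r)

drifted = live.clusters()         # r2 stays in its own entity: the bridge arrived too late
rebuilt = resolve_batch(records)  # the rebuild sees the bridge and merges all three
```

Here the live path ends with two entities, `{r1, r3}` and `{r2}`, while the batch rebuild over the same records produces a single entity. A real streaming path could merge entities on the fly, but that only shifts where the drift comes from; the rebuild remains the point at which accumulated errors are corrected.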
In Golden Suite
Today goldenmatch is batch-first. The Enterprise tier exposes incremental match APIs and a streaming connector path. The underlying engine accepts single-record inputs, so the architecture supports both modes from the same core without rewriting your match configuration.