Streaming vs batch entity resolution

Batch ER processes a snapshot; streaming ER handles records as they arrive. Tradeoffs, when each fits, and how late-arriving data complicates both.

Entity resolution runs in two operational modes. Batch mode processes a snapshot of records as one job. Streaming mode handles records as they arrive, incrementally matching each new record against the existing golden source.
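The contrast between the two modes can be sketched in a few lines. This is a minimal illustration, not a production matcher: the name-similarity function, the 0.85 threshold, and the names `resolve_batch` and `StreamingResolver` are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Toy matcher (assumption): case-insensitive string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def resolve_batch(records: list[str]) -> list[list[str]]:
    """Batch mode: cluster a whole snapshot using union-find over all pairs."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similar(records[i], records[j]):
                parent[find(j)] = find(i)

    clusters: dict[int, list[str]] = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    return list(clusters.values())


class StreamingResolver:
    """Streaming mode: match each arriving record against existing entities."""

    def __init__(self) -> None:
        self.entities: list[list[str]] = []

    def add(self, record: str) -> int:
        """Attach to the first matching entity, or create a new one."""
        for entity_id, members in enumerate(self.entities):
            if any(similar(record, m) for m in members):
                members.append(record)
                return entity_id
        self.entities.append([record])
        return len(self.entities) - 1
```

The batch function sees every pair at once, while the streaming resolver only ever compares the new record with what already exists, which is why it answers quickly but can miss links a full recompute would find.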

Batch vs streaming at a glance

              | Batch ER                                          | Streaming ER
Trigger       | Scheduled or manual                               | Per-record event
Throughput    | High (millions per run)                           | Lower per-record but continuous
Latency       | Hours to days                                     | Seconds
Re-clustering | Full recompute per run                            | Incremental, with periodic rebuilds
Best for      | Historical loads, periodic refresh, MDM bootstrap | Real-time onboarding, fraud screening, live deduplication

Late-arriving data complicates both

The hard case is a record that arrives after a cluster has already been built. In batch mode you rerun the job and the cluster is rebuilt with the new record included. In streaming mode you have to choose between (a) creating a new entity and merging later, (b) probabilistically attaching to an existing entity immediately, or (c) holding the record in a quarantine queue until enough confidence accrues. Each choice trades latency for correctness. There is no universally right answer: the right trade-off depends on how expensive a wrong merge is versus how long the downstream system can tolerate an unresolved record.
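The three streaming options can be expressed as a routing policy. The sketch below is hypothetical throughout: the `Policy` and `LateArrivalRouter` names, the two thresholds, and the assumption that an upstream matcher supplies a best candidate entity and a score are all illustrative choices, not part of any particular product.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Policy(Enum):
    NEW_ENTITY = "new_entity"  # (a) create a new entity now, merge later
    ATTACH = "attach"          # (b) attach to the best candidate immediately
    QUARANTINE = "quarantine"  # (c) hold until enough confidence accrues


@dataclass
class LateArrivalRouter:
    policy: Policy
    confident: float = 0.9    # hypothetical: at or above this, always attach
    min_attach: float = 0.5   # hypothetical: floor for policy (b)
    entities: dict[str, list[str]] = field(default_factory=dict)
    pending_merges: list[tuple[str, str]] = field(default_factory=list)
    quarantine: list[str] = field(default_factory=list)

    def route(self, record_id: str, best_entity: Optional[str], score: float) -> str:
        """Decide what happens to a record given its best candidate match."""
        attach_now = best_entity is not None and (
            score >= self.confident
            or (self.policy is Policy.ATTACH and score >= self.min_attach)
        )
        if attach_now:
            self.entities.setdefault(best_entity, []).append(record_id)
            return "attached"
        if best_entity is None or self.policy is not Policy.QUARANTINE:
            # No candidate, or policy (a)/(b) with a weak score: new entity.
            self.entities[record_id] = [record_id]
            if best_entity is not None and self.policy is Policy.NEW_ENTITY:
                # Remember the candidate so a later job can review the merge.
                self.pending_merges.append((record_id, best_entity))
            return "created"
        self.quarantine.append(record_id)
        return "quarantined"
```

With an uncertain score such as 0.7, the same record is attached under `ATTACH`, becomes its own entity with a pending merge under `NEW_ENTITY`, and waits in the queue under `QUARANTINE`. Only the first two give the downstream system an answer immediately.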

When each fits

Pick batch when the underlying data refreshes on a schedule (nightly customer feed, weekly product catalog) and downstream consumers can tolerate stale matches between runs. Pick streaming when matches need to settle within seconds (real-time fraud, patient registration, sanctions screening) or when correcting bad matches downstream is expensive. Most production systems run streaming on the live path with a periodic batch rebuild that fixes drift accumulated since the last full recompute.
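The hybrid pattern can be sketched as a resolver that attaches records greedily on the live path and recomputes all clusters every N records. Everything here is an assumption for illustration: records are modeled as sets of identifiers (an email, a phone number), two records match when they share an identifier, and `HybridResolver` and `rebuild_every` are made-up names.

```python
class HybridResolver:
    """Streaming on the live path, with a periodic full rebuild."""

    def __init__(self, rebuild_every: int = 1000) -> None:
        self.rebuild_every = rebuild_every
        self.entities: list[list[frozenset[str]]] = []
        self._since_rebuild = 0

    def add(self, record: frozenset[str]) -> None:
        """Live path: attach to the first entity that shares an identifier."""
        for members in self.entities:
            if any(record & m for m in members):
                members.append(record)
                break
        else:
            self.entities.append([record])
        self._since_rebuild += 1
        if self._since_rebuild >= self.rebuild_every:
            self.rebuild()

    def rebuild(self) -> None:
        """Batch path: recompute clusters from scratch as connected components."""
        records = [r for members in self.entities for r in members]
        clusters: list[tuple[set[str], list[frozenset[str]]]] = []
        for rec in records:
            ids, merged = set(rec), [rec]
            untouched = []
            for cluster_ids, cluster_members in clusters:
                if cluster_ids & rec:
                    # The record bridges this cluster: fold it in.
                    ids |= cluster_ids
                    merged = cluster_members + merged
                else:
                    untouched.append((cluster_ids, cluster_members))
            clusters = untouched + [(ids, merged)]
        self.entities = [members for _, members in clusters]
        self._since_rebuild = 0
```

Drift shows up when a late record links two entities that were created separately. If a record with only an email and a record with only a phone number arrive first, and a third record carrying both arrives later, the live path attaches the third record to one entity and leaves the other orphaned. The rebuild merges all three.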

In Golden Suite

Today goldenmatch is batch-first. The Enterprise tier exposes incremental match APIs and a streaming connector path. The underlying engine accepts single-record inputs, so the architecture supports both modes from the same core without rewriting your match configuration.