The identity graph

How matching is becoming an explicit edge graph: records are nodes, matches are edges, clusters are connected components — with provenance and a parity gate.

Entity resolution produces clusters — groups of records that refer to the same real-world entity. Under the hood, a cluster is just a connected component of a graph: records are nodes, and a "these two match" decision is an edge. The identity graph makes that graph explicit and durable instead of leaving it implicit in the clustering output.

Records, edges, clusters

Node — a raw record (a row from one of your sources).
Edge — a scored match between two records, with the matchkey that produced it.
Cluster — a connected component: follow the edges transitively and every record you can reach is the same entity.

The resolver already computes clusters. The identity graph additionally persists the edges that justify each cluster, so the partition is reproducible from first principles: run connected-components over the retained edges and you get exactly the materialized clusters back.
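
A minimal sketch of that reproducibility claim, assuming a hypothetical `Edge` record (the field names `left`, `right`, `score` and `matchkey`, and the sample matchkey names, are illustrative and not the engine's actual schema). It recomputes clusters from retained edges with a small union-find:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    # Hypothetical shape of a persisted edge; field names are assumptions.
    left: str
    right: str
    score: float
    matchkey: str


def connected_components(nodes, edges):
    """Group nodes into clusters by following edges transitively (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for e in edges:
        root_left, root_right = find(e.left), find(e.right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return {frozenset(members) for members in clusters.values()}


records = ["a1", "a2", "b1", "c1"]
edges = [
    Edge("a1", "a2", 0.97, "email_exact"),      # illustrative matchkey name
    Edge("a2", "b1", 0.88, "name_dob_fuzzy"),   # illustrative matchkey name
]
components = connected_components(records, edges)
```

Here `a1` and `b1` never matched directly, yet they land in the same component because both are linked to `a2`. The record `c1` has no edges and forms a singleton cluster. The sketch assumes every edge endpoint is a known record.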

Why make it explicit

Keeping matches as first-class edges (rather than only the final cluster ids) buys three things:

Provenance — for any entity you can answer "which pairwise decisions, with which scores and matchkeys, produced this cluster?" That powers the lineage view end to end.
A parity gate — because connected-components-over-edges should equal the materialized partition, the two can be cross-checked every run. A divergence is a real signal (a clustering bug, a stale edge), surfaced rather than silently absorbed.
A shared substrate — different match producers can write into the same graph. Resolver matches and PPRL matches both become edges; a kind discriminator keeps them isolated so a privacy-preserving linkage can never perturb the resolve graph or its parity check.
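
To illustrate provenance and the kind discriminator together, here is a small sketch. The `Edge` fields and the helper names `edges_of_kind` and `provenance` are assumptions made for this example. Only the `kind` values `'resolve'` and `'pprl'` come from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    # Hypothetical edge shape; only the two kind values come from the docs.
    left: str
    right: str
    score: float
    matchkey: str
    kind: str  # "resolve" or "pprl"


def edges_of_kind(edges, kind):
    """Keep only edges written by one producer, so graphs never mix."""
    return [e for e in edges if e.kind == kind]


def provenance(cluster, edges, kind="resolve"):
    """Return the pairwise decisions of one kind that justify a cluster."""
    members = set(cluster)
    return [
        e for e in edges_of_kind(edges, kind)
        if e.left in members and e.right in members
    ]


edges = [
    Edge("a1", "a2", 0.97, "email_exact", "resolve"),
    Edge("a2", "b1", 0.88, "name_dob_fuzzy", "resolve"),
    Edge("b1", "x9", 0.91, "bloom_dice", "pprl"),  # cross-party match
]

justification = provenance({"a1", "a2", "b1"}, edges)
resolve_only = edges_of_kind(edges, "resolve")
```

Because every read filters on `kind` first, the `pprl` edge from `b1` to `x9` never appears in the resolve-side provenance, and it would never be fed into a resolve-side connected-components pass.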

The parity check

After a resolve run, the engine compares two views of the same data:

the materialized partition (the clusters the pipeline actually wrote), and
the connected components of the persisted edges.

If they match, parity holds. If they diverge, for example because an edge is missing and the edge-derived components split a cluster that the pipeline wrote as one, the divergence is logged with a structured signal (the number of components and the number of divergent records) and surfaced on the admin health dashboard as a parity-mismatch rate. Edges that point at records outside the current partition (e.g. a since-deleted row) are ignored, so stale references don't trigger false mismatches.
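
The comparison can be sketched as follows. This is an illustration under stated assumptions, not the engine's implementation: the function name `check_parity`, the report keys, and the definition of a "divergent record" (a record whose materialized cluster and edge-derived component have different members) are all hypothetical.

```python
def check_parity(partition, pairs):
    """Compare a materialized partition with components of persisted edges.

    partition: dict mapping record id -> materialized cluster id
    pairs:     list of (left, right) record-id tuples, one per edge
    """
    known = set(partition)
    # Ignore edges that point at records outside the current partition.
    live = [(a, b) for a, b in pairs if a in known and b in known]

    parent = {r: r for r in known}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in live:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    by_cluster, by_component = {}, {}
    for record, cluster_id in partition.items():
        by_cluster.setdefault(cluster_id, set()).add(record)
        by_component.setdefault(find(record), set()).add(record)

    # A record diverges when its two groupings have different members.
    divergent = [
        record for record, cluster_id in partition.items()
        if by_cluster[cluster_id] != by_component[find(record)]
    ]
    return {
        "parity": not divergent,
        "materialized_clusters": len(by_cluster),
        "edge_components": len(by_component),
        "divergent_records": len(divergent),
        "ignored_stale_edges": len(pairs) - len(live),
    }


partition = {"a1": "E1", "a2": "E1", "b1": "E1", "c1": "E2"}
# The a2-b1 edge is missing, and one edge points at a deleted record.
pairs = [("a1", "a2"), ("c1", "deleted-row")]
report = check_parity(partition, pairs)
healthy = check_parity(partition, pairs + [("a2", "b1")])
```

In the first call the missing edge leaves `b1` in its own component, so the three records of `E1` are reported as divergent, while the edge to the deleted row is dropped without affecting the result. Adding the missing edge restores parity.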

Status

The explicit edge layer is additive and off by default: it does not change how clustering works or what the default resolve path produces. It is the seam that richer identity backing stores (graph-native stores, cross-run stable entity ids) can plug into over time. Today its first concrete consumer is the optional PPRL edge write: a privacy-preserving linkage can record its cross-party matches as kind='pprl' edges, kept provably separate from the kind='resolve' graph.
