The identity graph
How matching is becoming an explicit edge graph: records are nodes, matches are edges, clusters are connected components — with provenance and a parity gate.
Entity resolution produces clusters — groups of records that refer to the same real-world entity. Under the hood, a cluster is just a connected component of a graph: records are nodes, and a "these two match" decision is an edge. The identity graph makes that graph explicit and durable instead of leaving it implicit in the clustering output.
Records, edges, clusters
- Node — a raw record (a row from one of your sources).
- Edge — a scored match between two records, with the matchkey that produced it.
- Cluster — a connected component: follow the edges transitively and every record you can reach is the same entity.
The resolver already computes clusters. The identity graph additionally persists the edges that justify each cluster, so the partition is reproducible from first principles: run connected-components over the retained edges and you get exactly the materialized clusters back.
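The reproducibility claim can be illustrated with a minimal sketch: given the records and the retained edges, a union-find pass recovers the clusters. The Edge fields, record ids, and matchkey names below are hypothetical, chosen only for illustration; they are not the engine's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    # Hypothetical shape: two record ids plus the score and matchkey
    # that justified the match.
    left: str
    right: str
    score: float
    matchkey: str


def connected_components(nodes, edges):
    """Return the set of clusters (frozensets of record ids) implied by edges."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for e in edges:
        root_left, root_right = find(e.left), find(e.right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return {frozenset(g) for g in groups.values()}


records = ["a1", "a2", "b1", "c1"]
edges = [
    Edge("a1", "a2", 0.97, "email_exact"),
    Edge("a2", "b1", 0.88, "name_dob_fuzzy"),
]
clusters = connected_components(records, edges)
print(sorted(sorted(c) for c in clusters))  # [['a1', 'a2', 'b1'], ['c1']]
```

Note that a1 and b1 land in the same cluster without a direct edge between them: the match is transitive through a2, which is exactly what "follow the edges" means.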
Why make it explicit
Keeping matches as first-class edges (rather than only the final cluster ids) buys three things:
- Provenance — for any entity you can answer "which pairwise decisions, with which scores and matchkeys, produced this cluster?" That powers the lineage view end to end.
- A parity gate — because connected-components-over-edges should equal the materialized partition, the two can be cross-checked every run. A divergence is a real signal (a clustering bug, a stale edge), surfaced rather than silently absorbed.
- A shared substrate — different match producers can write into the same graph. Resolver matches and PPRL (privacy-preserving record linkage) matches both become edges; a kind discriminator keeps them isolated so a privacy-preserving linkage can never perturb the resolve graph or its parity check.
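As a rough sketch of how provenance and isolation by kind might look in practice, the snippet below filters a shared edge list down to the edges that justify one cluster. The Edge fields, the provenance helper, and the sample matchkeys are assumptions made for illustration, not the engine's real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    # Hypothetical shape; `kind` separates resolver edges from PPRL edges.
    left: str
    right: str
    score: float
    matchkey: str
    kind: str  # 'resolve' or 'pprl'


def provenance(cluster, edges, kind="resolve"):
    """Return the edges of one kind whose endpoints both lie in the cluster."""
    members = set(cluster)
    return [
        e for e in edges
        if e.kind == kind and e.left in members and e.right in members
    ]


edges = [
    Edge("a1", "a2", 0.97, "email_exact", "resolve"),
    Edge("a2", "b1", 0.88, "name_dob_fuzzy", "resolve"),
    Edge("a1", "x9", 0.91, "bloom_dice", "pprl"),
]

lineage = provenance({"a1", "a2", "b1"}, edges)
for e in lineage:
    print(f"{e.left} <-> {e.right}  score={e.score}  matchkey={e.matchkey}")
```

Because every read filters on kind, the PPRL edge touching a1 never shows up in the resolve lineage, and it would likewise be excluded from any connected-components pass over kind='resolve' edges.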
The parity check
After a resolve run, the engine compares two views of the same data:
- the materialized partition (the clusters the pipeline actually wrote), and
- the connected components of the persisted edges.
If they match, parity holds. If they diverge (say an edge is missing, so the persisted edges split a cluster that the pipeline wrote as one), the divergence is logged with a structured signal (how many components, how many divergent records) and surfaced on the admin health dashboard as a parity-mismatch rate. Edges that point at records outside the current partition (e.g. a since-deleted row) are ignored, so stale references don't trigger false mismatches.
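The comparison can be sketched as follows, assuming the materialized partition is available as a mapping from record id to cluster id and the persisted edges as pairs of record ids. The function names and report fields are hypothetical; the real structured signal may carry different keys.

```python
from collections import defaultdict


def components(nodes, pairs):
    """Connected components over (left, right) pairs, via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(b)] = find(a)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return {frozenset(g) for g in groups.values()}


def parity_report(partition, pairs):
    """Compare the materialized partition with the edge-derived components."""
    # Drop edges that reference records outside the current partition.
    live = [(a, b) for a, b in pairs if a in partition and b in partition]
    derived = components(partition.keys(), live)

    by_cluster = defaultdict(set)
    for record_id, cluster_id in partition.items():
        by_cluster[cluster_id].add(record_id)
    materialized = {frozenset(m) for m in by_cluster.values()}

    # Groups present in only one view; their members are the divergent records.
    divergent = set().union(*(derived ^ materialized))
    return {
        "parity": not divergent,
        "materialized_clusters": len(materialized),
        "edge_components": len(derived),
        "divergent_records": len(divergent),
    }


partition = {"a1": "E1", "a2": "E1", "b1": "E1", "c1": "E2"}

# The edge to "gone" is stale (that record is no longer in the partition).
ok = parity_report(partition, [("a1", "a2"), ("a2", "b1"), ("c1", "gone")])

# The a2-b1 edge is missing, so the edges no longer justify cluster E1.
bad = parity_report(partition, [("a1", "a2")])
```

In the second call the edges yield three components where the pipeline wrote two clusters, and all three members of E1 are reported as divergent. A per-run mismatch rate could then be derived from the parity flag across runs.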
Status
The explicit edge layer is additive and off by default — it does not change how clustering works or what the default resolve path produces. It's the seam a richer identity backing plugs into over time (graph-native stores, cross-run stable entity ids). Today its first concrete consumer is the optional PPRL edge write: a privacy-preserving linkage can record its cross-party matches as kind='pprl' edges, kept provably separate from the kind='resolve' graph.