Link two datasets without sharing PII

Walkthrough: use the /pprl page to match records across two sources on encoded identifiers — pick two sources, choose link fields, run, and download a PII-free mapping.

This guide walks through running a privacy-preserving record linkage (PPRL) on bensevern.dev — matching records across two sources on encoded identifiers so neither side's raw PII is exposed in the result.

Reach for PPRL when plaintext can't cross a boundary but you still need the overlap: cross-hospital patient matching, sanctions screening, partner data sharing, or any "which of our records are also in their list?" question under HIPAA / GDPR constraints.

Note: For the concepts (Bloom-filter / CLK encoding, k-anonymity, frequency-attack tradeoffs), read Privacy-preserving record linkage. For the raw endpoints, see the PPRL Linkage API.

Before you start

You need two sources already ingested into the platform. PPRL currently runs in Mode A only: the linkage executes server-side over data the platform holds (see Limits). Add both datasets first:

  • Upload a CSV, or connect a SaaS source — see Source connectors.
  • The two sources must share at least one column name that can be linked on (e.g. email, name, phone). The linkage intersects your requested fields with the columns both sources actually have.
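
The field intersection described above can be sketched as follows. This is an illustration, not the platform's code: the function name `resolve_link_fields` and the sample column lists are hypothetical, and exact, case-sensitive column-name matching is an assumption.

```python
def resolve_link_fields(requested, columns_a, columns_b):
    """Keep only requested fields that exist in both sources, in request order."""
    shared = set(columns_a) & set(columns_b)
    return [field for field in requested if field in shared]


# Parse a comma-separated field list the way the form input might be entered.
requested = [f.strip() for f in "name, email, phone".split(",") if f.strip()]

# Hypothetical column sets for the two sources.
columns_a = ["id", "name", "email", "created_at"]
columns_b = ["name", "email", "phone"]

link_fields = resolve_link_fields(requested, columns_a, columns_b)
print(link_fields)  # ['name', 'email']  (phone is missing from source A)
```

If the intersection comes back empty, there is nothing to link on, so it is worth checking column names on both sides before a run.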

Run a linkage

Open /pprl and fill in the form:

  1. Party A and Party B — pick the two sources from the dropdowns. They must be different; each option shows the source name, type, and row count.
  2. Link fields (optional) — a comma-separated list like name, email, phone. Leave it blank to let the engine auto-configure the fields from Party A's schema shape. Either way, only fields present in both sources are used.
  3. Threshold (default 0.85) — the similarity cutoff from 0.0 to 1.0. Higher is stricter: fewer matches, higher confidence. Lower catches more fuzzy matches (typos, nicknames) at the cost of more false positives.
  4. Click Run.

Behind the form, the run generates a fresh per-run key, encodes each side's link fields into CLK bit vectors under it, and compares them. The key is never stored or logged — it lives only for that run.
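
To make the encode-and-compare step concrete, here is a minimal sketch of generic keyed CLK encoding with Dice similarity. It is not the platform's implementation: the filter length, number of hash functions, bigram tokenization, HMAC-SHA256 construction, and the Dice coefficient as the similarity measure are all assumptions chosen for illustration, and the sample records are made up.

```python
import hashlib
import hmac
import secrets

CLK_BITS = 1024   # assumed filter length
NUM_HASHES = 10   # assumed number of bit positions set per bigram


def bigrams(value):
    """Split a normalized, padded string into overlapping 2-character tokens."""
    padded = f"_{value.strip().lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def encode_clk(record, fields, key):
    """Fold every bigram of every link field into one keyed bit vector (an int)."""
    bits = 0
    for field in fields:
        for gram in bigrams(record.get(field, "")):
            for i in range(NUM_HASHES):
                msg = f"{field}|{gram}|{i}".encode()
                digest = hmac.new(key, msg, hashlib.sha256).digest()
                bits |= 1 << (int.from_bytes(digest[:8], "big") % CLK_BITS)
    return bits


def dice(a, b):
    """Dice coefficient of two bit vectors: 2 * |a AND b| / (|a| + |b|)."""
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / total if total else 0.0


key = secrets.token_bytes(32)  # fresh key per run, held only in memory
fields = ["name", "email"]

a = encode_clk({"name": "Jon Smith", "email": "jon.smith@example.com"}, fields, key)
b = encode_clk({"name": "John Smith", "email": "jon.smith@example.com"}, fields, key)
c = encode_clk({"name": "Maria Garcia", "email": "mgarcia@example.org"}, fields, key)

print(f"typo variant: {dice(a, b):.2f}, different person: {dice(a, c):.2f}")
```

Because both sides are encoded under the same key, similar inputs set mostly the same bits, so a one-letter typo still scores close to 1.0, while vectors produced under a different key are not comparable.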

Read the results

The result panel reports:

  • Match count — how many cross-party match clusters were found.
  • Total comparisons — candidate pairs evaluated after blocking.
  • Linked on — the fields the run actually keyed on (or "auto-configured").

Click Download mapping (CSV) to export the matches. The CSV has exactly three columns — cluster_id, party, row — where row is the 0-based position in each source (ordered as ingested). There are no source values in the output: it's a pure index mapping you join back to your own copy of each dataset to materialize the matched records on your side.

cluster_id,party,row
0,party_a,14
0,party_b,203
1,party_a,57
1,party_b,88
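
Joining the mapping back to your own data might look like the sketch below, which uses the sample mapping above. The helper names and the stand-in record list are hypothetical, and it assumes row counts data rows only (the header line is not row 0) and that your local copy is in the same order as the ingested source.

```python
import csv
import io
from collections import defaultdict

# The sample mapping from above; in practice, open the downloaded CSV file.
mapping_csv = """cluster_id,party,row
0,party_a,14
0,party_b,203
1,party_a,57
1,party_b,88
"""


def rows_by_cluster(mapping_file, party):
    """Group one party's 0-based row indices by cluster_id."""
    clusters = defaultdict(list)
    for line in csv.DictReader(mapping_file):
        if line["party"] == party:
            clusters[int(line["cluster_id"])].append(int(line["row"]))
    return dict(clusters)


def select_rows(records, clusters):
    """Look up the local records for each cluster by position."""
    return {cid: [records[i] for i in rows] for cid, rows in clusters.items()}


# Stand-in for your local copy of Party A's dataset, in ingestion order.
party_a_records = [{"id": f"a-{i}"} for i in range(100)]

clusters_a = rows_by_cluster(io.StringIO(mapping_csv), "party_a")
matched_a = select_rows(party_a_records, clusters_a)
print(matched_a)  # {0: [{'id': 'a-14'}], 1: [{'id': 'a-57'}]}
```

Each party runs the same lookup against its own copy, so the matched records never have to leave either side.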

Tuning

  • Too few matches? Lower the threshold (e.g. 0.80) or add more link fields so partial-but-real matches clear the bar.
  • Too many false matches? Raise the threshold or link on more-distinguishing fields (email beats first-name-only).
  • Unsure which fields? Leave the field list blank and let auto-config propose them from the schema, then refine.
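
As a rough illustration of the threshold tradeoff, the sketch below counts how many candidate pairs would clear different cutoffs. The scores are invented for the example, and treating the cutoff as inclusive (score >= threshold) is an assumption.

```python
def count_matches(scores, threshold):
    """Count candidate pairs whose similarity meets the cutoff."""
    return sum(1 for score in scores if score >= threshold)


# Made-up similarity scores for seven candidate pairs.
candidate_scores = [0.97, 0.91, 0.86, 0.83, 0.81, 0.74, 0.52]

for threshold in (0.80, 0.85, 0.90):
    print(threshold, count_matches(candidate_scores, threshold))
# 0.8 5
# 0.85 3
# 0.9 2
```

Dropping from 0.85 to 0.80 admits the two borderline pairs; whether those are real matches or false positives is what a spot check against your own data should tell you.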

Advanced: write matches to the identity graph

The form has a write-edges toggle. When enabled, the run's cross-party matches are also persisted into the identity graph as edges tagged kind='pprl', kept provably separate from the resolve graph.

This is opt-in and gated by a server-side flag (off by default) — if the flag is off, the run still succeeds and the linkage report is unaffected; the edge write is simply skipped. It's an advanced, additive capability; most linkage runs don't need it.

Limits

  • Quota: Free plans get 5 runs/day, Pro 50/day (UTC). Hitting the cap surfaces the upgrade prompt.
  • Rate limit: 30 runs/hour.
  • Mode A only: the platform holds both sources. The fully-federated Mode B — where each party encodes locally and uploads only the encoded vectors — is a documented follow-on, not yet shipped.