Privacy-preserving record linkage (PPRL)

PPRL matches records across organizations without revealing raw identifiers. This page covers Bloom-filter encoding, k-anonymity, and where each fits.

Privacy-preserving record linkage (PPRL) is the problem of linking records across organizations without either side seeing the other side's raw identifiers. It comes up in healthcare (cross-hospital patient matching), banking (sanctions screening), and any federated-data setting where the underlying data cannot legally cross an organizational boundary.

Bloom-filter encoding

Bloom-filter encoding is the most widely used PPRL technique. Each party converts identifiers (name, date of birth, address) into a fixed-length bit array by splitting each value into q-grams (overlapping substrings of length q) and hashing every q-gram with multiple hash functions. A name like "SMITH" becomes a vector of set bits; "SMYTH" produces a similar but not identical vector, because the two names share most of their q-grams. Comparing two Bloom filters with the Dice or Jaccard coefficient gives a fuzzy match score without the parties exchanging the original strings.
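The encoding and comparison described above can be sketched in a few lines. This is a minimal, unkeyed illustration and not a production encoder: the function names, the "_" padding character, the choice of SHA-256 and BLAKE2b as base hashes, and the defaults m=1024, k=10 are all assumptions made for the example.

```python
import hashlib

def qgrams(text, q=2):
    """Return the set of overlapping q-grams of a padded, upper-cased string."""
    padded = "_" * (q - 1) + text.upper() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def bloom_encode(text, m=1024, k=10, q=2):
    """Return the set of bit positions that `text` sets in an m-bit filter."""
    positions = set()
    for gram in qgrams(text, q):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hashlib.sha256(data).digest(), "big")
        h2 = int.from_bytes(hashlib.blake2b(data).digest(), "big")
        # Double hashing: derive k positions from two base hashes.
        for i in range(k):
            positions.add((h1 + i * h2) % m)
    return positions

def dice(a, b):
    """Dice coefficient of two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

smith, smyth, jones = (bloom_encode(n) for n in ("SMITH", "SMYTH", "JONES"))
# SMITH and SMYTH share four of six bigrams, so they score far higher
# than SMITH and JONES, which share none.
print(dice(smith, smyth), dice(smith, jones))
```

Representing the filter as a set of positions keeps the sketch short; a real encoder would typically store a packed bit array and use keyed hashes.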

Tunable parameters:

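Filter length: typically 512 to 2048 bits. Longer filters reduce bit collisions between unrelated q-grams, and with them false-positive matches.
Number of hash functions: together with the filter length and the number of q-grams per record, this determines the fill rate (the fraction of bits set) of the bit array.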
Q-gram size: 2 or 3 characters is typical. Bigrams catch more typos; trigrams are more precise.
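To see how these parameters interact, the standard Bloom-filter approximation for the expected fill rate can be computed directly. The sketch below assumes independent, uniformly distributed hash positions; the function name and the example figure of 30 q-grams per record are illustrative assumptions.

```python
def expected_fill(m, k, n_qgrams):
    """Expected fraction of set bits after hashing n_qgrams q-grams
    with k hash functions into an m-bit filter."""
    return 1.0 - (1.0 - 1.0 / m) ** (k * n_qgrams)

# Same record (about 30 q-grams, 10 hash functions) at three filter lengths.
for m in (512, 1024, 2048):
    print(m, round(expected_fill(m, k=10, n_qgrams=30), 3))
```

Doubling the filter length lowers the fill rate for a fixed record, while adding hash functions or encoding more q-grams raises it.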

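Schnell et al. (2009) introduced Bloom-filter encoding for record linkage; in follow-up work the same group proposed the "cryptographic long-term key" (CLK) variant, which hashes several identifier fields into a single keyed filter. CLK is still the reference point most production systems are measured against.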
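As commonly described, a CLK hashes the q-grams of several identifier fields into one filter using keyed hash functions, so that only holders of the shared secret can produce comparable encodings. The following is a rough sketch of that idea, not the published reference encoder: the function names, the HMAC-SHA-256 / HMAC-SHA-512 pair, the field-name prefix on each q-gram, and the parameter defaults are assumptions for illustration.

```python
import hashlib
import hmac

def qgrams(text, q=2):
    """Return the set of overlapping q-grams of a padded, upper-cased string."""
    padded = "_" * (q - 1) + text.upper() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def clk_encode(record, secret, m=1024, k=10, q=2):
    """Hash every field of `record` (a dict of strings) into one m-bit filter."""
    positions = set()
    for field, value in sorted(record.items()):
        for gram in qgrams(value, q):
            # Prefixing the field name keeps "12" in a date apart from "12" in an address.
            msg = f"{field}:{gram}".encode("utf-8")
            h1 = int.from_bytes(hmac.new(secret, msg, hashlib.sha256).digest(), "big")
            h2 = int.from_bytes(hmac.new(secret, msg, hashlib.sha512).digest(), "big")
            for i in range(k):
                positions.add((h1 + i * h2) % m)
    return frozenset(positions)

def dice(a, b):
    """Dice coefficient of two sets of bit positions."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

key = b"shared-secret"  # hypothetical key agreed between the two parties
a = clk_encode({"name": "SMITH", "dob": "1980-01-02"}, key)
b = clk_encode({"name": "SMYTH", "dob": "1980-01-02"}, key)
c = clk_encode({"name": "GARCIA", "dob": "1957-11-30"}, key)
print(dice(a, b), dice(a, c))
```

Because the hashes are keyed, the same record encoded under a different secret yields an unrelated filter, which is what makes key rotation meaningful.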

Other building blocks

K-anonymity: bucket records so every quasi-identifier combination appears in at least k records before sharing. Limits re-identification risk on the output side.
Secure multi-party computation: heavier and slower than Bloom filters, but allows exact-match computation with cryptographic guarantees. Used where filter-based approximations are not acceptable.
Differential privacy: adds calibrated noise to aggregate statistics. Not a record-linkage primitive itself, but often layered on the output of a PPRL pipeline to bound information leakage from query answers.
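Two of these building blocks are small enough to sketch. The first function below lists the quasi-identifier combinations that fall short of k records; the second adds Laplace noise to a count, the textbook mechanism for a counting query with sensitivity 1. All names, the sample rows, and the epsilon value are illustrative assumptions.

```python
import random
from collections import Counter

def violating_groups(rows, quasi_identifiers, k):
    """Return quasi-identifier combinations that appear in fewer than k rows."""
    counts = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}

def laplace_count(true_count, epsilon, rng):
    """Add Laplace(0, 1/epsilon) noise to a count (sensitivity assumed to be 1)."""
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

rows = [
    {"zip": "021**", "birth_year": 1980},
    {"zip": "021**", "birth_year": 1980},
    {"zip": "021**", "birth_year": 1975},
]
print(violating_groups(rows, ["zip", "birth_year"], k=2))
print(laplace_count(42, epsilon=1.0, rng=random.Random()))
```

Groups reported by the first function would need further generalization or suppression before sharing; a smaller epsilon in the second gives stronger privacy at the cost of noisier answers.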

Tradeoffs

Bloom-filter PPRL is fast and scales to millions of records, but it is vulnerable to frequency attacks, which align the most common bit patterns with the most common values in a public list such as surnames, especially if filter parameters are chosen poorly. The literature documents both the attacks and their mitigations. The safest production deployments use salted CLK with regularly rotated keys and capped query rates. Poor parameter choices rarely break matching, so nothing visibly fails: the linkage keeps working while the encodings leak more than intended, which creates a false sense of privacy.
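The frequency leak is easy to reproduce on toy data: with a single shared key, identical values always produce identical filters, so the frequency distribution of the plaintext survives encoding. The sketch below also shows one form of salting, mixing a record-specific value into the key. The encoder, the sample records, and the use of birth year as the salt are assumptions for illustration; a salt only works if both parties hold exactly the same value for the same person.

```python
import hashlib
import hmac
from collections import Counter

def encode(name, secret, m=256, k=4):
    """Keyed bigram Bloom-filter encoding, returned as a frozenset of positions."""
    padded = f"_{name.upper()}_"
    positions = set()
    for i in range(len(padded) - 1):
        gram = padded[i:i + 2].encode("utf-8")
        h1 = int.from_bytes(hmac.new(secret, gram, hashlib.sha256).digest(), "big")
        h2 = int.from_bytes(hmac.new(secret, gram, hashlib.sha512).digest(), "big")
        for j in range(k):
            positions.add((h1 + j * h2) % m)
    return frozenset(positions)

KEY = b"shared-secret"  # hypothetical shared key
records = [("SMITH", 1950), ("SMITH", 1961), ("SMITH", 1972), ("SMITH", 1983),
           ("SMITH", 1994), ("JONES", 1958), ("JONES", 1969), ("JONES", 1990),
           ("NGUYEN", 1977)]

# Unsalted: the most frequent filter is the most frequent surname.
unsalted = Counter(encode(name, KEY) for name, _ in records)
# Salted: the birth year is mixed into the key, so equal names no longer collide.
salted = Counter(encode(name, KEY + str(year).encode()) for name, year in records)

print(sorted(unsalted.values(), reverse=True))
print(sorted(salted.values(), reverse=True))
```

Salting trades some error tolerance for privacy: a typo in the salt attribute changes the whole encoding, so the salt field needs to be reliable.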

In Golden Suite

PPRL is live. The CLK / Bloom-filter encoder is wired into a two-source linkage you can run from the /pprl page or the PPRL Linkage API: pick two ingested sources, pick (or auto-configure) the link fields, and get back a PII-free match report. The report lists cluster membership by party and row index; the encoding key is generated fresh per run and never persisted.

Phase 1 ships Mode A (the platform holds both sources and runs the linkage server-side). In Mode B, each party encodes locally and uploads only the encoded vectors, so the platform never sees raw values; Mode B is a documented follow-on. For cross-org linkage under HIPAA or GDPR constraints, or to discuss Mode B, reach out via the enterprise intake form.

Try it: Link two datasets without sharing PII
API: PPRL Linkage
