Privacy-preserving record linkage (PPRL)

PPRL matches records across organizations without revealing raw identifiers. Bloom-filter encoding, k-anonymity, and where each fits.

Privacy-preserving record linkage (PPRL) is the problem of linking records across organizations without either side seeing the other side's raw identifiers. It comes up in healthcare (cross-hospital patient matching), banking (sanctions screening), and any federated-data setting where the underlying data cannot legally cross an organizational boundary.

Bloom-filter encoding

Bloom-filter encoding is the most widely used technique. Each party splits its identifiers (name, DOB, address) into q-grams and hashes each q-gram with several hash functions, keyed with a secret the linking parties share, to set bits in a fixed-length bit array. A name like "SMITH" becomes a vector of set bits; "SMYTH" shares most of its q-grams and so produces a similar but not identical vector. Comparing two Bloom filters with the Dice or Jaccard coefficient gives a fuzzy match score without either party exchanging the original strings.
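
The encoding and comparison steps can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder: the function names, the padding character, the default parameters, and the choice to derive two base hashes from a single HMAC-SHA256 digest (a double-hashing scheme) are all assumptions made for the example.

```python
import hashlib
import hmac

def qgrams(value: str, q: int = 2) -> set:
    """Split a string into padded q-grams, e.g. SMITH -> _S, SM, MI, IT, TH, H_."""
    padded = "_" * (q - 1) + value.upper() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def encode(value: str, key: bytes, m: int = 1024, k: int = 10, q: int = 2) -> set:
    """Return the set of bit positions (0..m-1) that the value switches on."""
    positions = set()
    for gram in qgrams(value, q):
        digest = hmac.new(key, gram.encode("utf-8"), hashlib.sha256).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big")
        # Double hashing: simulate k hash functions from two base hashes.
        for i in range(k):
            positions.add((h1 + i * h2) % m)
    return positions

def dice(a: set, b: set) -> float:
    """Dice coefficient over set bit positions: 2|A and B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

key = b"shared-secret"  # hypothetical key agreed between the parties
smith, smyth, jones = (encode(n, key) for n in ("SMITH", "SMYTH", "JONES"))
print("SMITH vs SMYTH:", round(dice(smith, smyth), 3))
print("SMITH vs JONES:", round(dice(smith, jones), 3))
```

SMITH and SMYTH share four of their six padded bigrams, so their score should land well above the SMITH/JONES score, which reflects only chance bit collisions.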

Tunable parameters:

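  • Filter length: typically 512 to 2048 bits. Longer filters reduce bit collisions between unrelated q-grams, and with them false-positive matches.
  • Number of hash functions: together with the filter length and the number of q-grams per record, controls the fill rate (the fraction of bits set) of the bit array.
  • Q-gram size: 2 or 3 characters is typical. Bigrams tolerate more typos; trigrams are more discriminating.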
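
How these parameters interact can be estimated before encoding any data. The sketch below uses the standard Bloom-filter approximation for the expected fraction of set bits, assuming the hash functions behave uniformly and independently; the function name and the example values are illustrative only.

```python
def expected_fill_rate(m: int, k: int, n_qgrams: int) -> float:
    """Expected fraction of set bits after inserting n_qgrams q-grams
    with k hash functions into an m-bit filter (uniform-hash assumption)."""
    return 1.0 - (1.0 - 1.0 / m) ** (k * n_qgrams)

# A single short name (6 bigrams) versus a multi-field record (about 60 q-grams).
for n in (6, 60):
    for m in (512, 1024, 2048):
        print(f"m={m:4d} k=10 n={n:2d} fill={expected_fill_rate(m, 10, n):.3f}")
```

A nearly empty filter and a nearly full one both make poor encodings: the first exposes individual q-grams more directly, and the second makes unrelated records look alike.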

Schnell et al. (2009) introduced Bloom-filter encoding for record linkage. Their 2011 follow-up added the "cryptographic long-term key" (CLK) variant, which encodes several identifiers into a single filter and is still the baseline most production systems are measured against.

Other building blocks

  • K-anonymity: bucket records so every quasi-identifier combination appears in at least k records before sharing. Limits re-identification risk on the output side.
  • Secure multi-party computation: heavier and slower than Bloom filters, but allows exact-match computation with cryptographic guarantees. Used where filter-based approximations are not acceptable.
  • Differential privacy: adds calibrated noise to aggregate statistics. Not a record-linkage primitive itself, but often layered on the output of a PPRL pipeline to bound information leakage from query answers.
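
As a concrete instance of the k-anonymity item above, the check before sharing can be sketched as a group count over the quasi-identifier columns. The field names and sample records are hypothetical, and a real pipeline would generalize or suppress the offending groups rather than only report them.

```python
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k):
    """Return quasi-identifier combinations that appear in fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical linked output, already generalized to ZIP prefix and birth decade.
linked = [
    {"zip3": "941", "birth_decade": 1980, "sex": "F"},
    {"zip3": "941", "birth_decade": 1980, "sex": "F"},
    {"zip3": "941", "birth_decade": 1980, "sex": "F"},
    {"zip3": "100", "birth_decade": 1950, "sex": "M"},
]

violations = k_anonymity_violations(linked, ["zip3", "birth_decade", "sex"], k=3)
print(violations)  # groups that need more generalization or suppression
```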

Tradeoffs

Bloom-filter PPRL is fast and scales to millions of records, but if filter parameters are chosen poorly it is vulnerable to frequency attacks, in which an adversary aligns the most common bit patterns with the most common names or q-grams from a public reference list. The literature documents both the attacks and the mitigations. The safest production deployments use salted CLK with regularly rotated keys and capped query rates. Poor parameter choices do not visibly degrade matching: linkage keeps working while the encodings leak more than intended, which creates a false sense of privacy.
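
Why salting blunts a frequency attack can be shown with a small experiment. The sketch below is an illustration under stated assumptions, not a hardened scheme: the encoder, the key, and the use of birth year as a record-level salt are all hypothetical choices for the example.

```python
import hashlib
import hmac
from collections import Counter

def encode(value: str, key: bytes, salt: str = "", m: int = 1024, k: int = 10) -> frozenset:
    """Bigram Bloom encoding; the optional salt is mixed into every hashed q-gram."""
    padded = f"_{value.upper()}_"
    bits = set()
    for i in range(len(padded) - 1):
        gram = padded[i:i + 2]
        digest = hmac.new(key, (salt + "|" + gram).encode("utf-8"), hashlib.sha256).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big")
        bits.update((h1 + j * h2) % m for j in range(k))
    return frozenset(bits)

key = b"shared-secret"  # hypothetical; rotating it changes every encoding
records = [("SMITH", "1980"), ("SMITH", "1975"), ("SMITH", "1990"), ("JONES", "1980")]

# Unsalted: every SMITH yields the same filter, so name frequencies stay visible.
unsalted = Counter(encode(name, key) for name, _ in records)
# Salted with birth year: identical names no longer share an encoding.
salted = Counter(encode(name, key, salt=year) for name, year in records)

print("most repeated unsalted encoding:", max(unsalted.values()))
print("most repeated salted encoding:", max(salted.values()))
```

The cost of salting is that the salt attribute must agree exactly on both sides: two records for the same person with different recorded birth years will no longer match on the salted field.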

In Golden Suite

PPRL is on the Enterprise-tier roadmap. The core matching engine already separates "compare two encoded vectors" from "compare two raw strings," so the encoding swap is well-scoped. If your use case involves cross-org linkage under HIPAA or GDPR constraints, reach out via the enterprise intake form.