Privacy-preserving record linkage (PPRL)
PPRL matches records across organizations without revealing raw identifiers. This page covers Bloom-filter encoding, k-anonymity and related building blocks, and where each fits.
Privacy-preserving record linkage (PPRL) is the problem of linking records across organizations without either side seeing the other side's raw identifiers. It comes up in healthcare (cross-hospital patient matching), banking (sanctions screening), and any federated-data setting where the underlying data cannot legally cross an organizational boundary.
Bloom-filter encoding
The most widely used technique. Each party converts identifiers (name, DOB, address) into a fixed-length bit array by splitting each value into q-grams and hashing every q-gram with multiple hash functions. A name like "SMITH" becomes a vector of set bits; "SMYTH" produces a similar but not identical vector because the two names share most of their q-grams. Comparing two Bloom filters with the Dice or Jaccard coefficient gives a fuzzy match score without the parties exchanging the original strings.
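The encoding and comparison steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`qgrams`, `encode`, `dice`), the underscore padding, the use of two HMACs combined by double hashing, and the demo key are all assumptions made for the example.

```python
import hashlib
import hmac

def qgrams(value: str, q: int = 2) -> set[str]:
    """Split an upper-cased, padded string into overlapping q-grams."""
    padded = "_" * (q - 1) + value.upper() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def encode(value: str, key: bytes, m: int = 1024, k: int = 10, q: int = 2) -> set[int]:
    """Return the positions of the set bits in an m-bit Bloom filter."""
    bits: set[int] = set()
    for gram in qgrams(value, q):
        # Two keyed hashes per q-gram (assumed scheme for this sketch).
        h1 = int.from_bytes(hmac.new(key, gram.encode(), hashlib.sha256).digest(), "big")
        h2 = int.from_bytes(hmac.new(key, gram.encode(), hashlib.sha512).digest(), "big")
        # Double hashing: the i-th position is (h1 + i * h2) mod m.
        for i in range(k):
            bits.add((h1 + i * h2) % m)
    return bits

def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two filters given as sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

key = b"demo-key-not-for-production"  # hypothetical shared secret
smith = encode("SMITH", key)
smyth = encode("SMYTH", key)
jones = encode("JONES", key)

print(f"SMITH vs SMYTH: {dice(smith, smyth):.2f}")
print(f"SMITH vs JONES: {dice(smith, jones):.2f}")
```

"SMITH" and "SMYTH" share four of their six padded bigrams, so their score should come out well above that of "SMITH" and "JONES", which share none and overlap only through chance bit collisions.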
Tunable parameters:
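- Filter length: typically 512 to 2048 bits. Longer filters reduce bit collisions between unrelated q-grams, and with them false-positive matches, at the cost of storage and comparison time.
- Number of hash functions: together with the filter length and the number of q-grams per record, determines the fill rate of the bit array.
- Q-gram size: 2 or 3 characters is typical. Bigrams tolerate more typos; trigrams are stricter and produce fewer spurious matches.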
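How the three parameters interact can be estimated with the standard Bloom-filter fill formula: after inserting n q-grams with k hash functions each into m bits, the expected fraction of set bits is 1 - (1 - 1/m)^(k·n). The sketch below assumes hash outputs are uniform and independent; the helper names are hypothetical, and the 50% fill target in the second function is a common rule of thumb rather than something this page prescribes.

```python
import math

def expected_fill(m: int, k: int, n_grams: int) -> float:
    """Expected fraction of set bits after inserting n_grams q-grams, k hashes each."""
    return 1.0 - (1.0 - 1.0 / m) ** (k * n_grams)

def hashes_for_half_fill(m: int, n_grams: int) -> int:
    """Number of hash functions that brings the expected fill close to 50%."""
    return max(1, round(m / n_grams * math.log(2)))

# Example: a record that yields about 20 bigrams in total.
for m in (512, 1024, 2048):
    print(m, round(expected_fill(m, k=10, n_grams=20), 3), hashes_for_half_fill(m, 20))
```

For a fixed number of hash functions, doubling the filter length roughly halves the fill rate while the filter is sparse, which is why length and hash count are usually tuned together.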
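Schnell et al. (2009) introduced Bloom-filter encoding for record linkage. Their later "cryptographic long-term key" (CLK) variant (Schnell et al., 2011), which hashes several identifiers into a single filter, is still the reference implementation most production systems are measured against.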
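A CLK hashes the q-grams of several fields into one shared bit array, so a single comparison covers the whole record. The following is a rough sketch of that idea only. The per-field key derivation, the field names, the HMAC-SHA256 double hashing, and the helper names are assumptions of this example, not details taken from the published scheme or from any particular product.

```python
import hashlib
import hmac

def bigrams(value: str) -> set[str]:
    """Overlapping 2-grams of an upper-cased value padded with underscores."""
    padded = f"_{value.upper()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def clk(record: dict[str, str], key: bytes, m: int = 1024, k: int = 10) -> int:
    """Encode every field of a record into one m-bit filter, returned as an int bitmask."""
    mask = 0
    for field, value in sorted(record.items()):
        # Assumption: derive a per-field key so the same bigram in two
        # different fields maps to different bit positions.
        field_key = hmac.new(key, field.encode(), hashlib.sha256).digest()
        for gram in bigrams(value):
            digest = hmac.new(field_key, gram.encode(), hashlib.sha256).digest()
            h1 = int.from_bytes(digest[:16], "big")
            h2 = int.from_bytes(digest[16:], "big")
            for i in range(k):
                mask |= 1 << ((h1 + i * h2) % m)
    return mask

def popcount(x: int) -> int:
    return bin(x).count("1")

def dice(a: int, b: int) -> float:
    """Dice coefficient of two filters stored as int bitmasks."""
    total = popcount(a) + popcount(b)
    return 2 * popcount(a & b) / total if total else 1.0

key = b"demo-key-not-for-production"  # hypothetical shared secret
rec_a = {"first": "JOHN", "last": "SMITH", "dob": "1980-02-14"}
rec_b = {"first": "JOHN", "last": "SMYTH", "dob": "1980-02-14"}  # typo in surname
rec_c = {"first": "MARY", "last": "JONES", "dob": "1975-11-30"}

print(f"a vs b: {dice(clk(rec_a, key), clk(rec_b, key)):.2f}")
print(f"a vs c: {dice(clk(rec_a, key), clk(rec_c, key)):.2f}")
```

Because all fields land in the same filter, a typo in one field lowers the score only slightly, while an unrelated record scores far lower.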
Other building blocks
- K-anonymity: bucket records so every quasi-identifier combination appears in at least k records before sharing. Limits re-identification risk on the output side.
- Secure multi-party computation: heavier and slower than Bloom filters, but allows exact-match computation with cryptographic guarantees. Used where filter-based approximations are not acceptable.
- Differential privacy: adds calibrated noise to aggregate statistics. Not a record-linkage primitive itself, but often layered on the output of a PPRL pipeline to bound information leakage from query answers.
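As an illustration of the k-anonymity check described above, the sketch below counts how often each quasi-identifier combination occurs and reports the combinations that fall below k. The column names, the sample rows, and the function name are hypothetical; a real pipeline would then generalize or suppress the offending rows before sharing.

```python
from collections import Counter

def k_anonymity_violations(rows, quasi_identifiers, k):
    """Return the quasi-identifier combinations that appear in fewer than k rows."""
    counts = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return sorted(combo for combo, n in counts.items() if n < k)

rows = [
    {"zip3": "021", "birth_decade": "1980s", "sex": "F"},
    {"zip3": "021", "birth_decade": "1980s", "sex": "F"},
    {"zip3": "021", "birth_decade": "1980s", "sex": "F"},
    {"zip3": "946", "birth_decade": "1950s", "sex": "M"},  # unique combination
]

violations = k_anonymity_violations(rows, ["zip3", "birth_decade", "sex"], k=3)
print(violations)
```

An empty result means every combination is shared by at least k rows. In this sample the single 946 / 1950s / M row is reported because it appears only once.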
Tradeoffs
Bloom-filter PPRL is fast and scales to millions of records, but it is vulnerable to frequency attacks if filter parameters are chosen poorly: identical values always produce identical bit patterns, so an attacker can align the most common patterns with the most common names. Both the attacks and their mitigations are well documented in the literature. The safest production deployments use salted CLK with regularly rotated keys and capped query rates. Poor parameter choices do not degrade match quality, so nothing visibly fails; they quietly weaken the protection and create a false sense of privacy.
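The frequency leak and the effect of salting can be demonstrated with a toy example. This is a sketch under stated assumptions: the encoder is a simplified single-field filter, the salt is taken from a birth-year value that both parties are assumed to hold for the same person, and the key and sample data are made up.

```python
import hashlib
import hmac
from collections import Counter

def encode(value: str, key: bytes, salt: str = "", m: int = 1024, k: int = 10) -> frozenset[int]:
    """Bloom-filter bit positions for the bigrams of value, optionally salted."""
    padded = f"_{value.upper()}_"
    bits: set[int] = set()
    for i in range(len(padded) - 1):
        # The salt is mixed into every hashed bigram (assumed construction).
        msg = (salt + "|" + padded[i:i + 2]).encode()
        digest = hmac.new(key, msg, hashlib.sha256).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big")
        bits.update((h1 + j * h2) % m for j in range(k))
    return frozenset(bits)

key = b"demo-key-not-for-production"  # hypothetical shared secret
people = [("SMITH", "1980"), ("SMITH", "1991"), ("SMITH", "1975"),
          ("NOWAK", "1980"), ("OKAFOR", "1962")]

# Without a salt, the three SMITH rows collapse into one repeated encoding.
unsalted = Counter(encode(name, key) for name, _ in people)
# With a record-specific salt, equal names no longer share an encoding.
salted = Counter(encode(name, key, salt=year) for name, year in people)

print("most frequent unsalted encoding:", max(unsalted.values()))
print("most frequent salted encoding:", max(salted.values()))
```

The tradeoff is that the salt must be recorded identically by both parties: if the birth year itself is mistyped on one side, the two encodings of the same person will barely overlap.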
In Golden Suite
PPRL is live. The CLK / Bloom-filter encoder is wired into a two-source linkage you can run from the /pprl page or the PPRL Linkage API: pick two ingested sources, pick (or auto-configure) the link fields, and get back a PII-free match report listing cluster membership by party and row index. The encoding key is generated fresh per run and never persisted.
Phase 1 ships Mode A, in which the platform holds both sources and runs the linkage server-side. Mode B, in which each party encodes locally and uploads only the encoded vectors so the platform never sees raw values, is a documented follow-on. For cross-org linkage under HIPAA or GDPR constraints, or to discuss Mode B, reach out via the enterprise intake form.
- Try it: Link two datasets without sharing PII
- API: PPRL Linkage
Related
- Record linkage (theory hub)
- Similarity metrics for record matching
- The Fellegi-Sunter model
- The identity graph