How to match two lists without either side seeing the other (privacy-preserving record linkage)
A technical look at how the /cleanroom tool finds the overlap between two lists using Bloom-filter fingerprints (CLKs), why the server can't reverse them, and how the browser tier keeps raw data on your device.
A bank and a fintech partner both have a list of customers. They suspect a lot of overlap, the same people on both sides, and they want to know how much. The catch is the obvious one: neither is allowed to send the other its customer list. Not a sample, not a hashed dump, nothing. Legal will not sign off, and they are right not to. So how do you count the people on both lists when you are not allowed to compare the lists?
That question is the whole field of privacy-preserving record linkage (PPRL), and it is exactly what the clean room does. Two parties upload their own files, the tool reports how many records they share, and at no point does either side learn anything about the other's rows beyond the count and which of its own rows matched. This post is the technical walkthrough of how that actually works, why the naive versions fail, and the part most explanations skip: why the server cannot reverse what it stores.
Why the obvious answers don't work
Start with the answers an engineer reaches for first, in order, and watch each one die.
Just exchange the lists. No. That is the thing you are not allowed to do. Next.
Exchange hashes instead of raw values. Each side runs SHA256(email) and sends the digests; equal digests mean equal emails, and you never sent a plaintext email. This feels safe and is not. Emails, phone numbers, and names are low-entropy. There are only so many plausible emails, and an attacker who receives your hashes can simply hash every address in a leaked breach corpus (or every firstname.lastname@bigco.com combination) and look for matching digests. This is a dictionary attack, and against a hash of jane.smith@acme.com it takes seconds. Unsalted hashing of PII is reversible in practice. It is not encryption, it is an index.
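To make the attack concrete, here is a minimal sketch in Python. The tiny wordlist and the function name `dictionary_attack` are illustrative assumptions standing in for a real breach corpus; nothing here comes from the clean room's code.

```python
import hashlib

def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def dictionary_attack(received_digests, candidates):
    """Map each received digest back to a plaintext guess, where a guess hits."""
    lookup = {sha256_hex(c): c for c in candidates}
    return {d: lookup[d] for d in received_digests if d in lookup}

# What the "safe" exchange sends: digests only, no plaintext.
received = [
    sha256_hex("jane.smith@acme.com"),
    sha256_hex("someone.rare@example.org"),
]

# Hypothetical attacker wordlist: every firstname.lastname@acme.com combination.
firsts = ["jane", "john", "maria"]
lasts = ["smith", "jones", "garcia"]
candidates = [f"{f}.{l}@acme.com" for f in firsts for l in lasts]

recovered = dictionary_attack(received, candidates)
print(list(recovered.values()))
```

The guessable address falls out immediately; only the address the wordlist did not cover stays hidden, and a bigger wordlist shrinks that set.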
Salt the hashes. Add a shared secret salt before hashing, so SHA256(salt + email). Now the attacker needs the salt, which helps. But you have two new problems. First, both sides need the same salt, so you are back to a key-exchange problem and the salt becomes the thing worth stealing. Second, and fatally for our use case, salted hashing is still exact-only. jane.smith@acme.com and Jane.Smith@Acme.com produce completely different digests. (212) 555-1212 and 212.555.1212 produce completely different digests. Real lists are full of that drift, and exact hashing throws away every fuzzy match. You would report an overlap far smaller than the truth and call it privacy.
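A short sketch of the exact-only problem, assuming a made-up salt value and a hypothetical helper named `salted_digest`:

```python
import hashlib

def salted_digest(salt: str, value: str) -> str:
    # SHA256(salt + value), as described above.
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

salt = "shared-secret-salt"  # hypothetical; both sides would need this exact value

# Same person, different formatting on each side.
email_match = salted_digest(salt, "jane.smith@acme.com") == salted_digest(salt, "Jane.Smith@Acme.com")
phone_match = salted_digest(salt, "(212) 555-1212") == salted_digest(salt, "212.555.1212")

print(email_match, phone_match)
```

Both comparisons come back False, so both real matches are lost.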
So the requirements that survive contact with reality are: the comparison must tolerate typos and formatting differences (fuzzy, not exact), and it must resist the guess-and-hash dictionary attack (keyed, not just hashed). That combination is what a CLK gives you.
CLKs: a fingerprint that is fuzzy and keyed
A CLK (cryptographic long-term key) is a Bloom filter, a fixed-length array of bits, that encodes a value in a way that is both fuzzy-comparable and keyed. The clean room builds one per row. Here is the recipe it uses, and these are the real numbers from the running code, not illustrative ones.
First, normalize by field type, identically on both sides. A phone number collapses to its digits, a name loses punctuation and extra whitespace, email and generic fields are lowercased and trimmed, and a postal code keeps the first five ZIP digits. This is where the formatting drift dies: (212) 555-1212 and 212.555.1212 both become 2125551212 before anything is hashed, so they produce identical fingerprints.
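The normalization rules can be sketched as below. This is an assumption-laden illustration, not the clean room's code: the function name `normalize`, the field-type labels, and the exact punctuation set are all hypothetical.

```python
import re
import string

def normalize(value: str, field_type: str) -> str:
    """Hypothetical per-field normalizer following the rules described above."""
    if field_type == "phone":
        # Keep digits only.
        return re.sub(r"\D", "", value)
    if field_type == "name":
        # Strip ASCII punctuation, then collapse runs of whitespace.
        no_punct = value.translate(str.maketrans("", "", string.punctuation))
        return " ".join(no_punct.split())
    if field_type == "postal":
        # Keep the first five digits of a ZIP or ZIP+4.
        return re.sub(r"\D", "", value)[:5]
    # Email and generic fields: lowercase, then trim.
    return value.lower().strip()

print(normalize("(212) 555-1212", "phone"))  # 2125551212
print(normalize("212.555.1212", "phone"))    # 2125551212
print(normalize("10001-1234", "postal"))     # 10001
```

Whatever the real rules are, the property that matters is that both parties apply the same ones, since any difference changes the bigrams and therefore the bits.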
Then break the normalized text into character bigrams, overlapping two-character slices. The string smith becomes sm, mi, it, th. Bigrams are the trick that makes the fingerprint fuzzy: two strings that are similar share most of their bigrams, so they will set most of the same bits, even if they are not identical. smith and smyth share sm and th; a typo changes a couple of bigrams, not the whole match.
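A minimal sketch of the bigram step (the helper name `bigrams` is an assumption):

```python
def bigrams(text: str) -> list[str]:
    """Overlapping two-character slices, in order of appearance."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(bigrams("smith"))  # ['sm', 'mi', 'it', 'th']
print(bigrams("smyth"))  # ['sm', 'my', 'yt', 'th']

shared = set(bigrams("smith")) & set(bigrams("smyth"))
print(sorted(shared))    # ['sm', 'th']
```

One substituted letter in the middle of a word touches the two bigrams that contain it and leaves the rest alone.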
Then set bits. For each bigram, run it through 30 keyed hash functions and set the bit each one points at, in a 1024-bit filter. A similar value sets a mostly-overlapping set of bits.
# the CLK recipe, per value (1024-bit filter, bigrams, 30 keyed hashes)
import hashlib
import hmac

def clk_from_text(text, room_key, filter_size=1024, k=30):
    bits = bytearray(filter_size // 8)
    prepared = text.lower().strip()  # lowercase, then trim
    for i in range(len(prepared) - 1):
        bigram = prepared[i:i+2]
        for j in range(k):
            # ALL 30 functions share the ONE room key, indexed by j.
            key = f"{room_key}:{j}".encode()
            h = hmac.new(key, bigram.encode(), hashlib.sha256).digest()
            bit_pos = int.from_bytes(h, "big") % filter_size
            bits[bit_pos >> 3] |= 1 << (bit_pos & 7)  # LSB-first within the byte
    return bits
That f"{room_key}:{j}" line is the part to read twice, because it is the part everyone gets wrong when they describe this. The 30 hash functions are not 30 independent keys. They share the one per-room secret key and are distinguished only by an index suffix, :0 through :29. The index is just a cheap way to get 30 different-but-deterministic hash functions out of a single HMAC primitive and a single secret. The non-reversibility, which we will get to, comes entirely from the room secret, not from there being 30 of them.
To compare two fingerprints, the server measures their bit overlap with the Dice coefficient: twice the shared set bits over the total set bits on both sides. Identical values score 1.0. A typo'd surname scores high but not perfect. Two unrelated values score low, because they overlap only where bits happen to collide. Run that over every pair and you have recovered the fuzzy matches that exact hashing threw away, while the server has only ever touched bit vectors.
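The Dice comparison itself is small. The sketch below works on raw byte strings; the function names and the choice to return 0.0 for two empty filters are assumptions for illustration.

```python
def popcount(bits: bytes) -> int:
    """Number of set bits in a byte string."""
    return sum(bin(b).count("1") for b in bits)

def dice(a: bytes, b: bytes) -> float:
    """Dice coefficient: 2 * shared set bits / (set bits in a + set bits in b)."""
    if len(a) != len(b):
        raise ValueError("filters must be the same length")
    shared = sum(bin(x & y).count("1") for x, y in zip(a, b))
    total = popcount(a) + popcount(b)
    return 2 * shared / total if total else 0.0

# Toy one-byte "filters": four bits set on each side, two of them in common.
a = bytes([0b00001111])
b = bytes([0b00111100])
print(dice(a, b))  # 0.5
print(dice(a, a))  # 1.0
```

Because the score only needs bitwise AND and bit counts, the server can compute it without knowing what any bit stands for.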
Both parties pin the same recipe, the same filter size, bigram size, hash count, and the same room key, so their fingerprints are directly comparable. That pinning is what makes the whole thing work: same recipe in, comparable fingerprints out.
The browser tier: distribute the hashing, broker the match blind
There are two ways to run a room, and the interesting one moves the hashing off our server entirely.
In the browser tier, the encoder runs on your device. You pick your CSV, and a Web Worker in your browser parses it and computes the CLKs locally (off the main thread, so the page stays responsive). What gets uploaded is only the fingerprint hex, the 0-based row numbers, and the column names you mapped. Your raw values never enter a network request. They go to worker.postMessage and nowhere else.
The room key, the secret that makes the fingerprints unreproducible, is generated in your browser and carried only in the URL fragment, the part after the # in your room link. Browsers never transmit the fragment to the server; it is a client-side-only piece of the URL by spec. So when you share /cleanroom/<id>#k=<key> with the other party, they get the key, the server does not. On the server side the key column is literally null.
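To see which part of a room link a server would be asked for, the link can be split with Python's standard URL parser. The host, room id, and key below are made up, and the `k=` parameter name is taken from the link shape shown above.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical room link.
link = "https://example.com/cleanroom/abc123#k=s3cr3t-room-key"

parts = urlsplit(link)

# What an HTTP client puts on the request line: path plus query, never the fragment.
request_target = parts.path + (f"?{parts.query}" if parts.query else "")

# What stays client-side: the fragment, parsed here the way page script might read it.
room_key = parse_qs(parts.fragment)["k"][0]

print(request_target)  # /cleanroom/abc123
print(room_key)        # s3cr3t-room-key
```

One caveat if you adapt this: `parse_qs` decodes `+` as a space, so a key encoded with standard base64 would need URL-safe encoding first.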
Put those together and the server's role is to broker a match between two sets of fingerprints it cannot read, using a key it does not have. It sees hex, row indices, and column names. That is the entire residue. The promise is not "we delete your data quickly"; it is "your data was never here."
This is not a claim you have to take on faith. There is a test, test_no_raw_pii_persisted, that plants a known raw value in an upload and asserts it appears in none of what the server stores: not the fingerprints, not the column mapping, not the quality report, not the result pairs, not the sensitivity histogram. If a future change ever let a raw value leak into storage, that test fails and the build is red.
The server tier exists too, for larger lists. There, you do upload your file, the server computes the fingerprints in the request, and the raw frame is discarded the instant it is done, never written to disk. That is a real and weaker promise, "we never store your data" rather than "we never see it," and it is the honest tradeoff for scale. The browser tier is the one to reach for when "never leaves my perimeter" is the requirement.
The part that makes it credible: byte-parity
Here is the engineering problem that sinks most "we moved the crypto to the browser" stories. If the JavaScript encoder in the browser produces even slightly different fingerprints than the Python one on the server, the two parties' CLKs will not line up and every match silently breaks. You would have a privacy feature that quietly returns the wrong overlap, which is worse than no feature.
So the JS encoder is byte-for-byte identical to the Python one, and that is gated in CI. There is a committed fixture of known inputs and their expected CLK bytes, and both the TypeScript implementation and the Python one are checked against it on every change. Same bigram order, same BigInt modulo (not floating-point), same least-significant-bit-first bit packing, same lowercase-before-trim. If either drifts, the parity test fails before the change can merge.
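Two of those parity points are easy to get wrong, and a small Python sketch shows why. The key and bigram are made up, and the float conversion is only a stand-in for what happens when a 256-bit digest is squeezed into a double-precision number instead of an arbitrary-precision integer.

```python
import hashlib
import hmac

def set_bit_lsb_first(bits: bytearray, pos: int) -> None:
    """Bit 0 is the least significant bit of byte 0; bit 8 is the LSB of byte 1."""
    bits[pos >> 3] |= 1 << (pos & 7)

packed = bytearray(2)
set_bit_lsb_first(packed, 0)  # byte 0 becomes 0b00000001
set_bit_lsb_first(packed, 9)  # byte 1 becomes 0b00000010
print(packed.hex())  # 0102

# Hypothetical key and bigram, following the key format "<roomkey>:<index>".
digest = hmac.new(b"demo-room-key:0", b"sm", hashlib.sha256).digest()
n = int.from_bytes(digest, "big")

exact_pos = n % 1024              # arbitrary-precision integer modulo
float_pos = int(float(n)) % 1024  # a double keeps ~53 significant bits, so the low bits are gone

print(exact_pos, float_pos)
```

With a 1024-bit filter the position is the low 10 bits of the digest, which are exactly the bits a double discards, so the float path lands on position 0 whatever the digest was. A committed fixture catches this kind of drift on the first comparison.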
That is the whole point of measuring it: the privacy upgrade of the browser tier costs you zero accuracy, because it is provably the same computation, just run somewhere the server can't see. You are not trading correctness for privacy. You are getting the same match, computed on your own machine.
Accuracy, and the dial you actually turn
The clean room runs the same matching engine the rest of the product runs. When I quote accuracy, I am quoting that engine's benchmark, not a clean-room-specific eval, because we do not run a separate one and I am not going to invent a number.
That engine scores F1 0.84 on a synthetic-but-realistic multi-source CRM benchmark: 467 records, 496 true match pairs across 180 people, with the full mess of real data, nicknames, typos, maiden and married surnames, work-versus-personal email, phone-format drift. On a cleaner single-source synthetic set (Febrl, 500 records) it scores 0.83. Those are precision-and-recall numbers against ground truth. For contrast, a bare zero-config matcher over-merges that same messy shape down to F1 0.13 at precision 0.07, which is why the real config is curated and column-aware rather than "just dedupe it."
On the results screen, the lever you turn is the threshold slider, and it is the classic precision-recall dial. The fingerprints are compared by Dice similarity; the threshold is the cutoff. Slide it up and you keep only near-exact overlaps: fewer matches, higher confidence, higher precision. Slide it down and you accept fuzzier pairs: more matches, higher recall, more false positives. There is no universally correct setting; it depends on whether your cost is a missed match or a wrong one.
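The effect of the slider can be sketched with a handful of scored pairs. The scores and labels below are invented toy values for illustration, not benchmark output, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(scored_pairs, threshold):
    """scored_pairs: (dice_score, is_true_match) tuples. Returns (precision, recall)."""
    predicted = [is_match for score, is_match in scored_pairs if score >= threshold]
    true_positives = sum(predicted)
    total_true = sum(is_match for _, is_match in scored_pairs)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / total_true if total_true else 1.0
    return precision, recall

# Invented toy data: (Dice score, whether the pair is really the same person).
pairs = [
    (1.00, True), (0.92, True), (0.81, True), (0.74, False),
    (0.66, True), (0.58, False), (0.41, False),
]

print(precision_recall(pairs, 0.9))  # (1.0, 0.5)
print(precision_recall(pairs, 0.6))  # (0.8, 1.0)
```

At 0.9 every reported pair is right but half the true matches are missed; at 0.6 all true matches are found and one false pair comes along with them.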
One honest gotcha worth knowing, because it shows up the first time you run real data. If every value in a column shares a chunk, say every email is on @acme.com, then every fingerprint sets the same bits for that shared ac, cm, me run of bigrams. At a low threshold, rows over-match on that shared context, and your overlap looks inflated. The fix is not magic: slide the threshold up until the match reflects the distinguishing part of each value (the local-part of the email) rather than the part everyone shares. It is the same lesson as in any matching: shared context is weak evidence, and the dial lets you discount it.
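The shared-domain effect can be reproduced with a compact version of the recipe. This sketch assumes a made-up room key and two invented addresses; it compares the same two people once with the shared domain attached and once on the local-part alone.

```python
import hashlib
import hmac

def clk(text, room_key, filter_size=1024, k=30):
    """Compact CLK encoder following the recipe above."""
    bits = bytearray(filter_size // 8)
    prepared = text.lower().strip()
    for i in range(len(prepared) - 1):
        bigram = prepared[i:i + 2].encode()
        for j in range(k):
            h = hmac.new(f"{room_key}:{j}".encode(), bigram, hashlib.sha256).digest()
            pos = int.from_bytes(h, "big") % filter_size
            bits[pos >> 3] |= 1 << (pos & 7)
    return bytes(bits)

def dice(a, b):
    shared = sum(bin(x & y).count("1") for x, y in zip(a, b))
    total = sum(bin(x).count("1") for x in a) + sum(bin(y).count("1") for y in b)
    return 2 * shared / total if total else 0.0

key = "demo-room-key"  # hypothetical

# Two different people at the same company.
with_domain = dice(clk("jane.smith@acme.com", key), clk("bob.jones@acme.com", key))
local_only = dice(clk("jane.smith", key), clk("bob.jones", key))

print(round(with_domain, 2), round(local_only, 2))
```

The pair scores noticeably higher with the domain attached, purely because of the bigrams everyone in the column shares. That gap is the inflation the threshold has to be set above.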
Why the server can't reverse it
Come back to the dictionary attack, because resisting it is the entire claim. An attacker who got hold of the stored fingerprints would want to do what worked against plain hashes: take a big list of plausible values, encode each one, and look for fingerprints that match. With a CLK, that attack needs to reproduce the encoding, and reproducing the encoding needs the room key, because every one of the 30 hash functions is HMAC-SHA256(key="<roomkey>:<j>", msg=bigram), with the index j running from 0 to 29. Change the key and every bit position changes. Without the room secret, the attacker cannot generate a single correct fingerprint to compare against, so the stored hex is, to them, noise.
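A minimal sketch of why a guessed key gets the attacker nowhere, using a compact version of the recipe. Both keys and the address are made up for illustration.

```python
import hashlib
import hmac

def clk(text, room_key, filter_size=1024, k=30):
    """Compact CLK encoder following the recipe above."""
    bits = bytearray(filter_size // 8)
    prepared = text.lower().strip()
    for i in range(len(prepared) - 1):
        bigram = prepared[i:i + 2].encode()
        for j in range(k):
            h = hmac.new(f"{room_key}:{j}".encode(), bigram, hashlib.sha256).digest()
            pos = int.from_bytes(h, "big") % filter_size
            bits[pos >> 3] |= 1 << (pos & 7)
    return bytes(bits)

def dice(a, b):
    shared = sum(bin(x & y).count("1") for x, y in zip(a, b))
    total = sum(bin(x).count("1") for x in a) + sum(bin(y).count("1") for y in b)
    return 2 * shared / total if total else 0.0

value = "jane.smith@acme.com"
stored = clk(value, "real-room-key")        # what the server holds (hypothetical key)
guess = clk(value, "attacker-guessed-key")  # right value, wrong key

print(dice(stored, clk(value, "real-room-key")))  # 1.0, only with the real key
print(round(dice(stored, guess), 2))              # well below 1.0 despite the correct value
```

Even with the exact plaintext in hand, the wrong key yields a fingerprint that overlaps the stored one only by chance, so there is nothing to confirm a guess against.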
In the browser tier the server never holds that secret at all. It lives only in the URL fragment, on the two parties' devices. So even a full compromise of the server's stored data yields hex CLKs, row numbers, and column names, with no key to turn any of it back into values. The thing that defeats the attack is the secret the server chose never to hold.
Where this is headed (roadmap, not shipped)
Everything above this section is live today. Everything in it is direction, explicitly not a current capability. Treat it as where the architecture points, not what it does now.
The same encode-locally, match-blind pattern naturally extends past a hosted tool. The next step we are aiming at is a deployable engine you run inside your own VPC, a container or CLI that would do the fingerprinting entirely behind your own perimeter, so even the encoder never runs on infrastructure you do not control. Further out, we think PPRL deserves to be a category, not a feature bolted onto a matcher: the same keyed-fuzzy-fingerprint machinery should generalize from "how much do two lists overlap" toward comparing anonymized graph topology, asking whether two parties' relationship structures line up without either revealing the graph.
None of that is built yet. To be unambiguous: there is no VPC container or CLI today, and the clean room matches CLK list-overlap only, not graph structure. The road from "match two lists blind" to "compare two networks blind" looks like a straight one, and it is the one we intend to walk. But today the product is the list-overlap clean room described above, and nothing more.
Try it
You can run a real two-party match right now, no signup.
- Open the clean room and pick the true clean room (browser) tier. Your raw data never leaves your device; free up to 10,000 rows per side.
- Read the privacy model and proof in How it protects your data, and how accurate it is.
- Need this inside your own perimeter, or at enterprise scale? Talk to us.