Clean room: how it protects your data, and how accurate it is
How the /cleanroom tool matches two lists on encrypted fingerprints, the two privacy tiers (your data never leaves your browser, or hashed-and-discarded on our server), the accuracy you can expect, and the proof it's clean.
The clean room compares two lists for overlap without either side handing over its list. This guide explains the part that matters before you trust it with real data: what the server can and cannot see, how accurate the match is, and what is enforced in code rather than just promised. For the click-by-click walkthrough, see Compare two lists for overlap.
How a clean room works
A value never travels. Its fingerprint does.
When you match a list, each side turns every value (an email, a phone number, a name) into a fixed-size bit pattern called a CLK (a cryptographic long-term key). The same value always produces the same fingerprint, and a similar value produces a similar one, which is what lets the match survive typos and formatting drift. The server compares the fingerprints, measures how much they overlap, and never sees a single raw value.
The recipe is the same on both sides, pinned per room so the fingerprints line up:
Each field is cleaned by type. A phone number collapses to digits, a name loses punctuation and extra spaces, email and generic fields are lowercased and trimmed, and a postal code keeps the first five digits of a US ZIP. Both sides normalize identically, so (212) 555-1212 and 212.555.1212 end up the same before anything is hashed.
The normalized text is chopped into character bigrams (overlapping two-character slices) and each bigram is run through 30 keyed hash functions, setting bits in a 1024-bit Bloom filter. Two values that share most of their bigrams share most of their bits.
The fingerprint is a string of hex. That, your 0-based row numbers, and the column names you mapped are the only things uploaded. No values leave in readable form.
The server measures the bit overlap between every pair of fingerprints (Dice similarity) and reports how many rows are on both lists. It is comparing bit patterns, not data.
Because both parties use the same pinned recipe, identical values produce identical fingerprints and near-identical values produce near-identical ones. That is the whole trick: comparable fingerprints, no comparable data.
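The recipe above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production encoder: the function names are hypothetical, bigrams are taken without padding, and the mapping from each HMAC digest to a bit index (first 8 bytes modulo 1024) is a guess, since the text does not specify it. The 30 hashes, the 1024-bit filter, the hex output, and Dice similarity come from the description above.

```python
import hashlib
import hmac
import re

NUM_HASHES = 30      # keyed hash functions per bigram (from the text)
FILTER_BITS = 1024   # Bloom filter width in bits (from the text)


def normalize(value: str, field_type: str) -> str:
    """Clean a value by field type before hashing."""
    if field_type == "phone":
        return re.sub(r"\D", "", value)              # keep digits only
    if field_type == "name":
        no_punct = re.sub(r"[^\w\s]", "", value)     # strip punctuation
        return " ".join(no_punct.split())            # collapse extra spaces
    if field_type == "postal":
        return re.sub(r"\D", "", value)[:5]          # first five digits of a US ZIP
    return value.strip().lower()                     # email and generic fields


def bigrams(text: str) -> list[str]:
    """Overlapping two-character slices. No padding (assumption)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def encode_clk(value: str, field_type: str, room_key: str) -> str:
    """Return a hex fingerprint for one value."""
    bits = 0
    for gram in bigrams(normalize(value, field_type)):
        for k in range(NUM_HASHES):
            key = f"{room_key}:{k}".encode()
            digest = hmac.new(key, gram.encode(), hashlib.sha256).digest()
            # Assumption: bit index = first 8 digest bytes modulo filter width.
            bits |= 1 << (int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return format(bits, f"0{FILTER_BITS // 4}x")


def dice(hex_a: str, hex_b: str) -> float:
    """Dice similarity of two fingerprints: 2 * shared bits / total set bits."""
    a, b = int(hex_a, 16), int(hex_b, 16)
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / total if total else 0.0


left = encode_clk("(212) 555-1212", "phone", "demo-room-key")
right = encode_clk("212.555.1212", "phone", "demo-room-key")
print(left == right, dice(left, right))
```

Both phone formats normalize to the same digits, so the two fingerprints are identical and their Dice similarity is 1.0. A value with a typo would share most, but not all, of its bigrams and score somewhere below 1.0.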
Two privacy tiers
There are two ways to run a room, and they differ in one thing: whether your raw data ever reaches our server at all.
| What the server can do | Browser ("true clean room") | Server |
|---|---|---|
| Sees your raw rows | Never. Hashing happens in your browser. | For milliseconds, in memory, to compute fingerprints. |
| Stores your raw rows | Never. | Never. Raw is discarded the instant fingerprints are computed; nothing is written to disk. |
| Holds the match key | No. The key is minted in your browser and lives only in your link. | The room key is server-side for this tier. |
| What it keeps | Fingerprint hex, row numbers, column names. | Fingerprint hex, row numbers, column names. |
| Free up to | 10,000 rows per side | 50,000 rows per side |
In the browser tier, your file is parsed and hashed entirely on your device (in
a Web Worker, so the page stays responsive). Only the fingerprint hex, row numbers,
and column names are uploaded. The per-room key is generated in your browser and
carried only in the # fragment of your room link, which browsers never send to a
server. We never hold that key. This is the literal "your raw data never leaves your
device" tier.
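To see why a key carried in the fragment stays off the server, it helps to split a link the way an HTTP client does. The sketch below is illustrative only: the host, the path, and the `k=` fragment layout are hypothetical, and in the real tool the key is minted in the browser rather than in Python.

```python
import secrets
from urllib.parse import urlsplit

# Stand-in for the key the browser would generate for a room.
room_key = secrets.token_urlsafe(32)

# Hypothetical link shape: the key rides after the '#'.
link = f"https://example.invalid/cleanroom/room-123#k={room_key}"

parts = urlsplit(link)

# An HTTP request carries the path and query only; the fragment stays client-side.
request_target = parts.path + (f"?{parts.query}" if parts.query else "")

print("sent to server:", request_target)
print("kept in browser:", parts.fragment[:6] + "...")
```

The request target contains the room path but nothing after the `#`, which is the property the browser tier relies on.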
In the server tier, you upload your file, the server computes fingerprints in the request, and the raw frame is discarded immediately. Raw is never persisted. The promise here is "we never store your data," and it scales to larger lists.
The tradeoff is plain: the browser tier is maximum privacy on smaller lists; the server tier sees your raw rows for milliseconds, then drops them, in exchange for handling bigger lists. Both store the same minimal residue (hex, indices, column names), and both let each side download only its own matched row numbers.
Accuracy
The clean room runs the same matching engine the rest of the product uses. So when we talk about accuracy, we are talking about that engine's measured benchmark, not a clean-room-specific eval we never ran.
On a synthetic-but-realistic multi-source CRM benchmark (467 records, 496 true match pairs across 180 people, with nicknames, typos, maiden and married surnames, work-versus-personal email, and phone-format drift) the matcher the clean room runs on scores F1 0.84. On a cleaner single-source synthetic set (Febrl, 500 records) it scores 0.83. Those are real precision-and-recall numbers against ground truth, not vibes. For context, a bare zero-config matcher over-merges that same messy CRM shape down to F1 0.13 (precision 0.07), which is exactly why every match routes through the curated, column-aware config instead.
The threshold slider on the results is your precision/recall dial. Raise it for fewer, surer matches (higher precision, you keep only near-exact fingerprint overlaps). Lower it for more, looser matches (higher recall, you catch fuzzier pairs at the cost of some false ones). The matched-rows download always follows whatever threshold you have set.
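The effect of the slider can be sketched with a handful of scored pairs. Everything in this example is made up for illustration: the pair IDs, the similarity scores, and the "score at or above threshold counts as a match" rule are assumptions, not output from the tool.

```python
def precision_recall_f1(scored_pairs, true_pairs, threshold):
    """Score a threshold against known true matches."""
    predicted = {pair for pair, score in scored_pairs.items() if score >= threshold}
    true_positives = len(predicted & true_pairs)
    # Convention: with no predictions, precision is reported as 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(true_pairs) if true_pairs else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy similarity scores (invented for this sketch).
scored_pairs = {
    ("a1", "b1"): 1.00,  # exact match
    ("a2", "b2"): 0.93,  # formatting drift
    ("a3", "b3"): 0.78,  # true match with a typo
    ("a4", "b4"): 0.70,  # different people, shared domain
    ("a5", "b5"): 0.66,  # different people, shared domain
}
true_pairs = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

for threshold in (0.90, 0.75, 0.60):
    p, r, f1 = precision_recall_f1(scored_pairs, true_pairs, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data, the highest threshold keeps only sure matches and misses the typo pair, the lowest one catches every true pair but also admits the two false ones, and the middle setting separates them cleanly. Real lists rarely have such a clean gap, which is why the slider is exposed.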
A few things degrade accuracy, and they are worth knowing:
- Typos and missing fields in the matched column. A fingerprint can only match on what is there. Blank or garbled values in the column you match on cost recall. Add a second field (last name alongside email) to recover real matches a single field misses.
- Shared-context inflation. If every value in a column shares a chunk (say every email is on @acme.com), those shared bigrams set the same bits in every fingerprint, so at a low threshold rows over-match on the shared part. Slide the threshold up until the overlap reflects the distinguishing parts of each value.
A tiny worked example. Two lists of five, three of which are truly the same person:
| List A | List B | Note |
|---|---|---|
| jane@acme.com | JANE@ACME.COM | same (casing normalized away) |
| bob@acme.com | bob@acme.com | same |
| carol@acme.com | carol@acme.com | same |
| dave@acme.com | erin@acme.com | no overlap |
| frank@acme.com | grace@acme.com | no overlap |
At a high threshold you get exactly the three real overlaps. The casing difference on Jane does not break the match, because normalization lowercased both sides before fingerprinting. The two non-overlapping rows on each side stay unmatched, even though every address shares @acme.com, as long as the threshold sits above the similarity that the shared domain alone produces, so that only a matching local-part can clear it.
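How much similarity the shared domain contributes can be estimated directly on bigram sets. This is a rough proxy, not the tool's computation: it takes Dice similarity over raw bigram sets rather than over keyed Bloom-filter bits, and the function names are hypothetical. The Bloom-filter score should track it closely, though hash collisions can shift it slightly.

```python
def bigram_set(text: str) -> set[str]:
    """Lowercase, trim, and collect overlapping two-character slices."""
    text = text.strip().lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_dice(a: str, b: str) -> float:
    """Dice similarity over bigram sets: 2 * shared / (size_a + size_b)."""
    set_a, set_b = bigram_set(a), bigram_set(b)
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0


pairs = [
    ("jane@acme.com", "JANE@ACME.COM"),    # same person, casing differs
    ("dave@acme.com", "erin@acme.com"),    # different people, same domain
    ("frank@acme.com", "grace@acme.com"),  # different people, same domain
]
for a, b in pairs:
    print(f"{a:16} {b:16} {bigram_dice(a, b):.2f}")
```

The Jane pair scores 1.00, while the two unrelated pairs still reach about 0.67 and 0.72 purely from the shared @acme.com bigrams. Under this proxy, a threshold of 0.6 would count them as matches and a threshold near 0.9 would not.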
Proof it's clean
The privacy claims are enforced in code, not just stated here.
- No raw value is ever stored. A test, `test_no_raw_pii_persisted`, plants a known raw value in an upload and asserts it appears in none of what we keep: the fingerprints, the column mapping, the quality report, the result pairs, or the sensitivity histogram. If a code change ever let a raw value leak into storage, that test fails the build. You can read it here: test_cleanroom_router.py.
- The browser fingerprint is byte-for-byte identical to the server's. The JavaScript encoder that runs in your browser produces the exact same bytes as the Python one, checked in CI on every change against a committed fixture. The privacy upgrade of the browser tier costs you no accuracy, because it is provably the same computation.
- It is not reversible without the key. The standard attack on fingerprints is to guess plausible values, hash them, and look for matches. That fails here because the 30 hash functions share one per-room secret key, distinguished only by an index: each is `HMAC-SHA256(key="<roomkey>:<k>", msg=bigram)` for `k` from 0 to 29. Without the room secret you cannot reproduce any fingerprint, and in the browser tier the server never holds that secret at all (it lives only in your link fragment).
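The guess-and-hash attack from the last point can be sketched to show where it breaks. This is an illustration under assumptions: the key format follows the `"<roomkey>:<k>"` pattern quoted above, but the digest-to-bit mapping (first 8 bytes modulo 1024) and the helper names are hypothetical.

```python
import hashlib
import hmac

NUM_HASHES = 30     # k runs from 0 to 29 (from the text)
FILTER_BITS = 1024  # filter width in bits (from the text)


def bit_positions(room_key: str, bigram: str) -> set[int]:
    """Bits set by one bigram under the 30 indexed HMAC-SHA256 functions."""
    positions = set()
    for k in range(NUM_HASHES):
        key = f"{room_key}:{k}".encode()
        digest = hmac.new(key, bigram.encode(), hashlib.sha256).digest()
        # Assumption: bit index = first 8 digest bytes modulo filter width.
        positions.add(int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return positions


def fingerprint(room_key: str, value: str) -> frozenset[int]:
    """Union of the bit positions of every bigram in the value."""
    bits: set[int] = set()
    for i in range(len(value) - 1):
        bits |= bit_positions(room_key, value[i:i + 2])
    return frozenset(bits)


stored = fingerprint("real-room-secret", "jane@acme.com")

# The attacker guesses the right value but does not know the room secret.
guess_without_key = fingerprint("attacker-guess", "jane@acme.com")
# With the secret, the same guess reproduces the stored fingerprint exactly.
guess_with_key = fingerprint("real-room-secret", "jane@acme.com")

print(stored == guess_without_key, stored == guess_with_key)
```

Even a correct guess of the underlying value yields an unrelated bit pattern without the secret, so a dictionary of plausible emails is of no use unless the key is also known.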
The same approach extends, in the future, to running the encoder inside your own infrastructure so even the fingerprinting never touches our servers.
Related
- Compare two lists for overlap, the step-by-step walkthrough
- Privacy-preserving record linkage (concept)
- Link two datasets without sharing PII, the enterprise pre-ingested flow