PPRL Linkage
REST endpoints for privacy-preserving record linkage (PPRL). Link two sources on encoded identifiers without exposing raw PII — ephemeral per-run keys, PII-free match reports.
Endpoints for privacy-preserving record linkage — matching records across two sources on encoded identifiers (cryptographic long-term keys / Bloom filters) instead of raw values. Use it when the two datasets cannot share plaintext PII but you still need to know which records refer to the same entity.
This is the API behind the /pprl linkage page. For the theory, see Privacy-preserving record linkage.
How it works (Mode A)
Phase 1 ships Mode A: the platform already holds both sources (you ingested them via the Sources API), so it runs the whole linkage server-side:
- Load both sources from `golden_store.raw_records`.
- Resolve the link fields (either the ones you pass, or auto-configured from the schema shape), then intersect them with the columns both sources actually have.
- Generate an ephemeral per-run HMAC key (`secrets.token_hex(32)`). This key is never persisted and never logged; it exists only for the duration of the run.
- Encode each side's link fields into CLK bit vectors under that key and compare them (Jaccard over the bits).
- Return a PII-free match report: cluster membership by `party` + row index. Raw field values never appear in the response or in storage.
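The encode-and-compare steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the platform's implementation: the bigram tokenization, the 1024-bit vector length, the four hash positions per token, and the helper names `bigrams`, `encode_clk`, and `jaccard` are all hypothetical. Only the ephemeral key from `secrets.token_hex(32)`, the HMAC keying, and the Jaccard comparison come from the description.

```python
import hashlib
import hmac
import secrets


def bigrams(value: str) -> list:
    """Split a normalized value into padded character bigrams (assumed tokenization)."""
    value = value.strip().lower()
    if not value:
        return []
    padded = f"_{value}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def encode_clk(record: dict, fields: list, key: bytes,
               n_bits: int = 1024, n_hashes: int = 4) -> int:
    """Encode the chosen fields of one record into a CLK, held as an int bitmask."""
    clk = 0
    for field in fields:
        for gram in bigrams(record.get(field, "")):
            for i in range(n_hashes):
                # Keyed hash: without the run key the bit positions cannot be reproduced.
                msg = f"{field}|{i}|{gram}".encode()
                digest = hmac.new(key, msg, hashlib.sha256).digest()
                clk |= 1 << (int.from_bytes(digest[:8], "big") % n_bits)
    return clk


def jaccard(a: int, b: int) -> float:
    """Jaccard similarity over set bits: |A AND B| / |A OR B|."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


# Ephemeral per-run key: generated here, never written anywhere.
run_key = bytes.fromhex(secrets.token_hex(32))
fields = ["name", "email"]

a = encode_clk({"name": "Ada Lovelace", "email": "ada@example.com"}, fields, run_key)
b = encode_clk({"name": "ada lovelace ", "email": "ada@example.com"}, fields, run_key)
c = encode_clk({"name": "Grace Hopper", "email": "grace@example.org"}, fields, run_key)

print(jaccard(a, b))          # 1.0: the two records are identical after normalization
print(jaccard(a, c) >= 0.85)  # whether an unrelated pair clears the default threshold
```

Because both sides are encoded under the same per-run key, their bit vectors are comparable with each other but not with vectors from any other run.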
Note: Mode B, where each party computes its own CLKs locally and uploads only the encoded `PartyData` so the platform never sees either side's raw values, is a documented follow-on and is not implemented in Phase 1.
Base URL
https://backend-production-5c86.up.railway.app
All endpoints require a Clerk JWT Bearer token ([AUTH]). See Authentication.
POST /api/pprl/run [AUTH] [RATE-LIMITED 30/hr]
Run a two-party CLK linkage between two sources and return a PII-free match summary. The run is also persisted (see GET /runs).
Request
{
  "source_a_id": "0b4f...",
  "source_b_id": "9c12...",
  "fields": ["name", "email", "phone"],
  "threshold": 0.85,
  "write_edges": false
}
| Field | Type | Required | Description |
|---|---|---|---|
| `source_a_id` | string | Yes | UUID of the first source (party A) |
| `source_b_id` | string | Yes | UUID of the second source (party B) |
| `fields` | `string[] \| null` | No | Columns to link on. Omit / null → auto-configured from party A's schema. Either way, fields are intersected with both sources' columns. |
| `threshold` | number | No | Similarity cutoff, 0.0–1.0 (default 0.85). Higher = stricter (fewer, more-confident matches). |
| `write_edges` | boolean | No | When true, persist cross-party matches as graph edges. Opt-in and gated; see the note below (default false). |
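As a rough sketch of how the request fields fit together, the Python below builds (but does not send) the `POST /api/pprl/run` request using only the standard library. The helper name `build_run_request` and the placeholder token and ids are hypothetical; the URL, body fields, and defaults are the ones documented above.

```python
import json
import urllib.request

BASE_URL = "https://backend-production-5c86.up.railway.app"


def build_run_request(token, source_a_id, source_b_id, fields=None,
                      threshold=0.85, write_edges=False):
    """Build the POST /api/pprl/run request without sending it."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    payload = {
        "source_a_id": source_a_id,
        "source_b_id": source_b_id,
        "fields": fields,            # None serializes to null: auto-configured
        "threshold": threshold,
        "write_edges": write_edges,
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/pprl/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_run_request("<clerk-jwt>", "<source-a-uuid>", "<source-b-uuid>")
print(req.get_method(), req.full_url)
# To execute it: urllib.request.urlopen(req), then json.load() the response body.
```

Validating `threshold` client-side is a design choice in this sketch; how the server responds to an out-of-range value is not documented here.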
Response 200 OK
{
  "run_id": "5f8c...",
  "match_count": 182,
  "total_comparisons": 240,
  "wrote_edges": false,
  "fields": ["name", "email", "phone"],
  "summary": {
    "match_count": 182,
    "total_comparisons": 240,
    "matches": [
      {
        "cluster_id": 0,
        "members": [
          { "party": "party_a", "row": 14 },
          { "party": "party_b", "row": 203 }
        ]
      }
    ]
  }
}
| Field | Type | Description |
|---|---|---|
| `run_id` | string | UUID of the persisted run; use it with GET /runs/{id} and /download |
| `match_count` | integer | Number of cross-party match clusters found |
| `total_comparisons` | integer | Candidate pairs evaluated after blocking |
| `wrote_edges` | boolean | Whether graph edges were actually written (see note) |
| `fields` | string[] | The fields the linkage actually keyed on (after auto-config + intersection) |
| `summary.matches` | array | Per-cluster membership, identified only by party (`party_a` / `party_b`) and 0-based row index; no raw values |
Note: `match_count` and `row` indices are the only identifying output. The `row` is the 0-based position in each source's `raw_records` (ordered by `row_number`), so you can join back to your own copy of the data, but the linkage never returns the underlying field values.
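Joining the report back to local data might look like the sketch below. It assumes you hold each source as a Python list in the same order the platform ingested it (that is, ordered by `row_number`); the helper name `matched_pairs` and the sample records are hypothetical.

```python
def matched_pairs(summary, rows_a, rows_b):
    """Yield (cluster_id, record_from_a, record_from_b) for each cross-party pairing."""
    local = {"party_a": rows_a, "party_b": rows_b}
    for cluster in summary["matches"]:
        by_party = {"party_a": [], "party_b": []}
        for member in cluster["members"]:
            # "row" is a 0-based index into the caller's own copy of that source.
            by_party[member["party"]].append(local[member["party"]][member["row"]])
        for rec_a in by_party["party_a"]:
            for rec_b in by_party["party_b"]:
                yield cluster["cluster_id"], rec_a, rec_b


rows_a = [{"name": "Ada Lovelace"}, {"name": "Grace Hopper"}]
rows_b = [{"name": "G. Hopper"}, {"name": "A. Lovelace"}]
summary = {
    "matches": [
        {"cluster_id": 0, "members": [
            {"party": "party_a", "row": 0},
            {"party": "party_b", "row": 1},
        ]},
    ]
}

print(list(matched_pairs(summary, rows_a, rows_b)))
# [(0, {'name': 'Ada Lovelace'}, {'name': 'A. Lovelace'})]
```

If the local copy is sorted or filtered differently from the ingested source, the indices will point at the wrong records, so it is worth keeping the original row order alongside any working copy.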
Errors
| Status | Body | Meaning |
|---|---|---|
| 400 Bad Request | `{detail: "..."}` | `PprlInputError`: a source is empty, or the two sources share no usable link field. The detail string is human-readable. |
| 402 Payment Required | `{detail: {error: "quota_exceeded", gate: "pprl_runs_per_day", limit, current}}` | Hit the plan's daily run cap (Free 5/day, Pro 50/day) |
| 429 Too Many Requests | - | Exceeded 30/hour rate limit |
| 500 Internal Server Error | `{detail: "..."}` | Unexpected failure (logged + audited) |
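One way a caller might branch on these statuses is sketched below. The action labels (`fix_request`, `upgrade_or_wait`, `back_off`, `report`) and the function name `classify_error` are hypothetical conventions for this example, not part of the API.

```python
from typing import Optional, Tuple


def classify_error(status: int, body: Optional[dict]) -> Tuple[str, str]:
    """Map a non-200 response from POST /api/pprl/run to (action, message)."""
    detail = (body or {}).get("detail")
    if status == 400:
        # PprlInputError: the detail string is human-readable, so show it as-is.
        return "fix_request", str(detail)
    if status == 402:
        return "upgrade_or_wait", "daily run quota reached"
    if status == 429:
        return "back_off", "hourly rate limit reached (30/hour)"
    if status >= 500:
        return "report", str(detail)
    return "unknown", str(detail)


print(classify_error(400, {"detail": "No shared link fields"}))
# ('fix_request', 'No shared link fields')
print(classify_error(429, None))
# ('back_off', 'hourly rate limit reached (30/hour)')
```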
Tip: For org callers, every run emits a `pprl.run` row in the `audit_log` (`pprl.run.failed` on error). The audit record carries `match_count`, `fields`, and the source ids, never raw data or the HMAC key.
Edge writes (write_edges)
write_edges: true asks the linkage to persist its cross-party matches into the identity graph as entity_edges rows tagged kind='pprl' (see The identity graph). This is double-gated:
- It only happens when you pass `write_edges: true`, and
- the server-side `IDENTITY_GRAPH_EDGES_ENABLED` flag is on (off by default).
When the flag is off, the request succeeds and the linkage report is unaffected — wrote_edges simply comes back false and the skip is logged. Edge writing is fail-soft: any error sets wrote_edges: false and never fails the linkage report. Match clusters larger than PPRL_MAX_CLUSTER_SIZE (default 20) are skipped for edge translation.
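The gating and fail-soft behavior described here can be modeled roughly as follows. This is a sketch of the documented semantics only, not the server code: `try_write_edges` and its `writer` callable are hypothetical, and how the real flag and `PPRL_MAX_CLUSTER_SIZE` are read is not covered.

```python
def try_write_edges(clusters, requested, flag_enabled, writer, max_cluster_size=20):
    """Return the value reported as wrote_edges under the double gate."""
    if not (requested and flag_enabled):
        return False  # gate closed: the linkage report is unaffected
    try:
        # Clusters above the size cap are skipped for edge translation.
        writer([c for c in clusters if len(c["members"]) <= max_cluster_size])
        return True
    except Exception:
        return False  # fail-soft: an edge-write error never fails the report


written = []
pair = {"cluster_id": 0, "members": [{"party": "party_a", "row": 14},
                                     {"party": "party_b", "row": 203}]}
oversized = {"cluster_id": 1,
             "members": [{"party": "party_a", "row": i} for i in range(21)]}

print(try_write_edges([pair, oversized], True, False, written.append))  # False
print(try_write_edges([pair, oversized], True, True, written.append))   # True
print(len(written[0]))                                                   # 1
```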
GET /runs [AUTH]
List the caller's PPRL runs, newest first.
Query parameters
| Param | Type | Default | Description |
|---|---|---|---|
| `limit` | integer | 50 | Page size, 1–200 |
| `offset` | integer | 0 | Rows to skip |
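A paging loop over `limit` and `offset` could be sketched as below. The `fetch_page` callable is a hypothetical stand-in for whatever performs the authenticated GET and returns the parsed JSON body with its `runs` array; here it is faked with an in-memory list so the sketch runs without a network call.

```python
def iter_runs(fetch_page, limit=50):
    """Yield every run, requesting pages until one comes back short."""
    if not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")
    offset = 0
    while True:
        runs = fetch_page(limit=limit, offset=offset)["runs"]
        yield from runs
        if len(runs) < limit:
            return  # a short (or empty) page means there is nothing left
        offset += limit


# Hypothetical stand-in for the HTTP call: 120 runs held in memory.
fake_store = [{"id": f"run-{i}"} for i in range(120)]


def fake_fetch(limit, offset):
    return {"runs": fake_store[offset:offset + limit]}


print(len(list(iter_runs(fake_fetch, limit=50))))  # 120
```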
Response 200 OK
{
  "runs": [
    {
      "id": "5f8c...",
      "org_id": "0123...",
      "created_by": "user_2abc...",
      "source_a_id": "0b4f...",
      "source_b_id": "9c12...",
      "fields": ["name", "email", "phone"],
      "threshold": 0.85,
      "status": "completed",
      "match_count": 182,
      "total_comparisons": 240,
      "summary": { "match_count": 182, "total_comparisons": 240, "matches": [] },
      "matches": [],
      "wrote_edges": false,
      "error": null,
      "created_at": "2026-06-13T05:00:00Z"
    }
  ]
}
Visibility mirrors Destinations: a run is visible when created_by is you or its org_id matches your org. Solo users (no org) only ever see their own runs.
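The visibility rule reads as a small predicate. The sketch below is an interpretation of the sentence above rather than the server's actual check, and the function name `is_visible` is hypothetical; the one subtle point is that a caller with no org must not match runs whose `org_id` is also empty.

```python
from typing import Optional


def is_visible(run: dict, user_id: str, org_id: Optional[str]) -> bool:
    """True when the caller created the run or shares its org."""
    if run.get("created_by") == user_id:
        return True
    # Solo users (org_id is None) never match on org, even against org-less runs.
    return org_id is not None and run.get("org_id") == org_id


run = {"id": "5f8c...", "created_by": "user_2abc...", "org_id": "0123..."}
print(is_visible(run, "user_2abc...", None))     # True: the caller's own run
print(is_visible(run, "user_other", "0123..."))  # True: same org
print(is_visible(run, "user_other", None))       # False: solo user, someone else's run
```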
GET /runs/{id} [AUTH]
Fetch a single run by id. Same shape as one element of the runs array above.
Errors
| Status | Meaning |
|---|---|
| 404 Not Found | Run doesn't exist, or isn't visible to the caller |
GET /runs/{id}/download [AUTH]
Download the run's match clusters as a CSV. One row per cluster member.
Response 200 OK — text/csv with Content-Disposition: attachment; filename="pprl_run_{id}.csv":
cluster_id,party,row
0,party_a,14
0,party_b,203
1,party_a,57
1,party_b,88
The CSV is the PII-free mapping — party + 0-based row index only. Join it back to your own copy of each source to materialize the matched records on your side.
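A minimal sketch of reading the mapping with Python's standard `csv` module follows, grouping row indices by cluster and party. The helper name `clusters_from_csv` is hypothetical; the column names and sample rows are the ones shown above.

```python
import csv
import io
from collections import defaultdict


def clusters_from_csv(text):
    """Group the mapping into {cluster_id: {party: [row indices]}}."""
    clusters = defaultdict(lambda: {"party_a": [], "party_b": []})
    for row in csv.DictReader(io.StringIO(text)):
        clusters[int(row["cluster_id"])][row["party"]].append(int(row["row"]))
    return dict(clusters)


sample = """cluster_id,party,row
0,party_a,14
0,party_b,203
1,party_a,57
1,party_b,88
"""

print(clusters_from_csv(sample))
# {0: {'party_a': [14], 'party_b': [203]}, 1: {'party_a': [57], 'party_b': [88]}}
```

Each index list can then be used to look up records in your own copy of the corresponding source.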
Errors
| Status | Meaning |
|---|---|
| 404 Not Found | Run doesn't exist, or isn't visible to the caller |
Quotas
PPRL runs are metered per day (UTC) against the org's plan:
| Plan | Runs / day |
|---|---|
| Free | 5 |
| Pro | 50 |
Hitting the cap returns 402 with {gate: "pprl_runs_per_day", limit, current}. The frontend client maps this to the standard quota-exceeded window event, so the upgrade modal fires automatically.
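For callers not using the frontend client, the 402 body can be parsed into a small structure, as in the sketch below. The `QuotaExceeded` dataclass and `parse_quota_error` helper are hypothetical names; the `error`, `gate`, `limit`, and `current` keys are the ones documented for this response.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaExceeded:
    gate: str
    limit: int
    current: int


def parse_quota_error(status: int, body: Optional[dict]) -> Optional[QuotaExceeded]:
    """Return a QuotaExceeded for a 402 quota response, else None."""
    detail = (body or {}).get("detail")
    if status != 402 or not isinstance(detail, dict):
        return None
    if detail.get("error") != "quota_exceeded":
        return None
    return QuotaExceeded(detail["gate"], int(detail["limit"]), int(detail["current"]))


quota = parse_quota_error(402, {"detail": {
    "error": "quota_exceeded", "gate": "pprl_runs_per_day", "limit": 5, "current": 5,
}})
print(quota)
# QuotaExceeded(gate='pprl_runs_per_day', limit=5, current=5)
```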
TypeScript client
The repo ships a thin typed client at frontend/lib/pprl.ts:
import { runLinkage, downloadPprlMapping, PprlInputError } from '@/lib/pprl'

try {
  const result = await runLinkage(token, {
    source_a_id: sourceA,
    source_b_id: sourceB,
    // fields omitted → auto-configured from schema shape
    threshold: 0.85,
  })
  console.log(`${result.match_count} matches on ${result.fields.join(', ')}`)
  // Pull the PII-free cluster -> (party, row) mapping as a CSV download
  await downloadPprlMapping(token, result.run_id)
} catch (e) {
  if (e instanceof PprlInputError) {
    // 400: surface e.message (e.g. "No shared link fields ...")
  } else {
    throw e // anything else is unexpected; don't swallow it
  }
}
runLinkage handles the 402 → quota-exceeded dispatch automatically, so <QuotaExceededModal /> lights up without per-callsite wiring.