PPRL Linkage

REST endpoints for privacy-preserving record linkage (PPRL). Link two sources on encoded identifiers without exposing raw PII — ephemeral per-run keys, PII-free match reports.

Endpoints for privacy-preserving record linkage — matching records across two sources on encoded identifiers (cryptographic long-term keys / Bloom filters) instead of raw values. Use it when the two datasets cannot share plaintext PII but you still need to know which records refer to the same entity.

This is the API behind the /pprl linkage page. For the theory, see Privacy-preserving record linkage.

How it works (Mode A)

Phase 1 ships Mode A: the platform already holds both sources (you ingested them via the Sources API), so it runs the whole linkage server-side:

1. Load both sources from golden_store.raw_records.
2. Resolve the link fields (either the ones you pass, or auto-configured from the schema shape), then intersect them with the columns both sources actually have.
3. Generate an ephemeral per-run HMAC key (secrets.token_hex(32)). This key is never persisted and never logged; it exists only for the duration of the run.
4. Encode each side's link fields into CLK bit vectors under that key and compare them (Jaccard over the bits).
5. Return a PII-free match report: cluster membership by party + row index. Raw field values never appear in the response or in storage.
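
The steps above can be sketched in a few lines of TypeScript. This is a minimal illustration and not the platform's implementation: the bigram tokenization, the 1024-bit length, the four bit positions per bigram, and the helper names (`bigrams`, `encodeClk`, `jaccard`) are all assumptions chosen for readability. `randomBytes(32)` stands in for the server-side `secrets.token_hex(32)`.

```typescript
import { createHmac, randomBytes } from "node:crypto";

const CLK_BITS = 1024;      // assumed CLK length in bits
const HASHES_PER_GRAM = 4;  // assumed bit positions set per bigram

// Split a value into padded, lowercased character bigrams ("ann" -> "_a", "an", "nn", "n_").
function bigrams(value: string): string[] {
  const text = value.trim().toLowerCase();
  if (text === "") return [];
  const padded = `_${text}_`;
  const grams: string[] = [];
  for (let i = 0; i < padded.length - 1; i++) grams.push(padded.slice(i, i + 2));
  return grams;
}

// Encode the link fields of one record as the set of bit positions that are on in its CLK.
function encodeClk(
  record: Record<string, string | undefined>,
  fields: string[],
  key: Buffer,
): Set<number> {
  const bits = new Set<number>();
  for (const field of fields) {
    for (const gram of bigrams(record[field] ?? "")) {
      for (let i = 0; i < HASHES_PER_GRAM; i++) {
        // Keyed hash: without the per-run key the bit positions cannot be recomputed.
        const digest = createHmac("sha256", key).update(`${field}:${i}:${gram}`).digest();
        bits.add(digest.readUInt32BE(0) % CLK_BITS);
      }
    }
  }
  return bits;
}

// Jaccard similarity over the set bits: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<number>, b: Set<number>): number {
  let shared = 0;
  a.forEach((bit) => { if (b.has(bit)) shared++; });
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// One key per run, held only in memory and discarded afterwards.
const runKey = randomBytes(32);
const left = encodeClk({ name: "Ann Lee", email: "ann@example.com" }, ["name", "email"], runKey);
const right = encodeClk({ name: "Anne Lee", email: "ann@example.com" }, ["name", "email"], runKey);
const isMatch = jaccard(left, right) >= 0.85; // 0.85 is the documented default threshold
```

Because both sides are encoded under the same key, similar values light up mostly the same bits, while an observer without the key cannot map bits back to bigrams by recomputing the hashes.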

Note: Mode B — where each party computes its own CLKs locally and uploads only the encoded PartyData, so the platform never sees either side's raw values — is a documented follow-on and is not implemented in Phase 1.

Base URL

https://backend-production-5c86.up.railway.app

All endpoints require a Clerk JWT Bearer token ([AUTH]). See Authentication.

`POST /api/pprl/run` [AUTH] [RATE-LIMITED 30/hr]

Run a two-party CLK linkage between two sources and return a PII-free match summary. The run is also persisted (see GET /runs).

Request

{
  "source_a_id": "0b4f...",
  "source_b_id": "9c12...",
  "fields": ["name", "email", "phone"],
  "threshold": 0.85,
  "write_edges": false
}

Field	Type	Required	Description
`source_a_id`	string	Yes	UUID of the first source (party A)
`source_b_id`	string	Yes	UUID of the second source (party B)
`fields`	string[] | null	No	Columns to link on. Omit / `null` → auto-configured from party A's schema. Either way, fields are intersected with both sources' columns.
`threshold`	number	No	Similarity cutoff, `0.0`–`1.0` (default `0.85`). Higher = stricter (fewer, more-confident matches).
`write_edges`	boolean	No	When `true`, persist cross-party matches as graph edges. Opt-in and gated — see the note below (default `false`).
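
For callers not using the bundled client, the request can be assembled by hand. The sketch below is one possible shape: `buildRunRequest` and `PprlRunRequest` are hypothetical names, the client-side threshold check is an added convenience rather than documented server behavior, and the defaults simply mirror the table above.

```typescript
// Body of POST /api/pprl/run, following the request table.
interface PprlRunRequest {
  source_a_id: string;
  source_b_id: string;
  fields?: string[] | null;
  threshold?: number;
  write_edges?: boolean;
}

const BASE_URL = "https://backend-production-5c86.up.railway.app";

// Hypothetical helper: fills in the documented defaults and returns fetch() arguments.
function buildRunRequest(token: string, body: PprlRunRequest) {
  const threshold = body.threshold ?? 0.85;
  if (!(threshold >= 0 && threshold <= 1)) {
    throw new RangeError(`threshold must be between 0.0 and 1.0, got ${threshold}`);
  }
  const headers: Record<string, string> = {
    Authorization: `Bearer ${token}`, // Clerk JWT
    "Content-Type": "application/json",
  };
  return {
    url: `${BASE_URL}/api/pprl/run`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({
        ...body,
        fields: body.fields ?? null, // null lets the server auto-configure
        threshold,
        write_edges: body.write_edges ?? false,
      }),
    },
  };
}

// Usage: const { url, init } = buildRunRequest(token, { source_a_id, source_b_id });
//        const response = await fetch(url, init);
```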

Response 200 OK

{
  "run_id": "5f8c...",
  "match_count": 182,
  "total_comparisons": 240,
  "wrote_edges": false,
  "fields": ["name", "email", "phone"],
  "summary": {
    "match_count": 182,
    "total_comparisons": 240,
    "matches": [
      {
        "cluster_id": 0,
        "members": [
          { "party": "party_a", "row": 14 },
          { "party": "party_b", "row": 203 }
        ]
      }
    ]
  }
}

Field	Type	Description
`run_id`	string	UUID of the persisted run; use it with `GET /runs/{id}` and `/download`
`match_count`	integer	Number of cross-party match clusters found
`total_comparisons`	integer	Candidate pairs evaluated after blocking
`wrote_edges`	boolean	Whether graph edges were actually written (see note)
`fields`	string[]	The fields the linkage actually keyed on (after auto-config + intersection)
`summary.matches`	array	Per-cluster membership, identified only by `party` (`party_a` / `party_b`) and 0-based `row` index — no raw values

Note: match_count and row indices are the only identifying output. The row is the 0-based position in each source's raw_records (ordered by row_number), so you can join back to your own copy of the data — but the linkage never returns the underlying field values.
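
Joining the report back to local data is an index lookup. A minimal sketch, assuming you hold each source as an array in the same order it was ingested (so that array index equals the 0-based `row`); `materialize` and the interface names are hypothetical:

```typescript
interface Member { party: "party_a" | "party_b"; row: number }
interface MatchCluster { cluster_id: number; members: Member[] }

// Resolve each cluster's (party, row) pairs against your own copies of the two sources.
function materialize<T>(matches: MatchCluster[], rowsA: T[], rowsB: T[]) {
  return matches.map(({ cluster_id, members }) => ({
    cluster_id,
    a: members.filter((m) => m.party === "party_a").map((m) => rowsA[m.row]),
    b: members.filter((m) => m.party === "party_b").map((m) => rowsB[m.row]),
  }));
}

// Example with toy local data; the API itself never returns these values.
const joined = materialize(
  [{ cluster_id: 0, members: [{ party: "party_a", row: 1 }, { party: "party_b", row: 0 }] }],
  [{ name: "Ann" }, { name: "Bob" }],
  [{ name: "Robert" }],
);
```

If the local copy is ordered differently from the ingested source, the indices will point at the wrong records, so it is worth checking row counts before joining.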

Errors

Status	Body	Meaning
`400 Bad Request`	`{detail: "..."}`	`PprlInputError` — a source is empty, or the two sources share no usable link field. The `detail` string is human-readable.
`402 Payment Required`	`{detail: {error: "quota_exceeded", gate: "pprl_runs_per_day", limit, current}}`	Hit the plan's daily run cap (Free 5/day, Pro 50/day)
`429 Too Many Requests`	—	Exceeded `30/hour` rate limit
`500 Internal Server Error`	`{detail: "..."}`	Unexpected failure (logged + audited)
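
One way to handle these responses on the client is to fold the status and body into a small tagged union. The sketch below follows the body shapes in the table; `classifyFailure` and the `PprlFailure` type are hypothetical and are not part of the shipped client.

```typescript
type PprlFailure =
  | { kind: "input"; message: string }                  // 400
  | { kind: "quota"; limit: number; current: number }   // 402
  | { kind: "rate_limited" }                            // 429
  | { kind: "server"; message: string };                // 500 and anything else

// Map an error status and its parsed JSON body (if any) to a typed failure.
function classifyFailure(status: number, body: unknown): PprlFailure {
  const detail = (body as { detail?: unknown } | null | undefined)?.detail;
  const text = typeof detail === "string" ? detail : "";
  switch (status) {
    case 400:
      return { kind: "input", message: text };
    case 402: {
      const quota = (typeof detail === "object" && detail !== null ? detail : {}) as {
        limit?: number;
        current?: number;
      };
      return { kind: "quota", limit: Number(quota.limit), current: Number(quota.current) };
    }
    case 429:
      return { kind: "rate_limited" }; // no body documented for this status
    default:
      return { kind: "server", message: text || `HTTP ${status}` };
  }
}
```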

Tip: For org callers, every run emits a pprl.run row in the audit_log (pprl.run.failed on error). The audit record carries match_count, fields, and the source ids — never raw data or the HMAC key.

Edge writes (`write_edges`)

write_edges: true asks the linkage to persist its cross-party matches into the identity graph as entity_edges rows tagged kind='pprl' (see The identity graph). This is double-gated:

1. It only happens when you pass write_edges: true, and
2. the server-side IDENTITY_GRAPH_EDGES_ENABLED flag is on (off by default).

When the flag is off, the request succeeds and the linkage report is unaffected — wrote_edges comes back false and the skip is logged. Edge writing is fail-soft: any error sets wrote_edges: false and never fails the linkage report. Match clusters larger than PPRL_MAX_CLUSTER_SIZE (default 20) are skipped for edge translation.
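
The gating and fail-soft behavior described above can be modeled as follows. This is a sketch of the documented rules only: `writeEdgesFailSoft` and its `write` callback are hypothetical, and the real server code may be structured differently.

```typescript
interface EdgeCluster { cluster_id: number; members: { party: string; row: number }[] }

const PPRL_MAX_CLUSTER_SIZE = 20; // documented default

// Returns the value reported as wrote_edges. Never throws, so the linkage report is unaffected.
function writeEdgesFailSoft(
  requested: boolean,        // write_edges from the request
  flagEnabled: boolean,      // IDENTITY_GRAPH_EDGES_ENABLED
  clusters: EdgeCluster[],
  write: (eligible: EdgeCluster[]) => void, // hypothetical persistence callback
): boolean {
  if (!requested || !flagEnabled) return false; // both gates must be open
  try {
    // Oversized clusters are skipped rather than translated into edges.
    write(clusters.filter((c) => c.members.length <= PPRL_MAX_CLUSTER_SIZE));
    return true;
  } catch {
    return false; // fail-soft: an edge-write error only flips wrote_edges
  }
}
```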

`GET /runs` [AUTH]

List the caller's PPRL runs, newest first.

Query parameters

Param	Type	Default	Description
`limit`	integer	`50`	Page size, `1`–`200`
`offset`	integer	`0`	Rows to skip

Response 200 OK

{
  "runs": [
    {
      "id": "5f8c...",
      "org_id": "0123...",
      "created_by": "user_2abc...",
      "source_a_id": "0b4f...",
      "source_b_id": "9c12...",
      "fields": ["name", "email", "phone"],
      "threshold": 0.85,
      "status": "completed",
      "match_count": 182,
      "total_comparisons": 240,
      "summary": { "match_count": 182, "total_comparisons": 240, "matches": [] },
      "matches": [],
      "wrote_edges": false,
      "error": null,
      "created_at": "2026-06-13T05:00:00Z"
    }
  ]
}

Visibility mirrors Destinations: a run is visible when created_by is you or its org_id matches your org. Solo users (no org) only ever see their own runs.
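
The visibility rule reduces to a two-clause predicate. A minimal sketch, assuming a solo caller is represented by a `null` org id (the `Caller` shape and `isRunVisible` are hypothetical):

```typescript
interface RunOwnership { created_by: string; org_id: string | null }
interface Caller { user_id: string; org_id: string | null }

// A run is visible to its creator, or to any caller in the run's org.
function isRunVisible(run: RunOwnership, caller: Caller): boolean {
  if (run.created_by === caller.user_id) return true;
  // Solo callers have no org, so the org clause can never match for them.
  return caller.org_id !== null && run.org_id === caller.org_id;
}
```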

`GET /runs/{id}` [AUTH]

Fetch a single run by id. The response has the same shape as one element of the runs array above.

Errors

Status	Meaning
`404 Not Found`	Run doesn't exist, or isn't visible to the caller

`GET /runs/{id}/download` [AUTH]

Download the run's match clusters as a CSV. One row per cluster member.

Response 200 OK — text/csv with Content-Disposition: attachment; filename="pprl_run_{id}.csv":

cluster_id,party,row
0,party_a,14
0,party_b,203
1,party_a,57
1,party_b,88

The CSV is the PII-free mapping — party + 0-based row index only. Join it back to your own copy of each source to materialize the matched records on your side.
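
Before joining, it can help to regroup the CSV rows by cluster. The parser below is a small sketch that assumes exactly the three-column layout shown above, with no quoting or embedded commas; `parseMappingCsv` is a hypothetical helper.

```typescript
interface CsvMember { party: string; row: number }

// Group "cluster_id,party,row" lines into cluster_id -> members.
function parseMappingCsv(csv: string): Map<number, CsvMember[]> {
  const lines = csv.trim().split(/\r?\n/);
  if (lines[0] !== "cluster_id,party,row") {
    throw new Error(`unexpected CSV header: ${lines[0]}`);
  }
  const clusters = new Map<number, CsvMember[]>();
  for (const line of lines.slice(1)) {
    if (line.trim() === "") continue; // tolerate blank lines
    const [clusterId = "", party = "", row = ""] = line.split(",");
    const id = Number(clusterId);
    const members = clusters.get(id) ?? [];
    members.push({ party, row: Number(row) });
    clusters.set(id, members);
  }
  return clusters;
}

const mapping = parseMappingCsv(
  "cluster_id,party,row\n0,party_a,14\n0,party_b,203\n1,party_a,57\n1,party_b,88\n",
);
// mapping.get(0) holds the two members of cluster 0: party_a row 14 and party_b row 203.
```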

Errors

Status	Meaning
`404 Not Found`	Run doesn't exist, or isn't visible to the caller

Quotas

PPRL runs are metered per day (UTC) against the org's plan:

Plan	Runs / day
Free	5
Pro	50

Hitting the cap returns 402 with {detail: {error: "quota_exceeded", gate: "pprl_runs_per_day", limit, current}}. The frontend client maps this to the standard quota-exceeded window event, so the upgrade modal fires automatically.
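
To make the metering concrete, here is a sketch of a per-day cap check. It assumes runs are bucketed by UTC calendar date and that `current` is the number of runs already counted today; the function names and the exact comparison are illustrative, not taken from the server code.

```typescript
type Plan = "free" | "pro";

const RUNS_PER_DAY: Record<Plan, number> = { free: 5, pro: 50 };

// UTC day bucket for a timestamp, e.g. "2026-06-13".
function utcDay(now: Date): string {
  return now.toISOString().slice(0, 10);
}

// Returns null when the run may proceed, or the 402 payload when the cap is reached.
function checkDailyQuota(plan: Plan, runsToday: number) {
  const limit = RUNS_PER_DAY[plan];
  if (runsToday < limit) return null;
  return {
    status: 402,
    detail: { error: "quota_exceeded", gate: "pprl_runs_per_day", limit, current: runsToday },
  };
}
```

Because the bucket is UTC, a run late in the evening in a western time zone can already count against the next day's allowance.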

TypeScript client

The repo ships a thin typed client at frontend/lib/pprl.ts:

import { runLinkage, downloadPprlMapping, PprlInputError } from '@/lib/pprl'

try {
  const result = await runLinkage(token, {
    source_a_id: sourceA,
    source_b_id: sourceB,
    // fields omitted → auto-configured from schema shape
    threshold: 0.85,
  })
  console.log(`${result.match_count} matches on ${result.fields.join(', ')}`)

  // Pull the PII-free cluster -> (party, row) mapping as a CSV download
  await downloadPprlMapping(token, result.run_id)
} catch (e) {
  if (e instanceof PprlInputError) {
    // 400: surface e.message (e.g. "No shared link fields ...")
  } else {
    throw e // don't silently swallow rate-limit or server errors
  }
}

runLinkage handles the 402 → quota-exceeded dispatch automatically, so <QuotaExceededModal /> lights up without per-callsite wiring.
