PPRL Linkage

REST endpoints for privacy-preserving record linkage (PPRL). Link two sources on encoded identifiers without exposing raw PII — ephemeral per-run keys, PII-free match reports.

Endpoints for privacy-preserving record linkage — matching records across two sources on encoded identifiers (cryptographic long-term keys / Bloom filters) instead of raw values. Use it when the two datasets cannot share plaintext PII but you still need to know which records refer to the same entity.

This is the API behind the /pprl linkage page. For the theory, see Privacy-preserving record linkage.

How it works (Mode A)

Phase 1 ships Mode A: the platform already holds both sources (you ingested them via the Sources API), so it runs the whole linkage server-side:

  1. Load both sources from golden_store.raw_records.
  2. Resolve the link fields — either the ones you pass, or auto-configured from the schema shape — then intersect them with the columns both sources actually have.
  3. Generate an ephemeral per-run HMAC key (secrets.token_hex(32)). This key is never persisted and never logged — it exists only for the duration of the run.
  4. Encode each side's link fields into CLK bit vectors under that key and compare them (Jaccard over the bits).
  5. Return a PII-free match report: cluster membership by party + row index. Raw field values never appear in the response or in storage.
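
Steps 3 and 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: the filter length, the bigram tokenization, the number of bit positions per token, and all function names are hypothetical. Only the ephemeral HMAC key and the Jaccard comparison over bits come from the description above; randomBytes(32) is used here as the Node analogue of secrets.token_hex(32).

```typescript
import { createHmac, randomBytes } from 'node:crypto'

const FILTER_BITS = 1024 // assumed filter length
const HASHES_PER_TOKEN = 4 // assumed number of bit positions per token

// Ephemeral per-run key: held in memory only, never persisted or logged.
function newRunKey(): Buffer {
  return randomBytes(32)
}

// Split a normalized value into padded character bigrams (assumed tokenization).
function bigrams(value: string): string[] {
  const padded = ` ${value.trim().toLowerCase()} `
  const tokens: string[] = []
  for (let i = 0; i < padded.length - 1; i++) tokens.push(padded.slice(i, i + 2))
  return tokens
}

// Encode a record's link fields into a CLK bit vector under the run key.
function encodeClk(key: Buffer, record: Record<string, string>, fields: string[]): Uint8Array {
  const bits = new Uint8Array(FILTER_BITS) // one byte per bit, for readability
  for (const field of fields) {
    const value = record[field]
    if (!value) continue // skip missing or empty fields
    for (const token of bigrams(value)) {
      for (let k = 0; k < HASHES_PER_TOKEN; k++) {
        const digest = createHmac('sha256', key).update(`${field}|${k}|${token}`).digest()
        bits[digest.readUInt32BE(0) % FILTER_BITS] = 1
      }
    }
  }
  return bits
}

// Jaccard similarity over set bits: |A ∩ B| / |A ∪ B| (vectors of equal length).
function jaccard(a: Uint8Array, b: Uint8Array): number {
  let shared = 0
  let union = 0
  for (let i = 0; i < a.length; i++) {
    if (a[i] && b[i]) shared++
    if (a[i] || b[i]) union++
  }
  return union === 0 ? 0 : shared / union
}
```

Because both sides are encoded under the same run key, identical values produce identical bit vectors, and the key can be discarded once the comparison is done. A pair counts as a match when its similarity reaches the request's threshold.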

Note: Mode B — where each party computes its own CLKs locally and uploads only the encoded PartyData, so the platform never sees either side's raw values — is a documented follow-on and is not implemented in Phase 1.

Base URL

https://backend-production-5c86.up.railway.app

All endpoints require a Clerk JWT Bearer token ([AUTH]). See Authentication.

POST /api/pprl/run [AUTH] [RATE-LIMITED 30/hr]

Run a two-party CLK linkage between two sources and return a PII-free match summary. The run is also persisted (see GET /runs).

Request

{
  "source_a_id": "0b4f...",
  "source_b_id": "9c12...",
  "fields": ["name", "email", "phone"],
  "threshold": 0.85,
  "write_edges": false
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| source_a_id | string | Yes | UUID of the first source (party A) |
| source_b_id | string | Yes | UUID of the second source (party B) |
| fields | string[] or null | No | Columns to link on. Omit / null → auto-configured from party A's schema. Either way, fields are intersected with both sources' columns. |
| threshold | number | No | Similarity cutoff, 0.0 to 1.0 (default 0.85). Higher = stricter (fewer, more-confident matches). |
| write_edges | boolean | No | When true, persist cross-party matches as graph edges. Opt-in and gated; see the note below (default false). |

Response 200 OK

{
  "run_id": "5f8c...",
  "match_count": 182,
  "total_comparisons": 240,
  "wrote_edges": false,
  "fields": ["name", "email", "phone"],
  "summary": {
    "match_count": 182,
    "total_comparisons": 240,
    "matches": [
      {
        "cluster_id": 0,
        "members": [
          { "party": "party_a", "row": 14 },
          { "party": "party_b", "row": 203 }
        ]
      }
    ]
  }
}

| Field | Type | Description |
| --- | --- | --- |
| run_id | string | UUID of the persisted run; use it with GET /runs/{id} and /download |
| match_count | integer | Number of cross-party match clusters found |
| total_comparisons | integer | Candidate pairs evaluated after blocking |
| wrote_edges | boolean | Whether graph edges were actually written (see note) |
| fields | string[] | The fields the linkage actually keyed on (after auto-config + intersection) |
| summary.matches | array | Per-cluster membership, identified only by party (party_a / party_b) and 0-based row index; no raw values |

Note: match_count and row indices are the only identifying output. The row is the 0-based position in each source's raw_records (ordered by row_number), so you can join back to your own copy of the data — but the linkage never returns the underlying field values.
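
Joining the report back to your own data is a matter of indexing into your local copies by row. A minimal sketch, assuming your local arrays are in the same order as the ingested sources; the type and function names are hypothetical:

```typescript
interface ClusterMember {
  party: 'party_a' | 'party_b'
  row: number // 0-based position in the source's raw_records
}

interface ReportCluster {
  cluster_id: number
  members: ClusterMember[]
}

// Resolve each cluster member to the caller's own copy of the record.
// rowsA / rowsB must be ordered exactly like the ingested sources.
function materialize<T>(
  clusters: ReportCluster[],
  rowsA: T[],
  rowsB: T[],
): { cluster_id: number; a: T[]; b: T[] }[] {
  return clusters.map((c) => ({
    cluster_id: c.cluster_id,
    a: c.members.filter((m) => m.party === 'party_a').map((m) => rowsA[m.row]),
    b: c.members.filter((m) => m.party === 'party_b').map((m) => rowsB[m.row]),
  }))
}
```

The raw values only ever exist on your side of the join; the report contributes nothing but the (party, row) pairs.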

Errors

| Status | Body | Meaning |
| --- | --- | --- |
| 400 Bad Request | {detail: "..."} | PprlInputError: a source is empty, or the two sources share no usable link field. The detail string is human-readable. |
| 402 Payment Required | {detail: {error: "quota_exceeded", gate: "pprl_runs_per_day", limit, current}} | Hit the plan's daily run cap (Free 5/day, Pro 50/day) |
| 429 Too Many Requests | | Exceeded 30/hour rate limit |
| 500 Internal Server Error | {detail: "..."} | Unexpected failure (logged + audited) |

Tip: For org callers, every run emits a pprl.run row in the audit_log (pprl.run.failed on error). The audit record carries match_count, fields, and the source ids — never raw data or the HMAC key.
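
If you call the endpoint without the bundled client, the error table above maps onto a small classifier. This is a sketch only: the PprlFailure type and classifyFailure function are hypothetical names, and the fallback values for missing quota fields are an assumption.

```typescript
type PprlFailure =
  | { kind: 'input'; message: string }
  | { kind: 'quota'; gate: string; limit: number; current: number }
  | { kind: 'rate_limited' }
  | { kind: 'server'; message: string }
  | { kind: 'unknown'; status: number }

// Map an HTTP status and parsed JSON body onto the documented error cases.
function classifyFailure(status: number, body: unknown): PprlFailure {
  const detail = (body as { detail?: unknown } | null)?.detail
  if (status === 400 && typeof detail === 'string') {
    return { kind: 'input', message: detail }
  }
  if (status === 402 && typeof detail === 'object' && detail !== null) {
    const d = detail as { gate?: string; limit?: number; current?: number }
    // Missing fields fall back to placeholders (assumption, not documented behavior).
    return { kind: 'quota', gate: d.gate ?? '', limit: d.limit ?? 0, current: d.current ?? 0 }
  }
  if (status === 429) return { kind: 'rate_limited' }
  if (status === 500) {
    return { kind: 'server', message: typeof detail === 'string' ? detail : '' }
  }
  return { kind: 'unknown', status }
}
```

Keeping the quota case separate from the rate-limit case matters in practice: a 402 calls for an upgrade prompt, while a 429 calls for waiting and retrying.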

Edge writes (write_edges)

write_edges: true asks the linkage to persist its cross-party matches into the identity graph as entity_edges rows tagged kind='pprl' (see The identity graph). This is double-gated:

  • It only happens when you pass write_edges: true, and
  • the server-side IDENTITY_GRAPH_EDGES_ENABLED flag is on (off by default).

When the flag is off, the request succeeds and the linkage report is unaffected — wrote_edges simply comes back false and the skip is logged. Edge writing is fail-soft: any error sets wrote_edges: false and never fails the linkage report. Match clusters larger than PPRL_MAX_CLUSTER_SIZE (default 20) are skipped for edge translation.
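
The double gate and the fail-soft behavior can be pictured as follows. This is a hypothetical sketch, not the server code: the function name, its parameters, and the persist callback are illustrative, and returning true when every cluster was skipped for size is an assumption the text above does not settle.

```typescript
interface MatchCluster {
  cluster_id: number
  members: { party: string; row: number }[]
}

// Returns the value reported as wrote_edges.
function writeEdgesIfAllowed(
  requested: boolean, // write_edges from the request
  flagEnabled: boolean, // IDENTITY_GRAPH_EDGES_ENABLED on the server
  clusters: MatchCluster[],
  persist: (cluster: MatchCluster) => void, // hypothetical edge writer
  maxClusterSize = 20, // PPRL_MAX_CLUSTER_SIZE default
): boolean {
  // Either gate closed: skip the write, the linkage report is unaffected.
  if (!requested || !flagEnabled) return false
  try {
    for (const cluster of clusters) {
      // Oversized clusters are skipped for edge translation.
      if (cluster.members.length > maxClusterSize) continue
      persist(cluster)
    }
    return true
  } catch {
    // Fail-soft: an edge-write error never fails the linkage report.
    return false
  }
}
```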

GET /api/pprl/runs [AUTH]

List the caller's PPRL runs, newest first.

Query parameters

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| limit | integer | 50 | Page size, 1 to 200 |
| offset | integer | 0 | Rows to skip |
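
Paging through runs with limit and offset comes down to simple arithmetic. A minimal sketch; the helper names are hypothetical, and treating a short page as the last page is a common convention rather than documented behavior:

```typescript
const BASE_URL = 'https://backend-production-5c86.up.railway.app'

// Build the URL for a zero-based page of runs.
function runsPageUrl(page: number, pageSize = 50): string {
  const offset = page * pageSize
  return `${BASE_URL}/api/pprl/runs?limit=${pageSize}&offset=${offset}`
}

// A page shorter than the requested size means there is nothing left to fetch.
function isLastPage(returnedRows: number, pageSize: number): boolean {
  return returnedRows < pageSize
}
```

Each request still needs the Clerk JWT in an Authorization: Bearer header.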

Response 200 OK

{
  "runs": [
    {
      "id": "5f8c...",
      "org_id": "0123...",
      "created_by": "user_2abc...",
      "source_a_id": "0b4f...",
      "source_b_id": "9c12...",
      "fields": ["name", "email", "phone"],
      "threshold": 0.85,
      "status": "completed",
      "match_count": 182,
      "total_comparisons": 240,
      "summary": { "match_count": 182, "total_comparisons": 240, "matches": [] },
      "matches": [],
      "wrote_edges": false,
      "error": null,
      "created_at": "2026-06-13T05:00:00Z"
    }
  ]
}

Visibility mirrors Destinations: a run is visible when created_by is you or its org_id matches your org. Solo users (no org) only ever see their own runs.
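
Expressed as a predicate, the visibility rule looks roughly like this. The types and names are hypothetical, and representing a solo user as a null org id is an assumption:

```typescript
interface RunOwnership {
  created_by: string
  org_id: string | null
}

interface Caller {
  userId: string
  orgId: string | null // null for solo users (assumed representation)
}

// A run is visible to its creator, or to any member of the run's org.
function isRunVisible(run: RunOwnership, caller: Caller): boolean {
  if (run.created_by === caller.userId) return true
  return caller.orgId !== null && run.org_id === caller.orgId
}
```

The explicit null check keeps two solo users from matching each other through a shared missing org id.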

GET /api/pprl/runs/{run_id} [AUTH]

Fetch a single run by id. Same shape as one element of the runs array above.

Errors

| Status | Meaning |
| --- | --- |
| 404 Not Found | Run doesn't exist, or isn't visible to the caller |

GET /api/pprl/runs/{run_id}/download [AUTH]

Download the run's match clusters as a CSV. One row per cluster member.

Response 200 OK (text/csv), with Content-Disposition: attachment; filename="pprl_run_{id}.csv":

cluster_id,party,row
0,party_a,14
0,party_b,203
1,party_a,57
1,party_b,88

The CSV is the PII-free mapping — party + 0-based row index only. Join it back to your own copy of each source to materialize the matched records on your side.
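
A minimal sketch of reading the mapping back in, assuming the three-column layout shown above and no quoted fields; the function names are hypothetical:

```typescript
interface MappingRow {
  cluster_id: number
  party: string
  row: number
}

// Parse the downloaded CSV into typed rows, checking the header first.
function parseMappingCsv(csv: string): MappingRow[] {
  const lines = csv.trim().split(/\r?\n/)
  if (lines[0] !== 'cluster_id,party,row') {
    throw new Error(`unexpected header: ${lines[0]}`)
  }
  return lines.slice(1).map((line) => {
    const cells = line.split(',')
    return { cluster_id: Number(cells[0]), party: cells[1], row: Number(cells[2]) }
  })
}

// Group rows by cluster so each entry lists every member of one match.
function groupByCluster(rows: MappingRow[]): Record<number, MappingRow[]> {
  const groups: Record<number, MappingRow[]> = {}
  for (const r of rows) {
    if (!groups[r.cluster_id]) groups[r.cluster_id] = []
    groups[r.cluster_id].push(r)
  }
  return groups
}
```

From there, each row index can be used to look up the matching record in your own copy of the corresponding source.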

Errors

| Status | Meaning |
| --- | --- |
| 404 Not Found | Run doesn't exist, or isn't visible to the caller |

Quotas

PPRL runs are metered per day (UTC) against the org's plan:

| Plan | Runs / day |
| --- | --- |
| Free | 5 |
| Pro | 50 |

Hitting the cap returns 402 with {gate: "pprl_runs_per_day", limit, current}. The frontend client maps this to the standard quota-exceeded window event, so the upgrade modal fires automatically.

TypeScript client

The repo ships a thin typed client at frontend/lib/pprl.ts:

import { runLinkage, downloadPprlMapping, PprlInputError } from '@/lib/pprl'

try {
  const result = await runLinkage(token, {
    source_a_id: sourceA,
    source_b_id: sourceB,
    // fields omitted → auto-configured from schema shape
    threshold: 0.85,
  })
  console.log(`${result.match_count} matches on ${result.fields.join(', ')}`)

  // Pull the PII-free cluster -> (party, row) mapping as a CSV download
  await downloadPprlMapping(token, result.run_id)
} catch (e) {
  if (e instanceof PprlInputError) {
    // 400: surface e.message (e.g. "No shared link fields ...")
  } else {
    throw e // don't silently swallow unexpected failures (429, 500, network errors)
  }
}

runLinkage handles the 402 quota-exceeded dispatch automatically, so <QuotaExceededModal /> lights up without per-callsite wiring.
