2026-05-15/Ben Severn

Phoenix Spree Deutschland: one cluster from raw leak to GLEIF anchor

A 9-member ICIJ cluster, 100% GLEIF-anchored, walked from source rows through GoldenMatch dedupe to a finished provenance report.

goldenmatch, entity-resolution, icij, gleif, case-study

The companion engineering post walks through the pipeline that ingests ICIJ Offshore Leaks, GLEIF Golden Copy, OpenSanctions, and the UK PSC register into a single 4.1M-row company table, then dedupes and list-matches it on a Railway service. This post is the proof that the pipeline works on real data: one cluster, end to end, from the raw ICIJ rows to a finished provenance report.

I picked Phoenix Spree Deutschland for this walkthrough because it's the cleanest non-trivial cluster the matcher produced. Nine ICIJ records, all in Jersey, all numbered variants of the same company name, and all nine match to GLEIF LEI records, eight of them with a perfect score. Nothing controversial: Phoenix Spree Deutschland is a publicly traded investor in German residential property with a perfectly normal corporate structure. The cluster is a test of whether the pipeline reconstructs that structure from public-leak data alone.

Full case study with all 323 relationship edges: reports/case_studies/503264_phoenix_spree_deutschland.md. Notebook walkthrough: notebooks/01_case_study.ipynb.

The raw inputs

The cluster's nine ICIJ records, before any matching:

| entity_uid | name | jurisdiction |
|---|---|---|
| icij:82015255 | Phoenix Spree Deutschland I Limited | je |
| icij:82015256 | Phoenix Spree Deutschland II Limited | je |
| icij:82015257 | Phoenix Spree Deutschland III Limited | je |
| icij:82015311 | Phoenix Spree Deutschland IV Limited | je |
| icij:82015312 | Phoenix Spree Deutschland V Limited | je |
| icij:82015313 | Phoenix Spree Deutschland VI Limited | je |
| icij:82015339 | Phoenix Spree Deutschland VII Limited | je |
| icij:82015340 | Phoenix Spree Deutschland IX Limited | je |
| icij:82015408 | Phoenix Spree Deutschland Limited | je |

These came from the Paradise Papers leak (the Appleby subset). Note the absence of VIII: either the numbering skipped at some point after the original incorporation, or VIII exists and simply is not in this leak. The unnumbered "Phoenix Spree Deutschland Limited" likely sits at the top of the structure. That's the kind of inference you make from cluster shape, not from any single row.

Why this isn't trivial for the matcher

You might look at the table above and think "these are obviously the same company family — just regex out the Roman numeral." That works for this exact case. It fails the moment you point it at the next corpus, where the variants are Holdings I, Holdings (No 1), Holdings 1 Ltd, Holdings Ltd 1, and 1 Holdings Ltd. The matcher needs a representation that survives all of those.

GoldenMatch's token_sort comparator handles this well. It tokenizes the normalized name, sorts the tokens, and computes set similarity. The numerals are individual tokens; the sort puts them in canonical order; the set comparison rewards the shared-token overlap (phoenix, spree, deutschland, limited) regardless of where the numeral lands in the original string.
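To make the idea concrete, here is a minimal standard-library sketch of a token-sort comparison. It is not GoldenMatch's implementation: the helper names `token_sort_key` and `token_sort_similarity` are made up for illustration, and `difflib.SequenceMatcher` stands in for whatever similarity function the real comparator uses, so the scores are not comparable to the ones reported in this post.

```python
import re
from difflib import SequenceMatcher


def token_sort_key(name: str) -> str:
    """Lowercase, drop punctuation, split into tokens, sort them, and rejoin."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def token_sort_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the sorted-token forms of two names."""
    return SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()


# Token order stops mattering once both sides are sorted.
same = token_sort_similarity("Holdings Ltd 1", "1 Holdings Ltd")

# Siblings differ by one numeral token but share every other token.
siblings = token_sort_similarity(
    "Phoenix Spree Deutschland I Limited",
    "Phoenix Spree Deutschland II Limited",
)

print(same, round(siblings, 3))
```

The point of the sketch is the normalization step: a regex keyed to Roman numerals has to know where the numeral sits, while a sorted token list does not.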

The intra-cluster same_as pairs after dedupe are dense — every member matches every other member at score ≥ 0.92:

| left | right | cluster |
|---|---|---|
| icij:82015255 | icij:82015256 | 503264 |
| icij:82015255 | icij:82015257 | 503264 |
| icij:82015255 | icij:82015311 | 503264 |

... 36 pairs total

That density is what you want. A cluster where every member matches every other member is the matcher's way of saying "I am very confident this is one identity, not three pairs of identities that happen to overlap."
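The 36 is not arbitrary: nine members give 9 × 8 / 2 = 36 unordered pairs, so 36 same_as pairs means the cluster is a complete graph. A rough way to check that for any cluster is sketched below. The `missing_pairs` helper is hypothetical, and the `pairs` list is a stand-in built with `itertools.combinations` rather than the real same_as output.

```python
from itertools import combinations

members = [
    "icij:82015255", "icij:82015256", "icij:82015257",
    "icij:82015311", "icij:82015312", "icij:82015313",
    "icij:82015339", "icij:82015340", "icij:82015408",
]


def missing_pairs(members, pairs):
    """Return the member pairs that have no same_as edge between them."""
    seen = {frozenset(p) for p in pairs}
    return [c for c in combinations(sorted(members), 2) if frozenset(c) not in seen]


# Stand-in for the real same_as output: every pair present, as reported above.
pairs = list(combinations(members, 2))
expected = len(members) * (len(members) - 1) // 2
print(expected, len(pairs), missing_pairs(members, pairs))

# Dropping one edge shows what a less dense cluster would report.
print(missing_pairs(members, pairs[1:]))
```

An empty result means every member matched every other member; anything else lists exactly which pairs the matcher was not willing to join.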

Anchoring to GLEIF

The dedupe step shows that the nine ICIJ records describe the same identity family. The list-match step against GLEIF checks that confidence against an external registry, and every ICIJ record links to a real LEI:

| icij uid | gleif lei | gleif name | score | band |
|---|---|---|---|---|
| icij:82015255 | 529900MQU3XI11P2FM74 | PHOENIX SPREE DEUTSCHLAND I LIMITED | 1.000 | perfect |
| icij:82015256 | 529900Z5VU6N5X7FKD32 | PHOENIX SPREE DEUTSCHLAND II LIMITED | 1.000 | perfect |
| icij:82015257 | 529900J6YOTOZ4XU6086 | PHOENIX SPREE DEUTSCHLAND III LIMITED | 1.000 | perfect |
| icij:82015311 | 529900HE7LFLCQMZ7Y77 | PHOENIX SPREE DEUTSCHLAND IV LIMITED | 1.000 | perfect |
| icij:82015312 | 529900MYUEKOW0KRP666 | PHOENIX SPREE DEUTSCHLAND V LIMITED | 1.000 | perfect |
| icij:82015313 | 529900ZH4XA9K4B3EP79 | PHOENIX SPREE DEUTSCHLAND VII LIMITED | 0.987 | high |
| icij:82015339 | 529900ZH4XA9K4B3EP79 | PHOENIX SPREE DEUTSCHLAND VII LIMITED | 1.000 | perfect |
| icij:82015340 | 529900D1RM39KSHWIV43 | PHOENIX SPREE DEUTSCHLAND IX LIMITED | 1.000 | perfect |
| icij:82015408 | 213800OR6IIJPG98AG39 | PHOENIX SPREE DEUTSCHLAND LIMITED | 1.000 | perfect |

Eight perfect scores and one near-perfect (0.987). The near-perfect is the only interesting row: icij:82015313 is labelled Phoenix Spree Deutschland VI Limited in the ICIJ data but matched to the LEI registered for PHOENIX SPREE DEUTSCHLAND VII LIMITED, the same LEI that icij:82015339 (VII) matches at 1.000. So the nine ICIJ records anchor to eight distinct LEIs, and no LEI for a VI entity appears among the matches. That's either a numbering inconsistency between the leak source and the GLEIF registry, or a stale ICIJ snapshot that predates a rename. Either way, it's exactly the kind of finding the pipeline should surface: a confident-but-not-perfect anchor that flags itself for human review.
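A check like this is easy to automate. The sketch below is one possible way to build a review queue from the match rows; the `review_queue` function and its three rules are illustrative assumptions rather than part of the pipeline, and the rows are copied from the tables in this post.

```python
from collections import defaultdict

# (icij_uid, icij_name, gleif_lei, gleif_name, score), copied from the tables in this post.
anchors = [
    ("icij:82015255", "Phoenix Spree Deutschland I Limited", "529900MQU3XI11P2FM74", "PHOENIX SPREE DEUTSCHLAND I LIMITED", 1.000),
    ("icij:82015256", "Phoenix Spree Deutschland II Limited", "529900Z5VU6N5X7FKD32", "PHOENIX SPREE DEUTSCHLAND II LIMITED", 1.000),
    ("icij:82015257", "Phoenix Spree Deutschland III Limited", "529900J6YOTOZ4XU6086", "PHOENIX SPREE DEUTSCHLAND III LIMITED", 1.000),
    ("icij:82015311", "Phoenix Spree Deutschland IV Limited", "529900HE7LFLCQMZ7Y77", "PHOENIX SPREE DEUTSCHLAND IV LIMITED", 1.000),
    ("icij:82015312", "Phoenix Spree Deutschland V Limited", "529900MYUEKOW0KRP666", "PHOENIX SPREE DEUTSCHLAND V LIMITED", 1.000),
    ("icij:82015313", "Phoenix Spree Deutschland VI Limited", "529900ZH4XA9K4B3EP79", "PHOENIX SPREE DEUTSCHLAND VII LIMITED", 0.987),
    ("icij:82015339", "Phoenix Spree Deutschland VII Limited", "529900ZH4XA9K4B3EP79", "PHOENIX SPREE DEUTSCHLAND VII LIMITED", 1.000),
    ("icij:82015340", "Phoenix Spree Deutschland IX Limited", "529900D1RM39KSHWIV43", "PHOENIX SPREE DEUTSCHLAND IX LIMITED", 1.000),
    ("icij:82015408", "Phoenix Spree Deutschland Limited", "213800OR6IIJPG98AG39", "PHOENIX SPREE DEUTSCHLAND LIMITED", 1.000),
]


def review_queue(rows):
    """Return {uid: [reasons]} for anchors that deserve a human look."""
    by_lei = defaultdict(list)
    for uid, _, lei, _, _ in rows:
        by_lei[lei].append(uid)

    reasons = defaultdict(list)
    for uid, icij_name, lei, gleif_name, score in rows:
        if icij_name.casefold() != gleif_name.casefold():
            reasons[uid].append("name differs from GLEIF")
        if len(by_lei[lei]) > 1:
            reasons[uid].append("LEI shared with another record")
        if score < 1.0:
            reasons[uid].append("score below 1.0")
    return dict(reasons)


distinct_leis = len({lei for _, _, lei, _, _ in anchors})
print(distinct_leis)
print(review_queue(anchors))
```

On these rows only the VI and VII records end up in the queue, which matches the reading above: the score alone flags one row, but the shared LEI is what shows the two rows are entangled.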

Note on "100% anchored". Every cluster member found a GLEIF match. That is the strongest evidence available from public sources that these are real, registered legal entities, not paper-only shells. LEIs are issued by GLEIF-accredited issuers, which validate an entity's self-registered reference data against an authoritative source such as a business registry; presence in GLEIF is a positive signal of legitimacy.

The relationship graph

The 9 cluster members are not isolated in the source data. ICIJ records 323 relationship edges incident to the cluster: directors, secretaries, powers of attorney, and so on. Two structural patterns dominate, and a third is notable by its absence:

Shared officers. Several human directors appear across multiple cluster members at overlapping dates. For example, icij:80061377 is a director of icij:82015255 from 02-APR-2007 onward and a director of icij:82015257 from the same date. That's not surprising — corporate families typically have shared board members — but it's the structural evidence that ties the cluster together independent of the name similarity.

Shared powers of attorney. A pool of about a dozen named individuals holds time-bounded powers of attorney across multiple cluster members. The dates align with what you'd expect for routine corporate administration through a service provider (Appleby in this case). Most POA windows are one year, renewed.

No address-sharing surprises. The cluster does not surface an unusual registered-address fanout. That matters because the address-cluster pass is one of the strongest signals for shell-company structures — Portcullis TrustNet Chambers in the same corpus hosts 33,858 distinct entities. Phoenix Spree's address pattern looks like an ordinary fund structure, not a mass-registration address.
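All three observations come down to the same operation: group edges by the shared node (an officer, a power-of-attorney holder, or an address) and count the distinct entities it touches. A minimal sketch, assuming edges are available as `(node, entity)` tuples, which is an assumption about shape and not the pipeline's actual schema. The only edges used are the two director relationships quoted above.

```python
from collections import defaultdict


def fanout(edges):
    """Map each shared node (officer, address, ...) to the set of entities it touches."""
    touched = defaultdict(set)
    for node, entity in edges:
        touched[node].add(entity)
    return touched


# The two director edges quoted above; a real run would load all 323.
edges = [
    ("icij:80061377", "icij:82015255"),
    ("icij:80061377", "icij:82015257"),
]

cluster = {
    "icij:82015255", "icij:82015256", "icij:82015257",
    "icij:82015311", "icij:82015312", "icij:82015313",
    "icij:82015339", "icij:82015340", "icij:82015408",
}

# Nodes attached to more than one cluster member tie the family together.
shared = {
    node: sorted(entities & cluster)
    for node, entities in fanout(edges).items()
    if len(entities & cluster) > 1
}
print(shared)
```

Run over address edges instead of officer edges, the same count is what separates an ordinary fund structure from a mass-registration address with tens of thousands of entities behind it.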

What the pipeline actually proved

Three things, in order of confidence:

  1. Identity. The nine ICIJ records describe distinct legal entities that belong to one corporate family, and each one anchors to an LEI in GLEIF: eight distinct LEIs in total, because the VI and VII records land on the same one. This is the strongest claim and the one the matcher is best at.
  2. Structure. The relationship graph shows the family has a consistent service-provider footprint (Appleby) and a stable shared-officer core. This is structural inference from edge patterns, not from any single record.
  3. Routine character. Nothing about the cluster's address pattern, officer pattern, or GLEIF anchoring profile suggests anomaly. The corpus contains many clusters that do look anomalous; this one doesn't.

The third claim is the easiest to misread. "Nothing looks anomalous" is not the same as "this is definitely fine." It means the pipeline didn't surface a signal worth chasing, given the public data we have. Public data is incomplete by definition.

The reproduction recipe

If you want to run this end to end against your own ICIJ + GLEIF inputs:

# 1. Build the unified company table
uv run python scripts/build_candidate_tables.py

# 2. Drop placeholder names + mega-blocks
uv run python scripts/filter_company_table.py

# 3. Run dedupe with the tuned config
uv run python scripts/run_goldenmatch_full.py --what company

# 4. Extract GLEIF as a reference table
uv run python scripts/extract_gleif_unified.py

# 5. List-match deduped entities against GLEIF
gm match-against \
  --target processed/company_entities.parquet \
  --against processed/gleif_unified.parquet \
  --run-name icij_os_vs_gleif

# 6. Publish to Postgres for query/notebooks
uv run python scripts/publish.py --what company

The cluster ID for Phoenix Spree (503264 in my run) won't be the same in yours — cluster IDs are assigned by GoldenMatch and depend on input order. The notebook walks through finding it by name search rather than by ID.
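For readers who skip the notebook, the lookup amounts to something like the sketch below. The row shape (`cluster_id` and `name` keys) and the `clusters_matching` helper are assumptions for illustration, not the published schema, and the last row is an invented non-matching record included only to show the filter working.

```python
from collections import Counter


def clusters_matching(rows, needle):
    """Count name hits per cluster ID, case-insensitively."""
    needle = needle.casefold()
    return Counter(r["cluster_id"] for r in rows if needle in r["name"].casefold())


# Three cluster members from this post, using the cluster ID from my run,
# plus one invented non-matching row.
rows = [
    {"cluster_id": 503264, "name": "Phoenix Spree Deutschland I Limited"},
    {"cluster_id": 503264, "name": "Phoenix Spree Deutschland II Limited"},
    {"cluster_id": 503264, "name": "Phoenix Spree Deutschland Limited"},
    {"cluster_id": 1, "name": "Example Unrelated Holdings Ltd"},  # hypothetical row
]

hits = clusters_matching(rows, "phoenix spree")
print(hits.most_common(1))
```

Whatever ID comes back in your run is the one to substitute wherever this post says 503264.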

Key takeaways

What's next

The third post in the series is the messier counterpart: 28 hand-curated seeds from public reporting about Jeffrey Epstein's corporate network, and what the matcher found (and didn't find) in the public-leak corpus. One corroborated lead, several dead ends, and a structural gap (the USVI registry isn't in any public dataset I had).

Full repo: benseverndev-oss/goldenmatch-shell-company-network. Try the matcher yourself: pip install goldenmatch.