2026-05-15/Ben Severn

Phoenix Spree Deutschland: one cluster from raw leak to GLEIF anchor

A 9-member ICIJ cluster, 100% GLEIF-anchored, walked from source rows through GoldenMatch dedupe to a finished provenance report.

goldenmatch, entity-resolution, icij, gleif, case-study

The companion engineering post walks through the pipeline that ingests ICIJ Offshore Leaks, GLEIF Golden Copy, OpenSanctions, and the UK PSC register into a single 4.1M-row company table, then dedupes and list-matches it on a Railway service. This post is the proof that the pipeline works on real data: one cluster, end to end, from the raw ICIJ rows to a finished provenance report.

I picked Phoenix Spree Deutschland for this walkthrough because it's the cleanest non-trivial cluster the matcher produced. Nine ICIJ records, all in Jersey, all numbered variants of the same company name, and all nine match to GLEIF LEI records (eight at a perfect score, one at 0.987). Nothing controversial: Phoenix Spree Deutschland is a publicly traded investor in German residential property with a perfectly normal corporate structure. The cluster is a test of whether the pipeline reconstructs that structure from public-leak data alone.

Full case study with all 323 relationship edges: reports/case_studies/503264_phoenix_spree_deutschland.md. Notebook walkthrough: notebooks/01_case_study.ipynb.

The raw inputs

The cluster's nine ICIJ records, before any matching:

entity_uid	name	jurisdiction
`icij:82015255`	Phoenix Spree Deutschland I Limited	je
`icij:82015256`	Phoenix Spree Deutschland II Limited	je
`icij:82015257`	Phoenix Spree Deutschland III Limited	je
`icij:82015311`	Phoenix Spree Deutschland IV Limited	je
`icij:82015312`	Phoenix Spree Deutschland V Limited	je
`icij:82015313`	Phoenix Spree Deutschland VI Limited	je
`icij:82015339`	Phoenix Spree Deutschland VII Limited	je
`icij:82015340`	Phoenix Spree Deutschland IX Limited	je
`icij:82015408`	Phoenix Spree Deutschland Limited	je

These came from the Paradise Papers leak (the Appleby subset). Note the absence of VIII: the sequence jumps from VII to IX, so the numbering skipped somewhere between the original incorporation and the leak. The unnumbered "Phoenix Spree Deutschland Limited" likely sits at the top of the structure. That's the kind of inference you make from cluster shape, not from any single row.

Why this isn't trivial for the matcher

You might look at the table above and think "these are obviously the same company family — just regex out the Roman numeral." That works for this exact case. It fails the moment you point it at the next corpus, where the variants are Holdings I, Holdings (No 1), Holdings 1 Ltd, Holdings Ltd 1, and 1 Holdings Ltd. The matcher needs a representation that survives all of those.

GoldenMatch's token_sort comparator handles this well. It tokenizes the normalized name, sorts the tokens, and computes set similarity. The numerals are individual tokens; the sort puts them in canonical order; the set comparison rewards the shared-token overlap (phoenix, spree, deutschland, limited) regardless of where the numeral lands in the original string.
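To make the idea concrete, here is a minimal sketch of a sort-then-compare token similarity in plain Python. It is not GoldenMatch's implementation: the function names, the normalization rule, and the choice of Jaccard overlap as the set similarity are all assumptions made for illustration.

```python
import re


def tokens(name: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split, and sort the tokens."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    return sorted(cleaned.split())


def token_set_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two token sets (an assumed similarity measure)."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Token position stops mattering once names are compared as token sets.
order_free = token_set_similarity("Holdings 1 Ltd", "1 Holdings Ltd")

# Numbered siblings share four of their six distinct tokens.
siblings = token_set_similarity(
    "Phoenix Spree Deutschland I Limited",
    "Phoenix Spree Deutschland II Limited",
)
print(order_free, round(siblings, 3))
```

The raw Jaccard value for the two siblings is lower than the ≥ 0.92 scores reported for this cluster, so the production comparator evidently scores differently. The sketch only shows why the numeral's position in the string stops mattering.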

The intra-cluster same_as pairs after dedupe are dense — every member matches every other member at score ≥ 0.92:

left	right	cluster
`icij:82015255`	`icij:82015256`	503264
`icij:82015255`	`icij:82015257`	503264
`icij:82015255`	`icij:82015311`	503264
...36 pairs total

That density is what you want. A cluster where every member matches every other member is the matcher's way of saying "I am very confident this is one identity, not three pairs of identities that happen to overlap."
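The "36 pairs" figure is just the number of unordered pairs among nine members, and density can be checked mechanically. The sketch below assumes same_as pairs are available as unordered pairs of IDs; the helper names and the placeholder member IDs are hypothetical, not part of GoldenMatch.

```python
from itertools import combinations


def expected_pairs(n_members: int) -> int:
    """Number of unordered pairs in a fully connected cluster."""
    return n_members * (n_members - 1) // 2


def density(members: list[str], pairs: set[frozenset[str]]) -> float:
    """Share of possible member pairs that actually appear as same_as pairs."""
    possible = expected_pairs(len(members))
    if possible == 0:
        return 1.0
    member_set = set(members)
    found = sum(1 for p in pairs if len(p) == 2 and p <= member_set)
    return found / possible


members = [f"m{i}" for i in range(9)]  # placeholder IDs, not real entity_uids
full = {frozenset(p) for p in combinations(members, 2)}

print(expected_pairs(9))       # 36
print(density(members, full))  # 1.0
```

A density below 1.0 would mean some members are only connected through intermediaries, which is the weaker "overlapping pairs" situation described above.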

Anchoring to GLEIF

The dedupe step shows that the nine ICIJ records describe the same identity family. The list-match step against GLEIF checks that confidence against an external registry, and every ICIJ record links to a real LEI:

icij uid	gleif lei	gleif name	score	band
`icij:82015255`	`529900MQU3XI11P2FM74`	PHOENIX SPREE DEUTSCHLAND I LIMITED	1.000	perfect
`icij:82015256`	`529900Z5VU6N5X7FKD32`	PHOENIX SPREE DEUTSCHLAND II LIMITED	1.000	perfect
`icij:82015257`	`529900J6YOTOZ4XU6086`	PHOENIX SPREE DEUTSCHLAND III LIMITED	1.000	perfect
`icij:82015311`	`529900HE7LFLCQMZ7Y77`	PHOENIX SPREE DEUTSCHLAND IV LIMITED	1.000	perfect
`icij:82015312`	`529900MYUEKOW0KRP666`	PHOENIX SPREE DEUTSCHLAND V LIMITED	1.000	perfect
`icij:82015313`	`529900ZH4XA9K4B3EP79`	PHOENIX SPREE DEUTSCHLAND VII LIMITED	0.987	high
`icij:82015339`	`529900ZH4XA9K4B3EP79`	PHOENIX SPREE DEUTSCHLAND VII LIMITED	1.000	perfect
`icij:82015340`	`529900D1RM39KSHWIV43`	PHOENIX SPREE DEUTSCHLAND IX LIMITED	1.000	perfect
`icij:82015408`	`213800OR6IIJPG98AG39`	PHOENIX SPREE DEUTSCHLAND LIMITED	1.000	perfect

Eight perfect scores and one near-perfect (0.987). The near-perfect is the only interesting row: icij:82015313 is labelled Phoenix Spree Deutschland VI Limited in the ICIJ data but matched to the LEI registered for PHOENIX SPREE DEUTSCHLAND VII LIMITED, the same LEI that icij:82015339 (VII) matches at 1.000. So the nine matches land on eight distinct LEIs, and no GLEIF record for a VI entity was matched. That's either a numbering inconsistency between the leak source and the GLEIF registry, or a stale ICIJ snapshot that predates a rename. Either way, it's exactly the kind of finding the pipeline should surface: a confident-but-not-perfect anchor that flags itself for human review.
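A review pass over the match table can be sketched in a few lines. This is an illustrative check, not a GoldenMatch feature: the function name and the two flag rules (score below 1.000, LEI shared by more than one ICIJ record) are assumptions. The rows are copied from the table above.

```python
from collections import defaultdict

# (icij uid, gleif lei, score), copied from the match table in this post.
matches = [
    ("icij:82015255", "529900MQU3XI11P2FM74", 1.000),
    ("icij:82015256", "529900Z5VU6N5X7FKD32", 1.000),
    ("icij:82015257", "529900J6YOTOZ4XU6086", 1.000),
    ("icij:82015311", "529900HE7LFLCQMZ7Y77", 1.000),
    ("icij:82015312", "529900MYUEKOW0KRP666", 1.000),
    ("icij:82015313", "529900ZH4XA9K4B3EP79", 0.987),
    ("icij:82015339", "529900ZH4XA9K4B3EP79", 1.000),
    ("icij:82015340", "529900D1RM39KSHWIV43", 1.000),
    ("icij:82015408", "213800OR6IIJPG98AG39", 1.000),
]


def review_flags(rows, perfect=1.0):
    """Map each uid needing human review to the reasons it was flagged."""
    by_lei = defaultdict(list)
    for uid, lei, _score in rows:
        by_lei[lei].append(uid)

    flags = {}
    for uid, lei, score in rows:
        reasons = []
        if score < perfect:
            reasons.append("score below 1.000")
        if len(by_lei[lei]) > 1:
            reasons.append("LEI shared with another record")
        if reasons:
            flags[uid] = reasons
    return flags


for uid, reasons in review_flags(matches).items():
    print(uid, reasons)
```

On these nine rows the check flags icij:82015313 for both reasons and icij:82015339 for sharing an LEI, which is the VI/VII collision discussed above.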

Note on "100% anchored". Every cluster member found a GLEIF match. That is the strongest evidence available from public sources that these are real, registered legal entities, not paper-only shells. LEIs are issued (by GLEIF-accredited issuers) only to entities that complete a documented self-registration process; presence in GLEIF is a positive signal of legitimacy.

The relationship graph

The 9 cluster members are not isolated in the source data. ICIJ records 323 relationship edges incident to the cluster — directors, secretaries, powers of attorney, and so on. Two structural patterns dominate:

Shared officers. Several human directors appear across multiple cluster members at overlapping dates. For example, icij:80061377 is a director of icij:82015255 from 02-APR-2007 onward and a director of icij:82015257 from the same date. That's not surprising — corporate families typically have shared board members — but it's the structural evidence that ties the cluster together independent of the name similarity.

Shared powers of attorney. A pool of about a dozen named individuals holds time-bounded powers of attorney across multiple cluster members. The dates align with what you'd expect for routine corporate administration through a service provider (Appleby in this case). Most POA windows are one year, renewed.

No address-sharing surprises. The cluster does not surface an unusual registered-address fanout. That matters because the address-cluster pass is one of the strongest signals for shell-company structures — Portcullis TrustNet Chambers in the same corpus hosts 33,858 distinct entities. Phoenix Spree's address pattern looks like an ordinary fund structure, not a mass-registration address.
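The shared-officer pattern described above can be sketched as a simple grouping over relationship edges. The tuple layout and function name are hypothetical, and only the two director edges quoted in this post are included; a real run would load all 323 edges from the case study.

```python
from collections import defaultdict

# (officer uid, role, entity uid): the two director edges quoted above.
edges = [
    ("icij:80061377", "director", "icij:82015255"),
    ("icij:80061377", "director", "icij:82015257"),
]

cluster_members = {"icij:82015255", "icij:82015257"}  # subset, for illustration


def shared_officers(edges, cluster_members, min_entities=2):
    """Return officers linked to at least `min_entities` cluster members."""
    seen = defaultdict(set)
    for officer, _role, entity in edges:
        if entity in cluster_members:
            seen[officer].add(entity)
    return {
        officer: sorted(entities)
        for officer, entities in seen.items()
        if len(entities) >= min_entities
    }


print(shared_officers(edges, cluster_members))
```

The same grouping keyed on registered address instead of officer would give the address fanout count used to spot mass-registration addresses.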

What the pipeline actually proved

Three things, in order of confidence:

Identity. The nine ICIJ records describe members of one corporate family, and each one is anchored to an LEI in GLEIF (eight distinct LEIs, because the VI and VII records resolve to the same one). This is the strongest claim and the one the matcher is best at.
Structure. The relationship graph shows the family has a consistent service-provider footprint (Appleby) and a stable shared-officer core. This is structural inference from edge patterns, not from any single record.
Routine character. Nothing about the cluster's address pattern, officer pattern, or GLEIF anchoring profile suggests anomaly. The corpus contains many clusters that do look anomalous; this one doesn't.

The third claim is the easiest to misread. "Nothing looks anomalous" is not the same as "this is definitely fine." It means the pipeline didn't surface a signal worth chasing, given the public data we have. Public data is incomplete by definition.

The reproduction recipe

If you want to run this end to end against your own ICIJ + GLEIF inputs:

# 1. Build the unified company table
uv run python scripts/build_candidate_tables.py

# 2. Drop placeholder names + mega-blocks
uv run python scripts/filter_company_table.py

# 3. Run dedupe with the tuned config
uv run python scripts/run_goldenmatch_full.py --what company

# 4. Extract GLEIF as a reference table
uv run python scripts/extract_gleif_unified.py

# 5. List-match deduped entities against GLEIF
gm match-against \
  --target processed/company_entities.parquet \
  --against processed/gleif_unified.parquet \
  --run-name icij_os_vs_gleif

# 6. Publish to Postgres for query/notebooks
uv run python scripts/publish.py --what company

The cluster ID for Phoenix Spree (503264 in my run) won't be the same in yours — cluster IDs are assigned by GoldenMatch and depend on input order. The notebook walks through finding it by name search rather than by ID.
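Looking a cluster up by name rather than by ID can be as simple as the sketch below. It is not the notebook's code: the row layout (entity_uid, name, cluster_id) and the function name are assumptions, and the third row is made up to show a non-match.

```python
def find_clusters_by_name(rows, needle):
    """Return {cluster_id: [entity_uid, ...]} for names containing `needle`."""
    needle = needle.casefold()
    hits = {}
    for row in rows:
        if needle in row["name"].casefold():
            hits.setdefault(row["cluster_id"], []).append(row["entity_uid"])
    return hits


rows = [
    {"entity_uid": "icij:82015255",
     "name": "Phoenix Spree Deutschland I Limited", "cluster_id": 503264},
    {"entity_uid": "icij:82015408",
     "name": "Phoenix Spree Deutschland Limited", "cluster_id": 503264},
    # Made-up row, present only to show that unrelated names are skipped.
    {"entity_uid": "example:1",
     "name": "Unrelated Holdings Ltd", "cluster_id": 1},
]

print(find_clusters_by_name(rows, "phoenix spree"))
```

Because the lookup is by name, it keeps working when a rerun assigns the cluster a different ID.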

Key takeaways

A dense intra-cluster same_as graph (every member matched to every other) is a much stronger confidence signal than a few high-scoring pairs.
List-matching against GLEIF is the right anchor step for European corporate records. Where an LEI exists, the match is usually 1.000 and unambiguous.
"100% GLEIF-anchored" is a positive provenance signal, not a sign of impropriety. Treat it as evidence that the cluster's members are real registered entities, not artefacts of the leak.
The numbered-variant pattern (Holdings I, Holdings II...) is exactly what token_sort is good at. Default jaro_winkler would over-cluster these on the shared prefix.

What's next

The third post in the series is the messier counterpart: 28 hand-curated seeds from public reporting about Jeffrey Epstein's corporate network, and what the matcher found (and didn't find) in the public-leak corpus. One corroborated lead, several dead ends, and a structural gap (the USVI registry isn't in any public dataset I had).

Full repo: benseverndev-oss/goldenmatch-shell-company-network. Try the matcher yourself: pip install goldenmatch.
