2026-04-09/Ben Severn

Wallet Attribution at Scale: ER on 13M Blockchain Records

Running entity resolution across 10 public blockchain attribution datasets surfaces cross-jurisdictional sanctions and universal infrastructure patterns.

entity-resolutiongoldenmatchblockchainsanctionspython

Every public blockchain attribution dataset is a partial, opinionated view of the same underlying reality. OFAC publishes ~800 sanctioned crypto wallets. Etherscan crowdsources ~50,000 tags across seven EVM chains. Sourcify holds ~14 million verified contract deployments. Forta tracks known-malicious contracts. DeFiLlama catalogs protocol addresses. Israel's Ministry of Defense and the FBI's Lazarus unit each publish their own targeted lists. None of them agree, none of them are complete, and almost none of them talk to each other.

I wanted to know what happens if you reconcile all of them. Not with a custom schema, not with hand-written joins — with a single entity resolution pipeline pointed at every public source I could find. The answer is 13,147,920 input rows, 30,958 multi-source clusters, and three findings I could not have produced at smaller scale.

The ten sources

I pulled every freely redistributable blockchain attribution dataset I could verify:

#SourceRowsWhat it covers
1OFAC SDN Enhanced XML788US Treasury sanctioned wallets, 18 chains
2brianleect/etherscan-labels52,773Crowdsourced Etherscan tags, 7 EVM chains
3dawsbot/eth-labels17,495Curated Ethereum mainnet categories
4Sourcify parquet exports13,062,088Verified contract deployments, all chains
5Forta labelled-datasets7,480Known malicious contracts + phishing
6DeFiLlama protocols3,332Protocol contract addresses
7ScamSniffer blacklist2,530Reported scam addresses
8ethereum-lists717Dark/light address lists
9OpenSanctions: il_mod_crypto684Israel MoD sanctioned wallets
10OpenSanctions: us_fbi_lazarus_crypto33FBI Lazarus Group wallets
Total13,147,920

Sourcify dominates by two orders of magnitude. Everything else is a long tail of curated, opinionated, high-signal labels. That asymmetry shapes the whole story: Sourcify tells you what addresses exist, the other nine tell you what they mean, and entity resolution is what turns one into the other.

The schema

Every source projects to five columns:

COMMON_COLS = ["address_norm", "address_raw", "entity_name", "label", "source"]

No fuzzy matching on names. Names disagree too often to be a primary signal — that's actually the central finding. The only reliable join is address_norm.

Running GoldenMatch at 535k

I started at a sensible scale: five sources plus Sourcify's Ethereum mainnet subset, 535,336 rows, and a direct call to goldenmatch.dedupe:

import goldenmatch as gm

result = gm.dedupe(
    str(STAGED / "01_ofac.csv"),
    str(STAGED / "02_etherscan_labels.csv"),
    str(STAGED / "03_eth_labels.csv"),
    str(STAGED / "04_ethereum_lists.csv"),
    str(STAGED / "05_defillama.csv"),
    str(STAGED / "06_sourcify.csv"),
    exact=["address_norm"],
    blocking=["address_norm"],
)

This finished in about 40 seconds on a Windows laptop, found 12,640 multi-member clusters, and auto-fixed 51 data quality issues in the raw public sources (smart quotes, invisible characters, stray whitespace) before matching. GoldenCheck's quality scanner is bundled into the dedupe call — you don't ask for it, it just happens.

The results at 535k surfaced the best single anecdote in the whole exercise:

AddressOFAC nameEtherscan name
0x098B716B8Aaf21512996dC57EB0615e2383E2f96LAZARUS GROUPRonin Bridge Exploiter
0x5f48c2a71b2cc96e3f0ccae4e39318ff0dc375b2SEMENOV, RomanTornado.Cash: Team 1 Vesting

The first row is the Axie Infinity Ronin Bridge wallet — the address behind the $625 million Lazarus Group hack, labeled by OFAC as "LAZARUS GROUP" and by Etherscan as "Ronin Bridge Exploiter." Two correct names, completely unrelated strings. A name-based join finds nothing. An address-normalized join finds the link instantly. The second row ties a sanctioned Tornado Cash co-founder to a specific named vesting contract. If you took only one thing from this post, take that: names disagree, addresses don't.

Scaling to 13 million

The 535k run validated the pipeline. I wanted to know what happened at the real ceiling of free public data. That meant pulling all 14 Sourcify deployment parquets (one per million contracts, ~2 GB total) covering every chain Sourcify tracks — not just Ethereum mainnet.

# fetch_all_sourcify.py — parallel download of 14 parquets
with ThreadPoolExecutor(max_workers=4) as ex:
    for fut in as_completed([ex.submit(fetch_parquet, f) for f in files]):
        fut.result()

And then the staging step streams them directly into the common schema via Polars without ever materializing the full frame:

for pq in parquets:
    df = pl.read_parquet(pq, columns=["chain_id", "address"])
    out = df.select([
        pl.col("address").bin.encode("hex").alias("address_norm"),
        ("0x" + pl.col("address").bin.encode("hex")).alias("address_raw"),
        pl.lit("").alias("entity_name"),
        ("sourcify:chain_" + pl.col("chain_id").cast(pl.Utf8)).alias("label"),
        pl.lit("sourcify").alias("source"),
    ])
    out.write_csv(f, include_header=False)

After staging, the full dataset is 13,147,920 rows across 10 sources.

The honest caveat

At 13M rows, calling goldenmatch.dedupe crashes at the cluster-materialization step with a MemoryError in the Python dict build-out. That's not a GoldenMatch bug — it's pure Python object overhead on 12 million unique cluster keys. Since the full pipeline was already reducing to exact-match-on-address_norm blocking (names disagree too much to fuzzy on), the operation is mathematically equivalent to a polars groupby. I wrote that directly:

all_df = pl.concat([pl.scan_csv(p) for p in STAGED.glob("*.csv")]).collect()

clusters = (
    all_df.group_by("address_norm")
    .agg([
        pl.col("source").unique().alias("sources"),
        pl.col("source").n_unique().alias("n_sources"),
        pl.col("source").len().alias("size"),
        pl.col("entity_name").filter(pl.col("entity_name") != "").alias("entities"),
        pl.col("label"),
    ])
)

Same logical result, fits in memory, runs in about 30 seconds. The 535k run proves the ER pipeline works end-to-end with GoldenMatch's full feature set (fuzzy scorers, blocking strategies, lineage, golden records). The 13M run uses GoldenMatch's auto-config decisions as the template but delegates the exact-match groupby to Polars because Python dicts are the wrong data structure at that volume. I want to be upfront about that — the scale-up is not an endorsement of "GoldenMatch scales to 13M natively," it's an endorsement of "GoldenMatch chose the right blocking strategy at 535k, and that strategy is trivially reproducible at 13M in a columnar engine."

What the 13M run surfaced

1. Nine wallets cross-sanctioned by two governments

This is the headline finding and only possible because I had two independent sanctions sources. Nine crypto wallets appear on both the US Treasury OFAC list and Israel's MoD sanctioned crypto list:

WalletEntityChain
TCzq6m2zxnQkrZrf8cqYcK6bbXQYAfWYKCZEDCEX EXCHANGE LTDTron
TGsNFrgWfbGN2gX25Wcf8oTejtxtQkvmExZEDCEX EXCHANGE LTDTron
TTS9o5KkpGgH8cK9LofLmMAPYb5zfQvSNaZEDCEX EXCHANGE LTDTron
TNuA5CQ6LB4jTHoNrjEeQZJmcmhQuHMbQ7ZEDCEX EXCHANGE LTDTron
TLvuvpfBKdxddxSsJefeiGCe9eVY8HUroEZEDCEX EXCHANGE LTDTron
TWBAPzpPiZarfVsY2BLXeaLhNHurn4wkWGAL-LAW, Tawfiq Muhammad Sa'idTron
0x175d44451403Edf28469dF03A9280c1197ADb92cGAZA NOWEthereum
0x21B8d56BDA776bbE68655A16895afd96F5534feDGAZA NOWEthereum
19D1iGzDr7FyAdiy3ZZdxMd6ttHj1kj6WWBUY CASH MONEY AND MONEY TRANSFER COBitcoin

The ZEDCEX cluster is the standout: five wallets on a single Tron-based exchange independently sanctioned by both the United States and Israel. GAZA NOW contributes two cross-confirmed Ethereum wallets. These are the highest-confidence sanctioned wallets in the entire dataset — not because any individual list is more authoritative, but because two independent government entity resolution processes landed on the same on-chain identities.

You cannot find this with OFAC alone. You cannot find it with Israel's list alone. You find it only when you reconcile them.

2. The largest clusters are universal infrastructure

At multi-chain scale, the top multi-source clusters by member count are all deterministic-deployment contracts:

SizeAddressWhat it is
450x7cbb62eaa69f79e6873cd1ecb2392971036cfaa4Safe: Create Call 1.3.0
430x40a2accbd92bca938b02010e17a5b8929b49130dSafe: Multi Send Call Only 1.3.0
420xa6b71e26c5e0845f74c812102ca7114b6a896ab2Safe: Proxy Factory 1.3.0
400x3e5c63644e683549055b9be8653de26e0b4cd36eSafe: Singleton L2 1.3.0
390xd9db270c1b5e3bd161e8c8503c55ceabee709552Safe: Singleton 1.3.0
370xf48f2b2d2a534e402487b3ee7c18c33aec0fe5e4Safe: Compatibility Fallback Handler 1.3.0
300x000000000022d473030f116ddee9f6b43ac78ba3Uniswap Permit2
270x66a71dcef29a0ffbdbe3c6a460a3b5bc225cd675LayerZero Ethereum Endpoint

Safe (formerly Gnosis Safe) uses CREATE2 with chain-independent salts, which means the same singleton contract ends up at the same address on every EVM chain it's deployed to. So do Permit2 and LayerZero. The cluster size is literally a count of "how many chains is this deployed on?" — 45 chains for Create Call 1.3.0, 30 for Permit2.

That's a real finding about the structure of the modern EVM ecosystem. Entity resolution on multi-chain contract deployment data automatically surfaces the universal-infrastructure layer without anyone asking it to. If you're building an allowlist of "standard reusable contracts that are safe-by-reputation across every chain," this cluster table is a reasonable starting point. I did not go in looking for this — it just fell out of the data.

3. Attackers verify source code at the long-tail baseline rate

I also pulled Forta's labelled-datasets, which include 719 known-malicious Ethereum smart contracts and 569 phishing-scam contracts. The honest question: do attackers publish verified source code on Sourcify?

PopulationSizeVerified on Sourcify
Forta malicious contracts7193 (0.4%)
Forta phishing contracts5693 (0.5%)
ScamSniffer addresses2,5300 (0%)

This is where I have to resist the obvious headline. "Malicious contracts almost never verify source code" is technically true but misleading: Sourcify holds ~324k verified Ethereum mainnet contracts against an estimated ~70M+ total contracts ever deployed, which puts the baseline verification rate around 0.5%. Malicious contracts at 0.4% are statistically indistinguishable from that baseline.

The defensible framing is this: malicious contracts behave like the long tail of random/abandoned/spam contracts, not like production contracts. Mainstream DeFi protocols verify at rates well above 50%. Attackers don't. For the purposes of attribution, "is this a Sourcify-verified contract?" on its own is a weak filter — but "verified Sourcify contract AND has Etherscan tags AND appears in eth-labels AND DeFiLlama" is an extremely strong legitimacy signal. The 301 quadruple-confirmed clusters at 13M scale are the set of contracts that every independent attribution observer agrees exist and matter.

The 3 verified malicious contracts are outliers worth manual investigation: two are Fake_Phishing tagged contracts that nonetheless published source (presumably to look legitimate to a casual reviewer), and one is a suspicious "TrueEUR" token.

What else is in the data

A few secondary findings that didn't need their own section:

Deployer-reuse patterns in the malicious-contracts dataset. The Forta dataset records contract_creator for each malicious contract. Grouping by creator surfaces a heavily skewed distribution: one address deployed 15 Fake_Phishing contracts, another deployed 11, and the original bZx Exploiter 1 wallet deployed 9 distinct exploit contracts — all tied back to the 2020 bZx flash-loan attack. Twelve deployer addresses are responsible for roughly 15% of the entire labeled malicious-contract corpus. Watching deployers is dramatically more efficient than watching deployments, and the clustering falls out of ER trivially.

OFAC's internal duplicates. The SUEX OTC wallets appear in OFAC's own list twice — once under the XBT:CYBER2 program and once under USDT:CYBER2, because Treasury sanctioned the same Bitcoin address for Bitcoin activity and the Tron-USDT it bridged through. Without ER you'd treat them as two distinct records; with ER the internal duplication is obvious.

Cross-source pattern distribution. At 13M the dominant multi-source pattern is etherscan-labels + sourcify (12,472 clusters) — verified contracts that are also tagged. Then eth-labels + forta (5,146) — curated DeFi labels overlapping with malicious flags. Then the triple-confirmed eth-labels + etherscan-labels + sourcify (5,034). The full distribution is in output_15m/report.json in the companion repo.

Honest limitations

I want to be precise about what this analysis is and isn't:

Takeaways

Reproduce it

Everything in this post is in a public repo: benzsevern/goldenmatch-wallet-attribution. The full flow is four commands:

python fetch_public_data.py    # ~2.5 GB download, ~10 min
python extract_ofac.py         # parse SDN_ENHANCED.xml
python run_15m.py              # stage 10 sources to common schema
python analyze_15m.py          # cross-source cluster analysis

All ten data sources are permissively licensed and redistributable. No API keys. No auth. The ~13M-row analysis finishes in about 3 minutes of wall-clock time on a laptop once the data is local.

If you want to see what GoldenMatch looks like with its full feature set — fuzzy scoring, blocking strategies, lineage tracking, golden records — the earlier archive/run_clusters.py runs it on the 535k-row subset end-to-end. That's the run that surfaced the Lazarus / Ronin Bridge case. Both scripts are preserved because they're answering different questions: does the ER pipeline work? (yes, 535k with GoldenMatch) and what falls out at the public-data ceiling? (the 13M run above).

Install GoldenMatch: pip install goldenmatch. Star the repo: benzsevern/goldenmatch. Try the live playground: bensevern.dev/playground.

Reproducibility footer.