Why Knowledge Graphs Live or Die on Entity Resolution
A knowledge graph is only as good as its entities. Why bad entity resolution wrecks KG quality and cost, and how GoldenMatch solves the node layer.
Picture two people named John Smith. One signed a contract with Acme in 2019, the other joined a competitor in 2023. Your knowledge graph collapses them into a single node. Now the graph asserts — as fact, with a clean typed edge — that one person did both. Feed that graph to an LLM and it will tell you so, confidently, with a citation. The citation points at your own data.
That is the failure mode nobody demos. Entity resolution — deciding which records refer to the same real-world thing — is the quietest, most load-bearing layer of any knowledge graph. Get it right and the graph reasons. Get it wrong and you have built a very expensive machine for laundering bad data into confident answers.
This post walks the whole chain: what a knowledge graph is, why it suddenly matters again, how graphs actually get built (and where the duplicates sneak in), exactly how bad entity resolution corrupts a graph in both quality and cost, and how GoldenMatch approaches the node layer — with real numbers from a benchmark that has ground truth.
What a knowledge graph actually is
Strip the marketing and a knowledge graph is three things:
- Nodes — entities. A person, a company, a drug, an account. Each node is supposed to map to exactly one real-world thing.
- Edges — typed, directed relationships between nodes. (John Smith) —[SIGNED]→ (Acme Contract 4471).
- A schema (ontology) — the allowed node types and edge types, so the graph means something consistent.
The unit of value is the triple: subject, predicate, object. (Acme) —[ACQUIRED]→ (Beta Corp). Stack millions of those and you get something a row-and-column table can't give you: meaning that you can traverse.
The critical word in the node definition is exactly one. A relational table tolerates five rows for the same customer — you GROUP BY later and move on. A graph does not give you that grace. Five nodes for one customer is five different "people" as far as every traversal, every embedding, and every LLM reading the graph is concerned. The graph's correctness is downstream of one decision made millions of times: is this the same entity or not?
That decision is entity resolution. The graph doesn't make it. Something upstream has to.
Why knowledge graphs are valuable
Tables answer "what." Graphs answer "how is this connected to that," which is where most real questions actually live:
- Customer 360 — one person scattered across HubSpot, Salesforce, Stripe, and a support tool becomes one node with every interaction hanging off it.
- Fraud rings — you don't catch fraud by looking at one account. You catch it by finding the shared phone number three "different" applicants quietly have in common.
- Drug discovery and biomedical research — proteins, compounds, diseases, and trials as a traversable web of relationships.
- Multi-hop reasoning — "which of our customers are also vendors to a company we just flagged?" is two hops in a graph and a migraine in SQL.
The payoff is explainable retrieval. A graph can show its work: here is the path, here are the edges, here is why the answer is the answer. That property is exactly why the AI world came back for knowledge graphs.
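To make "show its work" concrete, here is a minimal sketch of path-based retrieval over a handful of triples. The triples, the entity names, and the `find_path` helper are all made up for illustration; a real graph store would do this with its own query language.

```python
from collections import deque

# Hypothetical triples (subject, predicate, object); names are illustrative only.
TRIPLES = [
    ("Acme", "CUSTOMER_OF", "Us"),
    ("Acme", "VENDOR_TO", "Beta Corp"),
    ("Gamma", "CUSTOMER_OF", "Us"),
    ("Beta Corp", "FLAGGED_BY", "Compliance"),
]

def find_path(triples, start, goal):
    """Breadth-first search that returns the list of edges linking start to goal."""
    # Index every edge from both ends so traversal can walk in either direction.
    neighbors = {}
    for s, p, o in triples:
        neighbors.setdefault(s, []).append((o, (s, p, o)))
        neighbors.setdefault(o, []).append((s, (s, p, o)))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, edge in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [edge]))
    return None  # no connection in the graph

# The answer comes back as the chain of edges that justifies it.
path = find_path(TRIPLES, "Us", "Compliance")
for edge in path:
    print(edge)
```

The returned path is the explanation. It is also only as trustworthy as the nodes it passes through: if "Acme" were two different companies merged into one node, the same code would return the same confident path.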
Why they're suddenly everywhere again
Knowledge graphs are not new: Google popularized the term in 2012 when it launched its Knowledge Graph. What changed is that large language models gave them a killer app, and the timing was 2024 through 2025.
The first wave of retrieval-augmented generation was pure vector search: chunk your documents, embed them, retrieve the nearest fuzzy neighbors, stuff them in the prompt. It works until the question needs connected facts rather than similar text. "Summarize everything related to this customer across all our systems" is not a similarity query — it's a graph traversal. Vector RAG retrieves passages that look alike; it cannot tell you that two passages are about the same person under two different spellings.
Graph-augmented retrieval — popularized by Microsoft's GraphRAG and a wave of follow-ons — closes that gap. Retrieve a node, walk its edges, hand the model precise, connected, deduplicated facts instead of a pile of overlapping chunks.
| | Vector RAG | Graph RAG |
|---|---|---|
| Retrieval unit | text chunk | entity + its edges |
| Finds | semantically similar passages | connected facts |
| Multi-hop questions | weak (each hop is a new fuzzy search) | native (traverse the edges) |
| Source dedup | none — returns overlapping chunks | one node per entity (if ER is good) |
| Explainability | "these chunks scored near your query" | "this path connects A to B" |
| Primary failure mode | misses the connection | poisoned by bad entity resolution |
Look at the last row. Vector RAG's weakness is omission — it just doesn't find the link. Graph RAG's weakness is commission — it confidently serves a link that your entity resolution invented. The hype moved the bottleneck onto the node layer; it didn't remove it.
How knowledge graphs get built — and where the duplicates come from
A graph gets populated two ways, and both manufacture duplicates.
Structured ingestion pulls rows from systems you already own — CRMs, warehouses, billing. The duplicates here are the classic kind: the same customer in HubSpot and Salesforce, the same vendor entered twice with a typo. Familiar, and exactly what entity resolution was built for.
Unstructured extraction is the new firehose, and it's the one the LLM era cranked wide open. You point a model at ten thousand documents — contracts, news, support tickets, filings — and ask it to emit entities and relationships. It's astonishingly good at the extraction. It is, by design, terrible at canonicalization. Read enough text and the same company comes back as:
IBM
I.B.M.
International Business Machines
IBM Corp.
International Business Machines Corporation
Five strings, one company. The extraction step has no memory across documents — each mention is judged on its own, so each surface form becomes a candidate node. Nothing in the LLM's extraction pass knows these are the same entity. The model made it trivial to produce mentions and did nothing to reconcile them.
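As a rough illustration of what reconciliation has to do, the sketch below collapses those five surface forms to one crude key. The suffix list and the initials heuristic are assumptions chosen for this example, not how any particular engine canonicalizes names.

```python
import re

# Illustrative suffix list; a real deployment would need a far larger one.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "incorporated", "ltd", "llc", "co"}

def naive_company_key(name: str) -> str:
    """Collapse a company surface form to a crude comparison key."""
    # Lowercase and drop punctuation so "I.B.M." becomes "ibm".
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    if len(tokens) > 1:
        # Multi-word names fall back to their initials.
        return "".join(t[0] for t in tokens)
    return tokens[0] if tokens else ""

mentions = [
    "IBM",
    "I.B.M.",
    "International Business Machines",
    "IBM Corp.",
    "International Business Machines Corporation",
]
keys = {m: naive_company_key(m) for m in mentions}
```

All five mentions land on the same key, but so would an unrelated company whose initials happen to be I, B, and M. A key like this is useful for grouping candidates and unsafe as a merge decision on its own.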
So the real knowledge-graph build pipeline has three stages, not two:
- Extract — pull entities and relations from structured rows or unstructured text.
- Resolve — collapse all the mentions and records that refer to the same real-world thing into one canonical entity. This is entity resolution.
- Construct — write the resolved entities as nodes and the relations as edges.
Skip stage 2 and you don't get a knowledge graph. You get a mention graph — one node per spelling, with the real connections smeared across the duplicates. Every serious GraphRAG implementation names this stage explicitly ("entity resolution," "entity disambiguation," "canonicalization") because without it the graph doesn't cohere. And here's the part people miss: making extraction easy with an LLM makes the resolution problem bigger, not smaller. More documents, more surface forms, more duplicates to reconcile.
Why bad entity resolution wrecks a knowledge graph
Bad resolution fails in two directions, and both are expensive.
Over-merging glues distinct entities into one node. This is the John-Smith case, and it is the more dangerous failure because it is invisible. The graph looks healthier — fewer nodes, denser edges, tidy. But every fact from entity A is now asserted about entity B. Edges that never existed in reality now exist in the graph. An LLM reading it has no way to know; merged is merged. You have manufactured false relationships and given them the authority of structured data.
Under-merging does the opposite: one real entity fragments across many nodes. Now the multi-hop path that should connect two facts is broken, because the connecting entity exists three times and none of the copies hold the full picture. The honest-looking answer becomes "not in the graph" — when it absolutely is in the graph, just smeared across duplicates.
A worked example
Here are three records, the kind that land in a multi-source pool:
| Record (source) | Name | Email | Phone | Company |
|---|---|---|---|---|
| r1 (HubSpot) | John Smith | j.smith@acme.com | 555-0142 | Acme Corp |
| r2 (Salesforce) | John Smith | john.smith@beta.io | 555-0142 | Beta Industries |
| r3 (Mailchimp) | J. Smith | j.smith@acme.com | — | Acme |
The truth: r1 and r3 are the same person (Acme John — same email, same employer). r2 is a different John Smith at Beta who happens to share a phone number, because 555-0142 is a reused office line. The correct graph has two person nodes.
Now resolve it badly — key on exact phone, treat a phone match as strong evidence. r1 and r2 share 555-0142, so they merge. r3 joins on email. All three collapse into one node, and the graph dutifully writes both employment edges:
(John Smith) —[WORKS_AT]→ (Acme Corp)
(John Smith) —[WORKS_AT]→ (Beta Industries)
Ask the graph the question it exists to answer — "find anyone connected to both Acme and a competitor" — and it surfaces this phantom John Smith as a hit. Someone acts on a relationship that was never real. That single bad merge didn't just add a duplicate; it injected a false fact into a system whose entire value proposition is that its facts are trustworthy.
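The bad merge is easy to reproduce. The sketch below uses a hypothetical `cluster_on` helper that merges any records sharing a non-empty value in the chosen fields; it is a toy stand-in for a naive matcher, not GoldenMatch's logic.

```python
RECORDS = {
    "r1": {"name": "John Smith", "email": "j.smith@acme.com", "phone": "555-0142"},
    "r2": {"name": "John Smith", "email": "john.smith@beta.io", "phone": "555-0142"},
    "r3": {"name": "J. Smith", "email": "j.smith@acme.com", "phone": None},
}

def cluster_on(records, fields):
    """Merge records that share a non-empty value in any of the given fields."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (field, value) -> first record id carrying that value
    for rid, rec in records.items():
        for field in fields:
            value = rec.get(field)
            if value is None:
                continue  # a missing value is never evidence of a match
            key = (field, value)
            if key in first_seen:
                parent[find(rid)] = find(first_seen[key])
            else:
                first_seen[key] = rid

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(g) for g in groups.values())

naive = cluster_on(RECORDS, ["phone", "email"])   # phone treated as merge evidence
careful = cluster_on(RECORDS, ["email"])          # phone ignored for merging
```

With phone as merge evidence, all three records collapse into one cluster. With email only, r1 and r3 merge and r2 stays separate, which matches the two-person truth described above.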
The same failure, at benchmark scale
We benchmark GoldenMatch against a synthetic-but-realistic multi-source CRM fixture — 467 records, 496 known-true match pairs across 180 real people, with the full mess of real data: nicknames, initials, typos, maiden and married surnames, work-versus-personal email, phone-format drift, company-suffix variants, missing fields. It has ground truth, so precision and recall are real, not vibes.
Run a naive zero-config matcher against it (the kind of "just dedupe it" call that looks reasonable in a notebook) and the autoconfig latches onto exact-phone and a source-id field as match keys. It over-merges exactly like the worked example, only across the whole 467-record pool:
| Approach (realistic multi-source CRM) | Precision | Recall | F1 |
|---|---|---|---|
| Bare zero-config matching | 0.07 | 0.96 | 0.13 |
A precision of 0.07 means 93% of the predicted matches are wrong. Recall looks gorgeous at 0.96 because when you merge nearly everything, you do technically catch almost all of the true pairs, along with a flood of false ones. Drop that output into a knowledge graph and you haven't built a graph; you've built a blender. Every over-merged node poisons every query that touches it.
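For readers who want to see how pairwise numbers like these are computed, here is a minimal sketch. The helper names and the toy clusters are assumptions for illustration; only the 0.07 and 0.96 inputs come from the table above.

```python
from itertools import combinations

def pairs_from_clusters(clusters):
    """Expand clusters into the set of unordered record pairs they imply."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pairwise_prf(predicted_pairs, true_pairs):
    """Precision, recall, and F1 over predicted versus ground-truth match pairs."""
    true_positives = len(predicted_pairs & true_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall, f1_from(precision, recall)

# Toy case: the truth is one real pair, the matcher merged everything.
truth = pairs_from_clusters([["a", "b"], ["c"], ["d"]])
predicted = pairs_from_clusters([["a", "b", "c", "d"]])
precision, recall, f1 = pairwise_prf(predicted, truth)

# The benchmark row: precision 0.07 and recall 0.96 give an F1 of about 0.13.
benchmark_f1 = f1_from(0.07, 0.96)
```

The toy case shows the same shape as the benchmark: merging everything finds every true pair, so recall is perfect, while precision collapses because five of the six predicted pairs are false.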
The part the quality conversation skips: cost
- Storage and embeddings. Every duplicate node is another row, another set of properties, and — in a GraphRAG setup — another embedding to compute, store, and index. Five copies of one company is five embeddings of the same thing, and a vector index that's five times larger than it needs to be at that node.
- Edges scale combinatorially. Duplicates don't just add nodes, they multiply edges. Three copies of a node that each accumulate their own relationships can produce well over 3x the edges, and graph traversal cost rises with edge count. Fan-out compounds at every hop.
- Token waste at query time. A retriever that pulls "all facts about Acme" and gets back three half-complete Acme nodes feeds the LLM redundant, conflicting context — more tokens, slower responses, worse answers. You pay for the duplicates on every single query, forever.
- Human cleanup later. Un-resolving a graph after the fact — splitting merged nodes, reconciling duplicates, tracing which edges were real — is dramatically more expensive than resolving the entities before they ever became nodes. Bad ER is technical debt that accrues interest in three currencies at once: dollars, latency, and trust.
Bad entity resolution doesn't add cost linearly. It compounds, because the graph's whole point is connection, and you've made the wrong things connect.
How GoldenMatch approaches the node layer
GoldenMatch is the entity-resolution engine — the part that decides what the nodes are before anything builds a graph on top. The pipeline is four stages:
- Blocking — group records that share a high-confidence key (normalized email, phone, tax id) so you never compare all N-squared pairs.
- Scoring — for each candidate pair, run field-aware scorers (name as token similarity, email exact, address token-sort, phone as normalized E.164) and combine them with Fellegi-Sunter probabilistic weights.
- Clustering — Union-Find over the pairs above threshold, so a chain of pairwise matches resolves into one coherent entity.
- Survivorship — pick the surviving value per field per cluster, producing one golden record per real-world thing.
The output of stage 3 is, quite literally, a graph. Internally, resolution emits explicit match edges — (src_record, dst_record, score) — and the entities are the connected components over those edges:
# the edge an entity-resolution run actually produces
Edge = tuple[str, str, float]  # (src_record_id, dst_record_id, match_score)

# entities = connected components over the retained edges
def compute_components(record_ids, edges):
    # Union-Find: every record lands in exactly one component;
    # records touched by no edge become their own singleton.
    ...
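The body of that stub is elided here. As a rough guide to what a Union-Find pass over match edges looks like, the following is a generic textbook sketch, not GoldenMatch's internal implementation. The `threshold` parameter is an assumption added to show how "retained" edges might be selected.

```python
Edge = tuple[str, str, float]  # (src_record_id, dst_record_id, match_score)

def compute_components(record_ids, edges, threshold=0.0):
    """Return sorted connected components over edges scoring at or above threshold."""
    parent = {rid: rid for rid in record_ids}
    size = {rid: 1 for rid in record_ids}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst, score in edges:
        if score < threshold:
            continue  # edge not retained
        a, b = find(src), find(dst)
        if a == b:
            continue  # already in the same component
        if size[a] < size[b]:
            a, b = b, a  # attach the smaller tree under the larger one
        parent[b] = a
        size[a] += size[b]

    components = {}
    for rid in record_ids:
        components.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in components.values())

edges = [("a", "b", 0.95), ("b", "c", 0.91), ("c", "d", 0.40)]
entities = compute_components(["a", "b", "c", "d", "e"], edges, threshold=0.8)
```

Records a, b, and c chain into one entity through two retained edges; d is cut off because its only edge scores below the threshold, and e, touched by no edge, becomes a singleton.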
Every edge carries its score and its provenance — which run produced it, which fields fired. That matters for a knowledge graph because it means the node-formation decision is itself auditable. You can ask why two records became one entity, instead of trusting a black box:
import goldenmatch as gm
# explain why a cluster merged, in plain language
reason = gm.explain_cluster_nl(cluster, df, matchkeys)
print(reason)
# -> "Merged on exact email + high name similarity (0.94);
# address agreed after standardization."
Blocking and scorer weighting decide everything at scale
Two of those four stages are where graphs are won or lost, and they're the two the naive approach gets wrong.
Blocking is not optional once the graph is big. A graph with 10 million nodes has on the order of fifty trillion possible pairs. You cannot score them all, and you shouldn't want to, because almost none are matches. Blocking groups records by a high-confidence key so you only compare the small fraction of pairs that could plausibly be the same entity. The blocking key is itself a quality lever: too loose and you drown in comparisons, too tight and you never put two true matches in the same block, so they can never merge. Good blocking is what makes resolution tractable and recall-preserving at the same time.
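The arithmetic and the mechanism are both short enough to sketch. The records, the `email_key` function, and the helper names below are hypothetical, and the example blocks on a single key where a production setup would typically combine several.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs_count(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def candidate_pairs(records, key_fn):
    """Only records that share a blocking key are ever compared."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        key = key_fn(rec)
        if key:  # records with no usable key form no candidate pairs here
            blocks[key].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def email_key(rec):
    """Normalized email as a blocking key (trim whitespace, lowercase)."""
    return (rec.get("email") or "").strip().lower()

records = {
    "r1": {"email": "A@x.com"},
    "r2": {"email": "a@x.com "},
    "r3": {"email": "b@y.com"},
    "r4": {"email": ""},
}

exhaustive = all_pairs_count(len(records))     # every possible comparison
blocked = candidate_pairs(records, email_key)  # comparisons that survive blocking
ten_million = all_pairs_count(10_000_000)      # roughly fifty trillion
```

Four records give six possible comparisons and one candidate pair after blocking. The record with an empty email shows the recall risk: it never enters a block, so under this single key it can never merge with anything.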
Scorer weighting is why the over-merge happened — and how it's fixed. The bare run collapsed because it treated an exact phone match as strong evidence. Fellegi-Sunter weights each field by how discriminating it actually is: it learns, per field, how often agreement happens among true matches versus by random chance. A shared rare surname is powerful evidence. A shared common phone format, or a phone number that turns out to be a reused office line, is weak. The fix for the 0.07 disaster is a curated, column-aware config: exclude the source-id and per-source identifier columns from matching, up-weight surname and email, and demote phone to a blocking-only signal — good enough to put records in the same candidate block, not strong enough to merge them on its own.
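The weighting idea can be sketched with the standard Fellegi-Sunter log-likelihood form. Every probability below is invented for illustration: in practice the m and u values are estimated from data, and nothing here reflects GoldenMatch's actual parameters or API.

```python
import math

# Hypothetical m/u probabilities, chosen only for illustration:
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "email":   {"m": 0.90, "u": 0.001},
    "surname": {"m": 0.95, "u": 0.01},
    "phone":   {"m": 0.90, "u": 0.20},  # shared office lines make chance agreement common
}

def agreement_weight(m, u):
    """Evidence for a match when the field agrees."""
    return math.log2(m / u)

def disagreement_weight(m, u):
    """Evidence against a match when the field disagrees (negative when m > u)."""
    return math.log2((1 - m) / (1 - u))

def pair_score(agreements):
    """Sum the field weights; agreements maps field name -> True or False."""
    total = 0.0
    for field, agrees in agreements.items():
        p = FIELD_PROBS[field]
        if agrees:
            total += agreement_weight(p["m"], p["u"])
        else:
            total += disagreement_weight(p["m"], p["u"])
    return total

phone_only = pair_score({"email": False, "surname": False, "phone": True})
email_and_surname = pair_score({"email": True, "surname": True, "phone": False})
```

Under these made-up numbers, a phone agreement is worth far less than a surname or email agreement, so a pair that agrees only on phone scores below zero while a pair agreeing on email and surname scores well above it. That is the behavior the exact-phone matcher was missing.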
This is also why "just embed every entity and cluster by cosine similarity" — the tempting LLM-era shortcut — underperforms on the node layer. Embeddings blur exactly the distinction resolution depends on. Two different people with similar names and roles embed close together (a recipe for over-merge), while the same person under a typo'd name or a maiden-versus-married surname can embed surprisingly far apart (a recipe for under-merge). Semantic similarity is a great blocking signal and a poor matching decision. The decision needs field-level, weighted, explainable evidence.
The result
Same engine, same data, real curated config instead of the bare default:
| Approach (realistic multi-source CRM) | Precision | Recall | F1 |
|---|---|---|---|
| Bare zero-config matching | 0.07 | 0.96 | 0.13 |
| Curated, column-aware config | high | — | 0.84 |
F1 climbs from 0.13 to 0.84. You cannot reach an F1 of 0.84 with a precision of 0.07 — so precision recovered by more than an order of magnitude, which is the entire ballgame for a knowledge graph. The nodes are now mostly right. On a clean single-source academic fixture (Febrl, 500 records) the same curated path scores 0.83, and the broader lesson is honest: the right configuration is data-shape-dependent. There is no single global "dedupe" setting that is correct for both a clean census extract and a five-system CRM swamp. Resolution is a modeling decision, and treating it like a one-liner is how graphs get poisoned.
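The "more than an order of magnitude" claim follows from the F1 formula alone. Since recall cannot exceed 1, a given F1 puts a floor under precision, and the sketch below works that floor out. The function name is ours; the only inputs are the 0.84 and 0.07 figures from the table.

```python
def min_precision_for_f1(f1):
    """Lowest precision consistent with a given F1.

    From F1 = 2PR / (P + R), solving for P gives P = F1 * R / (2R - F1).
    P falls as R rises, so the floor is reached at R = 1: P = F1 / (2 - F1).
    """
    return f1 / (2 - f1)

floor = min_precision_for_f1(0.84)   # about 0.72
improvement = floor / 0.07           # a little over 10x, at minimum
```

So even in the most generous case for recall, the curated run's precision is at least roughly 0.72, against 0.07 for the bare run.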
A few more things that matter once the graph gets big:
- Scale — native Rust kernels and distributed (Ray / DuckDB) backends, so the matcher keeps up when the graph grows past what fits in one machine's memory.
- Cross-org linkage without sharing PII — privacy-preserving record linkage (PPRL) lets two parties find their overlapping entities without either side handing over raw records, which is how you resolve nodes across organizational boundaries safely.
- A human in the loop for the ambiguous middle — the pairs the matcher isn't sure about don't get silently merged into the graph; they go to a review queue where a person decides, and that decision feeds back into the config.
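The review-queue idea in the last bullet amounts to a three-way decision instead of a two-way one. A minimal sketch, with hypothetical threshold values and a made-up `route` function rather than anything from the GoldenMatch API:

```python
def route(score, merge_at=0.90, review_at=0.60):
    """Three-way decision: auto-merge, send to a human, or leave unmatched.

    Both thresholds are placeholder values for illustration.
    """
    if score >= merge_at:
        return "merge"
    if score >= review_at:
        return "review"
    return "no_match"

scored_pairs = [("r1", "r3", 0.97), ("r1", "r2", 0.71), ("r2", "r3", 0.22)]
decisions = {(a, b): route(score) for a, b, score in scored_pairs}
review_queue = [pair for pair, decision in decisions.items() if decision == "review"]
```

Only the confident pair merges automatically. The ambiguous one waits for a person, so it never becomes a node-level fact in the graph without someone having looked at it.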
A note on honesty, because it's the whole point of measuring: those precision and recall numbers come from a fixture with ground truth. On real data without labels, you don't get to claim precision — you get a match rate, plus the tooling to evaluate it. The discipline is the same one a knowledge graph needs: know which of your nodes you've actually verified, and which you're trusting on faith.
Frequently asked questions
Is entity resolution the same as coreference resolution?
No, and a knowledge graph needs both. Coreference resolution links mentions within a single document — "Tim Cook... he... the CEO" all point to the same referent in that text. Entity resolution links records and mentions across the whole corpus or database to one canonical entity — the Tim Cook in this contract is the Tim Cook in that news article and the tcook@ row in your CRM. Coreference cleans up one document; entity resolution is what makes the graph one coherent thing across all of them.
If an LLM builds my graph, do I still need entity resolution?
More than ever. The LLM extracts entities; it does not canonicalize them across documents. Every surface form it emits — every "IBM" and "I.B.M." and "International Business Machines" — is a candidate node until something resolves them. Easier extraction means more mentions, which means a bigger resolution problem, not a smaller one. Extraction and resolution are different jobs; the model only does the first.
Isn't fuzzy string matching enough?
No. Fuzzy matching gets you candidate pairs — it doesn't decide clusters, weight evidence by how discriminating each field is, or handle the transitive chains where A matches B and B matches C, so A, B, and C are one entity. That chaining is what Union-Find resolves. And pure string similarity over-merges different people with similar names while missing the same person under a nickname or a maiden name. Fuzzy matching is one signal inside resolution, not a replacement for it.
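A quick way to see both failure directions is to score a few names with Python's standard-library `difflib`. The names are invented, and `SequenceMatcher` is a stand-in for whatever string metric you prefer; the pattern holds for most of them.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Character-level similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two different people whose names differ by a middle initial.
different_people = name_similarity("John A. Smith", "John B. Smith")

# One person recorded under her given name and her nickname.
same_person = name_similarity("Margaret Jones", "Peggy Jones")
```

The two different people score well above the one person under two names, so any single similarity cutoff gets at least one of these cases wrong. String similarity is still useful as one input; it just cannot be the decision.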
How do I know my resolution is actually good?
Measure it against ground truth where you have it — precision, recall, F1 on a labeled set, the way the numbers in this post were produced. Where you don't have labels, track the match rate, sample clusters for human review, and never report a match rate as if it were precision. Keep a person in the loop for the ambiguous middle, and feed those decisions back so the config improves. A graph you can't evaluate is a graph you're trusting on faith.
Key takeaways
- A knowledge graph's correctness is downstream of one decision made millions of times: is this the same entity? The graph doesn't make it — entity resolution does.
- The build pipeline is extract → resolve → construct. Skip resolution and you get a mention graph (one node per spelling), not a knowledge graph. LLM extraction makes that gap wider, because it produces surface forms faster than ever and canonicalizes none of them.
- Over-merging is the silent killer. It makes the graph look healthier while asserting relationships that never existed, and an LLM reading the graph will repeat them as fact.
- Bad resolution is expensive, not just wrong. Duplicates multiply nodes, edges, embeddings, traversal time, and query-time tokens — costs that compound because a graph is all about connection.
- Never resolve entities with a bare zero-config matcher. On realistic multi-source data it over-merged to 0.07 precision (93% wrong); a curated, column-aware config took the same engine to 0.84 F1. Blocking and weighted, explainable scoring are what get you there.
Try it
Entity resolution is the layer worth getting right before you build anything graph-shaped on top of it.
- Run the matcher on your own messy CSV in the playground — no signup.
- Install the engine: pip install goldenmatch (MIT-licensed on PyPI).
- See where resolution fits an end-to-end pipeline: Pipe Your SaaS Data to Your Warehouse.
- Read the concepts: the docs.