2026-06-19/Ben Severn

Why Knowledge Graphs Live or Die on Entity Resolution

A knowledge graph is only as good as its entities. Why bad entity resolution wrecks KG quality and cost, and how GoldenMatch solves the node layer.

Picture two people named John Smith. One signed a contract with Acme in 2019, the other joined a competitor in 2023. Your knowledge graph collapses them into a single node. Now the graph asserts — as fact, with a clean typed edge — that one person did both. Feed that graph to an LLM and it will tell you so, confidently, with a citation. The citation points at your own data.

That is the failure mode nobody demos. Entity resolution — deciding which records refer to the same real-world thing — is the quietest, most load-bearing layer of any knowledge graph. Get it right and the graph reasons. Get it wrong and you have built a very expensive machine for laundering bad data into confident answers.

This post walks the whole chain: what a knowledge graph is, why it suddenly matters again, how graphs actually get built (and where the duplicates sneak in), exactly how bad entity resolution corrupts a graph in both quality and cost, and how GoldenMatch approaches the node layer — with real numbers from a benchmark that has ground truth.

What a knowledge graph actually is

Strip the marketing and a knowledge graph is three things:

The unit of value is the triple: subject, predicate, object. (Acme) —[ACQUIRED]→ (Beta Corp). Stack millions of those and you get something a row-and-column table can't give you: meaning that you can traverse.
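
To make "meaning that you can traverse" concrete, here is a minimal sketch of triples in plain Python. Only the Acme and Beta Corp fact comes from the text above; the other two triples, and the `walk` helper, are made up for illustration.

```python
from collections import defaultdict

# Each fact is a (subject, predicate, object) triple.
# Only the first triple comes from the post; the rest are hypothetical.
triples = [
    ("Acme", "ACQUIRED", "Beta Corp"),
    ("Beta Corp", "EMPLOYS", "Jane Doe"),
    ("Jane Doe", "AUTHORED", "Patent 123"),
]

# Index outgoing edges by subject so each hop is one dictionary lookup.
outgoing = defaultdict(list)
for subject, predicate, obj in triples:
    outgoing[subject].append((predicate, obj))

def walk(start, max_hops):
    """Return every path of up to max_hops edges that begins at `start`."""
    paths = []
    frontier = [(start, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            for predicate, obj in outgoing[node]:
                new_path = path + [(node, predicate, obj)]
                paths.append(new_path)
                next_frontier.append((obj, new_path))
        frontier = next_frontier
    return paths

paths = walk("Acme", max_hops=3)
for path in paths:
    print(" -> ".join(f"({s}) -[{p}]-> ({o})" for s, p, o in path))
```

The traversal only works because "Beta Corp" is the same string in both triples. Spell it two ways and the second hop silently disappears.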

The critical word in the node definition is exactly one. A relational table tolerates five rows for the same customer — you GROUP BY later and move on. A graph does not give you that grace. Five nodes for one customer is five different "people" as far as every traversal, every embedding, and every LLM reading the graph is concerned. The graph's correctness is downstream of one decision made millions of times: is this the same entity or not?

That decision is entity resolution. The graph doesn't make it. Something upstream has to.

Why knowledge graphs are valuable

Tables answer "what." Graphs answer "how is this connected to that," which is where most real questions actually live:

The payoff is explainable retrieval. A graph can show its work: here is the path, here are the edges, here is why the answer is the answer. That property is exactly why the AI world came back for knowledge graphs.

Why they're suddenly everywhere again

Knowledge graphs are not new: Google popularized the term in 2012. What changed is that large language models gave them a killer app, and most of that shift landed in 2024 and 2025.

The first wave of retrieval-augmented generation was pure vector search: chunk your documents, embed them, retrieve the nearest fuzzy neighbors, stuff them in the prompt. It works until the question needs connected facts rather than similar text. "Summarize everything related to this customer across all our systems" is not a similarity query — it's a graph traversal. Vector RAG retrieves passages that look alike; it cannot tell you that two passages are about the same person under two different spellings.

Graph-augmented retrieval — popularized by Microsoft's GraphRAG and a wave of follow-ons — closes that gap. Retrieve a node, walk its edges, hand the model precise, connected, deduplicated facts instead of a pile of overlapping chunks.

|  | Vector RAG | Graph RAG |
| --- | --- | --- |
| Retrieval unit | text chunk | entity + its edges |
| Finds | semantically similar passages | connected facts |
| Multi-hop questions | weak (each hop is a new fuzzy search) | native (traverse the edges) |
| Source dedup | none (returns overlapping chunks) | one node per entity (if ER is good) |
| Explainability | "these chunks scored near your query" | "this path connects A to B" |
| Primary failure mode | misses the connection | poisoned by bad entity resolution |

Look at the last row. Vector RAG's weakness is omission — it just doesn't find the link. Graph RAG's weakness is commission — it confidently serves a link that your entity resolution invented. The hype moved the bottleneck onto the node layer; it didn't remove it.

How knowledge graphs get built — and where the duplicates come from

A graph gets populated two ways, and both manufacture duplicates.

Structured ingestion pulls rows from systems you already own — CRMs, warehouses, billing. The duplicates here are the classic kind: the same customer in HubSpot and Salesforce, the same vendor entered twice with a typo. Familiar, and exactly what entity resolution was built for.

Unstructured extraction is the new firehose, and it's the one the LLM era cranked wide open. You point a model at ten thousand documents — contracts, news, support tickets, filings — and ask it to emit entities and relationships. It's astonishingly good at the extraction. It is, by design, terrible at canonicalization. Read enough text and the same company comes back as:

IBM
I.B.M.
International Business Machines
IBM Corp.
International Business Machines Corporation

Five strings, one company. The extraction step has no memory across documents — each mention is judged on its own, so each surface form becomes a candidate node. Nothing in the LLM's extraction pass knows these are the same entity. The model made it trivial to produce mentions and did nothing to reconcile them.
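
A quick sketch shows how far plain string cleanup gets with those five mentions. The suffix list and the `normalize` helper are assumptions for illustration, not part of any particular extraction tool.

```python
import re

mentions = [
    "IBM",
    "I.B.M.",
    "International Business Machines",
    "IBM Corp.",
    "International Business Machines Corporation",
]

# Hypothetical suffix list, for illustration only.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "ltd", "llc"}

def normalize(name):
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    tokens = cleaned.split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

keys = {normalize(m) for m in mentions}
print(sorted(keys))
```

Normalization takes five surface forms down to two keys, `ibm` and `international business machines`. Linking the acronym to its expansion needs evidence that string cleanup alone does not provide, which is the job of the resolution stage.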

So the real knowledge-graph build pipeline has three stages, not two:

  1. Extract — pull entities and relations from structured rows or unstructured text.
  2. Resolve — collapse all the mentions and records that refer to the same real-world thing into one canonical entity. This is entity resolution.
  3. Construct — write the resolved entities as nodes and the relations as edges.

Skip stage 2 and you don't get a knowledge graph. You get a mention graph — one node per spelling, with the real connections smeared across the duplicates. Every serious GraphRAG implementation names this stage explicitly ("entity resolution," "entity disambiguation," "canonicalization") because without it the graph doesn't cohere. And here's the part people miss: making extraction easy with an LLM makes the resolution problem bigger, not smaller. More documents, more surface forms, more duplicates to reconcile.

Why bad entity resolution wrecks a knowledge graph

Bad resolution fails in two directions, and both are expensive.

Over-merging glues distinct entities into one node. This is the John-Smith case, and it is the more dangerous failure because it is invisible. The graph looks healthier — fewer nodes, denser edges, tidy. But every fact from entity A is now asserted about entity B. Edges that never existed in reality now exist in the graph. An LLM reading it has no way to know; merged is merged. You have manufactured false relationships and given them the authority of structured data.

Under-merging does the opposite: one real entity fragments across many nodes. Now the multi-hop path that should connect two facts is broken, because the connecting entity exists three times and none of the copies hold the full picture. The honest-looking answer becomes "not in the graph" — when it absolutely is in the graph, just smeared across duplicates.

A worked example

Here are three records, the kind that land in a multi-source pool:

| Record (source) | Name | Email | Phone | Company |
| --- | --- | --- | --- | --- |
| r1 (HubSpot) | John Smith | j.smith@acme.com | 555-0142 | Acme Corp |
| r2 (Salesforce) | John Smith | john.smith@beta.io | 555-0142 | Beta Industries |
| r3 (Mailchimp) | J. Smith | j.smith@acme.com |  | Acme |

The truth: r1 and r3 are the same person (Acme John — same email, same employer). r2 is a different John Smith at Beta who happens to share a phone number, because 555-0142 is a reused office line. The correct graph has two person nodes.

Now resolve it badly — key on exact phone, treat a phone match as strong evidence. r1 and r2 share 555-0142, so they merge. r3 joins on email. All three collapse into one node, and the graph dutifully writes both employment edges:

(John Smith) —[WORKS_AT]→ (Acme Corp)
(John Smith) —[WORKS_AT]→ (Beta Industries)

Ask the graph the question it exists to answer — "find anyone connected to both Acme and a competitor" — and it surfaces this phantom John Smith as a hit. Someone acts on a relationship that was never real. That single bad merge didn't just add a duplicate; it injected a false fact into a system whose entire value proposition is that its facts are trustworthy.
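
The bad merge is easy to reproduce. The sketch below is a toy resolver, not GoldenMatch: it merges any two records that agree exactly on one of the chosen fields, then reads off the connected components. The `resolve` function and its `merge_fields` parameter are hypothetical names.

```python
from itertools import combinations

records = {
    "r1": {"name": "John Smith", "email": "j.smith@acme.com",
           "phone": "555-0142", "company": "Acme Corp"},
    "r2": {"name": "John Smith", "email": "john.smith@beta.io",
           "phone": "555-0142", "company": "Beta Industries"},
    "r3": {"name": "J. Smith", "email": "j.smith@acme.com",
           "phone": None, "company": "Acme"},
}

def resolve(records, merge_fields):
    """Toy resolver: merge two records if they agree exactly on any merge field."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        for field in merge_fields:
            value = records[a][field]
            # Missing values never count as agreement.
            if value is not None and value == records[b][field]:
                parent[find(a)] = find(b)
                break

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())

naive = resolve(records, merge_fields=["phone", "email"])
email_only = resolve(records, merge_fields=["email"])
print(naive)
print(email_only)
```

With phone treated as merge evidence, all three records land in one cluster. With email alone, the toy resolver recovers the two-person truth. Email-only is just the rule that happens to fit these three rows; it is not a general recommendation.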

The same failure, at benchmark scale

We benchmark GoldenMatch against a synthetic-but-realistic multi-source CRM fixture: 467 records and 496 known-true match pairs across 180 distinct people, with the full mess of real data (nicknames, initials, typos, maiden and married surnames, work-versus-personal email, phone-format drift, company-suffix variants, missing fields). It has ground truth, so precision and recall are real, not vibes.

Run a naive zero-config matcher against it, the kind of "just dedupe it" call that looks reasonable in a notebook, and the autoconfig latches onto exact-phone and a source-id field as match keys. It over-merges exactly like the worked example, only across all 467 records:

| Approach (realistic multi-source CRM) | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Bare zero-config matching | 0.07 | 0.96 | 0.13 |

A precision of 0.07 means 93% of the predicted match pairs are wrong. Recall looks gorgeous at 0.96 because when you merge nearly everything, you do technically catch almost all the true pairs, along with a flood of false ones. Drop that output into a knowledge graph and you haven't built a graph; you've built a blender. Every over-merged node poisons every query that touches it.
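
For readers who want to check numbers like these on their own labeled data, here is a minimal sketch of pairwise precision, recall, and F1 over clusters. It assumes the benchmark scores record pairs, which is consistent with the "known-true match pairs" wording but is still an assumption; the function names are hypothetical.

```python
from itertools import combinations

def cluster_pairs(clusters):
    """All unordered record pairs that share a cluster."""
    out = set()
    for cluster in clusters:
        out.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return out

def pairwise_scores(predicted, truth):
    """Pairwise precision, recall, and F1 of predicted clusters against truth."""
    pred, true = cluster_pairs(predicted), cluster_pairs(truth)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# The three-record worked example: truth has two people, the bad run has one.
truth = [{"r1", "r3"}, {"r2"}]
over_merged = [{"r1", "r2", "r3"}]
p, r, f1 = pairwise_scores(over_merged, truth)

# Sanity check on the reported row: P=0.07 and R=0.96 give F1 of about 0.13.
reported_f1 = 2 * 0.07 * 0.96 / (0.07 + 0.96)
```

Even on three records the pattern is visible: recall is perfect, and precision has already dropped to one in three.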

The part the quality conversation skips: cost

Bad entity resolution doesn't add cost linearly. It compounds, because the graph's whole point is connection, and you've made the wrong things connect.

How GoldenMatch approaches the node layer

GoldenMatch is the entity-resolution engine — the part that decides what the nodes are before anything builds a graph on top. The pipeline is four stages:

  1. Blocking — group records that share a high-confidence key (normalized email, phone, tax id) so you never compare all N-squared pairs.
  2. Scoring — for each candidate pair, run field-aware scorers (name as token similarity, email exact, address token-sort, phone as normalized E.164) and combine them with Fellegi-Sunter probabilistic weights.
  3. Clustering — Union-Find over the pairs above threshold, so a chain of pairwise matches resolves into one coherent entity.
  4. Survivorship — pick the surviving value per field per cluster, producing one golden record per real-world thing.
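
Stage 4 is the easiest to picture with a toy rule. The sketch below is an assumption for illustration, not GoldenMatch's survivorship logic: for each field it keeps the most frequent non-empty value and breaks ties by taking the longest string. The `golden_record` name is hypothetical.

```python
from collections import Counter

def golden_record(cluster_records):
    """Per field: most frequent non-empty value, ties broken by longest string."""
    fields = {field for rec in cluster_records for field in rec}
    golden = {}
    for field in sorted(fields):
        values = [rec[field] for rec in cluster_records if rec.get(field)]
        if not values:
            golden[field] = None
            continue
        counts = Counter(values)
        golden[field] = max(counts, key=lambda v: (counts[v], len(v)))
    return golden

# The two Acme records from the worked example earlier in the post.
cluster = [
    {"name": "John Smith", "email": "j.smith@acme.com",
     "phone": "555-0142", "company": "Acme Corp"},
    {"name": "J. Smith", "email": "j.smith@acme.com",
     "phone": None, "company": "Acme"},
]
golden = golden_record(cluster)
print(golden)
```

Real survivorship rules usually also weigh source trust and recency. Those are left out here to keep the sketch short.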

The output of stage 3 is, quite literally, a graph. Internally, resolution emits explicit match edges — (src_record, dst_record, score) — and the entities are the connected components over those edges:

# the edge an entity-resolution run actually produces
Edge = tuple[str, str, float]  # (src_record_id, dst_record_id, match_score)

# entities = connected components over the retained edges
def compute_components(record_ids, edges):
    # Union-Find: every record lands in exactly one component;
    # records touched by no edge become their own singleton.
    ...
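
The stub above leaves the body elided. For readers who have not written Union-Find before, here is a minimal self-contained sketch of the same idea. It is not GoldenMatch's implementation: the function name, the `threshold` parameter, and its default are assumptions.

```python
Edge = tuple[str, str, float]  # (src_record_id, dst_record_id, match_score)

def compute_components_sketch(record_ids, edges, threshold=0.5):
    """Group records into connected components over edges at or above threshold."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst, score in edges:
        if score >= threshold:
            parent[find(src)] = find(dst)

    # Records touched by no retained edge stay as their own singleton.
    components = {}
    for rid in record_ids:
        components.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in components.values())

edges = [("a", "b", 0.92), ("b", "c", 0.81), ("c", "d", 0.30)]
components = compute_components_sketch(["a", "b", "c", "d", "e"], edges)
print(components)
```

Records a, b, and c chain into one entity even though a and c were never compared directly. The weak c to d edge falls below the threshold, so d stays separate, and e is a singleton because no edge touches it.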

Every edge carries its score and its provenance — which run produced it, which fields fired. That matters for a knowledge graph because it means the node-formation decision is itself auditable. You can ask why two records became one entity, instead of trusting a black box:

import goldenmatch as gm

# explain why a cluster merged, in plain language
reason = gm.explain_cluster_nl(cluster, df, matchkeys)
print(reason)
# -> "Merged on exact email + high name similarity (0.94);
#     address agreed after standardization."

Blocking and scorer weighting decide everything at scale

Two of those four stages are where graphs are won or lost, and they're the two the naive approach gets wrong.

Blocking is not optional once the graph is big. A graph with 10 million nodes has on the order of fifty trillion possible pairs. You cannot score them all, and you shouldn't want to — almost none are matches. Blocking groups records by a high-confidence key so you only compare the thousands of pairs that could plausibly be the same entity. The blocking key is itself a quality lever: too loose and you drown in comparisons, too tight and you never put two true matches in the same block, so they can never merge. Good blocking is what makes resolution tractable and recall-preserving at the same time.
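
The arithmetic and the mechanism both fit in a few lines. This is a sketch under stated assumptions: the four records, the normalized-email blocking key, and the helper names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def candidate_pairs(records, block_key):
    """Only pairs that share a blocking key become candidates for scoring."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        key = block_key(rec)
        if key is not None:
            blocks[key].append(rid)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def email_key(rec):
    """Hypothetical blocking key: trimmed, lowercased email, or None if missing."""
    return rec["email"].strip().lower() if rec["email"] else None

records = {
    "a": {"email": "J.Smith@acme.com"},
    "b": {"email": "j.smith@acme.com "},
    "c": {"email": "john.smith@beta.io"},
    "d": {"email": None},
}

print(all_pairs(10_000_000))  # pairs among 10 million nodes
pairs = candidate_pairs(records, email_key)
print(all_pairs(len(records)), "possible pairs,", len(pairs), "candidate pair(s)")
```

Record d has no email, so this key never places it in a block and it can never merge with anything. That is the "too tight" failure in miniature, and it is why production setups typically combine more than one blocking key.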

Scorer weighting is why the over-merge happened — and how it's fixed. The bare run collapsed because it treated an exact phone match as strong evidence. Fellegi-Sunter weights each field by how discriminating it actually is: it learns, per field, how often agreement happens among true matches versus by random chance. A shared rare surname is powerful evidence. A shared common phone format, or a phone number that turns out to be a reused office line, is weak. The fix for the 0.07 disaster is a curated, column-aware config: exclude the source-id and per-source identifier columns from matching, up-weight surname and email, and demote phone to a blocking-only signal — good enough to put records in the same candidate block, not strong enough to merge them on its own.
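
The weighting idea can be sketched with the standard Fellegi-Sunter log-likelihood ratios. Here `m` is the probability that a field agrees among true matches and `u` is the probability that it agrees among non-matches. The numbers below are made up for illustration; in practice they are estimated from the data, and nothing here reflects GoldenMatch's actual parameters.

```python
import math

# Hypothetical m/u probabilities, chosen only to illustrate the mechanics.
FIELDS = {
    "email": {"m": 0.95, "u": 0.001},  # rarely agrees by chance
    "phone": {"m": 0.80, "u": 0.20},   # shared office lines make chance agreement common
}

def field_weight(m, u, agrees):
    """Agreement adds log2(m/u); disagreement adds log2((1-m)/(1-u))."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def match_weight(agreement):
    """Sum the per-field weights for one candidate pair."""
    return sum(
        field_weight(FIELDS[field]["m"], FIELDS[field]["u"], agrees)
        for field, agrees in agreement.items()
    )

# Pair that shares only a phone number (like r1 and r2 in the worked example).
phone_only = match_weight({"email": False, "phone": True})
# Pair that shares only an email address.
email_only = match_weight({"email": True, "phone": False})
print(round(phone_only, 2), round(email_only, 2))
```

Under these assumed probabilities a shared phone with a conflicting email nets out negative, while a shared email with a conflicting phone stays clearly positive. The same agreement counts for very different amounts depending on how often it happens by chance.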

This is also why "just embed every entity and cluster by cosine similarity" — the tempting LLM-era shortcut — underperforms on the node layer. Embeddings blur exactly the distinction resolution depends on. Two different people with similar names and roles embed close together (a recipe for over-merge), while the same person under a typo'd name or a maiden-versus-married surname can embed surprisingly far apart (a recipe for under-merge). Semantic similarity is a great blocking signal and a poor matching decision. The decision needs field-level, weighted, explainable evidence.

The result

Same engine, same data, real curated config instead of the bare default:

| Approach (realistic multi-source CRM) | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Bare zero-config matching | 0.07 | 0.96 | 0.13 |
| Curated, column-aware config | high |  | 0.84 |

F1 climbs from 0.13 to 0.84. You cannot reach an F1 of 0.84 with a precision of 0.07 — so precision recovered by more than an order of magnitude, which is the entire ballgame for a knowledge graph. The nodes are now mostly right. On a clean single-source academic fixture (Febrl, 500 records) the same curated path scores 0.83, and the broader lesson is honest: the right configuration is data-shape-dependent. There is no single global "dedupe" setting that is correct for both a clean census extract and a five-system CRM swamp. Resolution is a modeling decision, and treating it like a one-liner is how graphs get poisoned.
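
The "order of magnitude" claim can be checked from the F1 definition alone. Rearranging F1 = 2PR / (P + R) gives P = F1 * R / (2R - F1), which is smallest when recall is at its maximum of 1. The helper name below is hypothetical.

```python
def min_precision_for_f1(f1, max_recall=1.0):
    """Smallest precision that can reach f1 when recall is at most max_recall.

    From F1 = 2PR / (P + R): P = F1 * R / (2R - F1).
    P decreases as R grows, so the minimum sits at the largest allowed R.
    """
    return f1 * max_recall / (2 * max_recall - f1)

bound = min_precision_for_f1(0.84)
improvement = bound / 0.07
print(round(bound, 3), round(improvement, 1))
```

Even with perfect recall, an F1 of 0.84 needs precision of at least about 0.72, which is more than ten times the 0.07 of the bare run. The true precision is at least that high; the exact figure is not derivable from F1 alone.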

A few more things that matter once the graph gets big:

A note on honesty, because it's the whole point of measuring: those precision and recall numbers come from a fixture with ground truth. On real data without labels, you don't get to claim precision — you get a match rate, plus the tooling to evaluate it. The discipline is the same one a knowledge graph needs: know which of your nodes you've actually verified, and which you're trusting on faith.

Frequently asked questions

Is entity resolution the same as coreference resolution?

No, and a knowledge graph needs both. Coreference resolution links mentions within a single document — "Tim Cook... he... the CEO" all point to the same referent in that text. Entity resolution links records and mentions across the whole corpus or database to one canonical entity — the Tim Cook in this contract is the Tim Cook in that news article and the tcook@ row in your CRM. Coreference cleans up one document; entity resolution is what makes the graph one coherent thing across all of them.

If an LLM builds my graph, do I still need entity resolution?

More than ever. The LLM extracts entities; it does not canonicalize them across documents. Every surface form it emits — every "IBM" and "I.B.M." and "International Business Machines" — is a candidate node until something resolves them. Easier extraction means more mentions, which means a bigger resolution problem, not a smaller one. Extraction and resolution are different jobs; the model only does the first.

Isn't fuzzy string matching enough?

No. Fuzzy matching gets you candidate pairs — it doesn't decide clusters, weight evidence by how discriminating each field is, or handle the transitive chains where A matches B and B matches C, so A, B, and C are one entity. That chaining is what Union-Find resolves. And pure string similarity over-merges different people with similar names while missing the same person under a nickname or a maiden name. Fuzzy matching is one signal inside resolution, not a replacement for it.

How do I know my resolution is actually good?

Measure it against ground truth where you have it — precision, recall, F1 on a labeled set, the way the numbers in this post were produced. Where you don't have labels, track the match rate, sample clusters for human review, and never report a match rate as if it were precision. Keep a person in the loop for the ambiguous middle, and feed those decisions back so the config improves. A graph you can't evaluate is a graph you're trusting on faith.

Key takeaways

Try it

Entity resolution is the layer worth getting right before you build anything graph-shaped on top of it.
