2026-05-17/Ben Severn

The OSS vuln-DB 'consensus' is a redistribution artifact

Cross-source remediation agreement looks near-perfect at 99-100%. De-duplicate the mirrors and it collapses to 70%. The lift is 1,898×.

If you join two public OSS vulnerability databases on (vuln_id, ecosystem, package) and check how often they agree on which version fixes the bug, the answer looks reassuring. In 32,746 multi-source advisory-package groups in the reconciled 6.1M-row corpus, exactly one has a non-trivial fix-version disagreement. 0.003%. The databases agree.

That number is wrong. Not "lower than reported", but methodologically wrong. The 32,746 groups are dominated by OSV's per-ecosystem buckets redistributing the same upstream advisories that the other side of the join is already publishing. If ghsa-reviewed says CVE-2021-44228 is fixed in log4j-core 2.17.0 and osv-Maven says CVE-2021-44228 is fixed in log4j-core 2.17.0, that's not two sources reaching the same conclusion. It's one source counted twice through a redistribution pipe.

So I redid the agreement test with two changes: join on the CVE alias instead of the source-local vuln_id (because PYSEC-2021-50 and GHSA-2c69-r2jh-xjvm never share a literal vuln_id), and classify each source pair as INDEPENDENT or MIRROR based on what's actually flowing through OSV's pipes. The corrected number is 70.4% agreement, 13.1% true contradiction. The relative risk of finding a disagreement between independent sources versus pure mirror pairs is 1,898×.
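The alias join can be sketched roughly as follows. This is a minimal illustration, not the repo's code: the record fields (`source`, `vuln_id`, `aliases`, `ecosystem`, `package`, `fixed`) and the placeholder identifiers are assumptions made up for the example.

```python
from collections import defaultdict

def cve_keys(record):
    """Return the CVE identifiers a record is known by (its own id or an alias)."""
    ids = [record["vuln_id"], *record.get("aliases", [])]
    return sorted({i for i in ids if i.startswith("CVE-")})

def group_by_cve(records):
    """Group records on (CVE alias, ecosystem, package) instead of the local vuln_id."""
    groups = defaultdict(list)
    for rec in records:
        for cve in cve_keys(rec):
            groups[(cve, rec["ecosystem"], rec["package"])].append(rec)
    # Keep only groups where at least two distinct sources publish the advisory.
    return {key: recs for key, recs in groups.items()
            if len({r["source"] for r in recs}) >= 2}

# Placeholder records (hypothetical ids and versions, not real advisories).
records = [
    {"source": "ghsa-reviewed", "vuln_id": "GHSA-aaaa-bbbb-cccc",
     "aliases": ["CVE-0000-0001"], "ecosystem": "PyPI",
     "package": "examplepkg", "fixed": {"1.2.0"}},
    {"source": "pypa", "vuln_id": "PYSEC-0000-1",
     "aliases": ["CVE-0000-0001"], "ecosystem": "PyPI",
     "package": "examplepkg", "fixed": {"1.2.3"}},
]
multi = group_by_cve(records)
```

A join on the literal `vuln_id` would put these two records in different groups; keying on the shared CVE alias puts them in the same multi-source group, where their fix sets can be compared.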

This post walks through how that flip happens, why it survives every robustness check I threw at it, and what it implies for anyone consuming "OSV says…" or "GHSA says…" as if they were independent witnesses.

[Figure: side-by-side diagram. Consensus is three independent witnesses agreeing on a verdict; mirror is one source fed through three redistribution pipes and counted as if it were three witnesses.]

What "mirror" means in this corpus

OSV.dev is, mechanically, a federation layer. Its per-ecosystem buckets pull from upstream feeds:

| OSV bucket | Primary upstream |
| --- | --- |
| osv-Maven | GHSA (only) |
| osv-Packagist | GHSA (only) |
| osv-npm | GHSA (only) |
| osv-NuGet | GHSA (only) |
| osv-PyPI | GHSA plus PyPA |
| osv-Go | GHSA plus Go vulndb |
| osv-crates.io | GHSA plus RustSec |
| osv-Debian / osv-Ubuntu / osv-Alpine / etc. | Distro security teams (out of v1 scope for this analysis) |

When you join ghsa-reviewed against osv-Maven, you're not joining two sources — you're joining a source against its own redistribution. Same for npm, Packagist, NuGet. Three other OSV buckets (PyPI, Go, crates.io) are mixed: they pull GHSA and an independent feed. The only pair in the corpus where both sides have zero shared upstream is ghsa-reviewed × pypa — GHSA's Python entries vs PyPA's Python entries.
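One way to express that classification is a shared-upstream rule. The sketch below is an assumption about the logic, not a copy of analyze_independence.py: the `UPSTREAMS` map is transcribed from the bucket table above, and `classify_pair` is a hypothetical helper name.

```python
# Upstream feeds per source, transcribed from the bucket table.
UPSTREAMS = {
    "ghsa-reviewed": {"GHSA"},
    "pypa": {"PyPA"},
    "osv-Maven": {"GHSA"},
    "osv-Packagist": {"GHSA"},
    "osv-npm": {"GHSA"},
    "osv-NuGet": {"GHSA"},
    "osv-PyPI": {"GHSA", "PyPA"},
    "osv-Go": {"GHSA", "Go vulndb"},
    "osv-crates.io": {"GHSA", "RustSec"},
}

def classify_pair(a, b):
    """Classify a source pair by how much upstream the two sides share."""
    ua, ub = UPSTREAMS[a], UPSTREAMS[b]
    if not (ua & ub):
        return "INDEPENDENT"   # zero shared upstream
    if ua == ub:
        return "PURE_MIRROR"   # identical upstream: one side re-stamps the other
    return "MIXED_MIRROR"      # shared upstream plus at least one extra feed

for bucket in ("osv-Maven", "osv-PyPI"):
    print(bucket, classify_pair("ghsa-reviewed", bucket))
print("pypa", classify_pair("ghsa-reviewed", "pypa"))
```

Under this rule the only GHSA pairing that comes out INDEPENDENT is ghsa-reviewed × pypa, which matches the text.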

analyze_independence.py classifies the pairs and runs the same set-equality agreement test on each class. The three-tier table is the story:

| Pair class | Cells (both publish a fix) | Any disagreement | True contradiction |
| --- | --- | --- | --- |
| Pure mirror (GHSA × OSV-{Maven, npm, Packagist, NuGet}) | 19,262 | 0.016% (CI95 [0.005, 0.046]%) | 0 / 19,262 |
| Mixed mirror (GHSA × OSV-{PyPI, Go}) | 8,914 | 8.02% (CI95 [7.48, 8.60]%) | 0 / 8,914 |
| Independent (GHSA × PyPA) | 2,652 | 29.56% (CI95 [27.86, 31.33]%) | 13.08% (347 / 2,652) |

Three orders of magnitude separate the pure-mirror disagreement rate from the independent-pair rate. Even the mixed mirrors — where OSV pulls GHSA but also PyPA or Go vulndb — show two orders of magnitude more disagreement than the pure-mirror pairs, because the non-GHSA upstream introduces real source-of-truth divergence into the redistribution.
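The CI95 brackets in the table are Wilson score intervals. As a worked check, here is a small sketch of the standard formula applied to the pure-mirror row. The count of 3 disagreeing cells is an assumption, back-calculated from 0.016% of 19,262; the table does not state it directly.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes out of n trials (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Pure-mirror row: 3 disagreeing cells (assumed) out of 19,262.
lo, hi = wilson_interval(3, 19262)
print(f"{100 * lo:.3f}% .. {100 * hi:.3f}%")  # prints 0.005% .. 0.046%
```

Unlike the naive normal approximation, the Wilson interval stays sensible when the count is this close to zero, which is why it suits the mirror rows.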

The test that pins it down

The chi-square comparison is in analyze_convergence_inversion.py and inlines both the Wilson confidence interval and the Yates-corrected chi-square so the math is reviewable rather than hidden behind a scipy.stats import:

chi2 = sum((abs(o - e) - 0.5) ** 2 / e
           for o, e in zip(observed, expected) if e > 0)
p = math.erfc(math.sqrt(chi2 / 2.0))  # df=1

Run on the cloud-built independence.json from today's pipeline run:

| Comparison | χ² (df=1) | p-value | Relative risk |
| --- | --- | --- | --- |
| INDEPENDENT vs PURE_MIRROR (any disagreement) | 5,869.2 | below FP floor | 1,898× |
| INDEPENDENT vs MIXED_MIRROR (any disagreement) | 838.9 | 1.91e-184 | 3.7× |
| MIXED_MIRROR vs PURE_MIRROR (any disagreement) | 1,569.5 | below FP floor | 515× |
| INDEPENDENT vs PURE_MIRROR (true contradiction) | 2,552.5 | below FP floor | ≥ 656× (Wilson-bounded; mirror cell is zero) |
| INDEPENDENT vs MIXED_MIRROR (true contradiction) | 1,197.9 | 1.72e-262 | ≥ 304× |
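The "any disagreement" rows can be approximately re-derived from the three-tier table alone. The sketch below wraps the same Yates-corrected sum in a hypothetical helper, `yates_chi2`. The disagreement counts (784, 715, 3) are assumptions back-calculated from the published percentages, so expect agreement only to within rounding.

```python
import math

def yates_chi2(k1, n1, k2, n2):
    """Yates-corrected chi-square (df=1) comparing proportions k1/n1 and k2/n2."""
    observed = [k1, n1 - k1, k2, n2 - k2]
    total, hits = n1 + n2, k1 + k2
    # Expected counts under the null that both groups share one disagreement rate.
    expected = [n1 * hits / total, n1 * (total - hits) / total,
                n2 * hits / total, n2 * (total - hits) / total]
    chi2 = sum((abs(o - e) - 0.5) ** 2 / e
               for o, e in zip(observed, expected) if e > 0)
    p = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, df=1
    return chi2, p

def relative_risk(k1, n1, k2, n2):
    """Ratio of the two disagreement rates."""
    return (k1 / n1) / (k2 / n2)

# (disagreeing cells, total cells); numerators are back-calculated assumptions.
COUNTS = {
    "INDEPENDENT": (784, 2652),
    "MIXED_MIRROR": (715, 8914),
    "PURE_MIRROR": (3, 19262),
}

for a, b in [("INDEPENDENT", "PURE_MIRROR"),
             ("INDEPENDENT", "MIXED_MIRROR"),
             ("MIXED_MIRROR", "PURE_MIRROR")]:
    chi2, p = yates_chi2(*COUNTS[a], *COUNTS[b])
    rr = relative_risk(*COUNTS[a], *COUNTS[b])
    print(f"{a} vs {b}: chi2={chi2:.1f} p={p:.3g} RR={rr:.1f}x")
```

With these counts the first comparison's p-value prints as 0, which is the underflow the table labels "below FP floor".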

The "below FP floor" entries don't mean "we couldn't compute it". They mean χ² is large enough that erfc(sqrt(chi2/2)) underflows IEEE 754 doubles to literal zero. To put a number on it: at χ² = 5,869 on one degree of freedom, the statistic corresponds to roughly 77 standard deviations above the null (√5,869 ≈ 76.6). The disagreement-rate gap is the most distinguishable two-sample comparison I've ever seen in a security-data context.
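One way to put a magnitude on an underflowing p-value is to work in log space. The sketch below is an illustration rather than part of the pipeline: `log10_p_df1` is a hypothetical helper that uses `math.erfc` while it still returns a nonzero value and otherwise falls back to the leading term of the standard asymptotic expansion, erfc(x) ≈ exp(-x²) / (x·√π).

```python
import math

def log10_p_df1(chi2):
    """log10 of the df=1 chi-square survival function, robust to underflow."""
    x = math.sqrt(chi2 / 2.0)
    p = math.erfc(x)
    if p > 0.0:
        return math.log10(p)
    # erfc underflowed to 0.0: use the leading asymptotic term instead.
    return (-x * x - math.log(x * math.sqrt(math.pi))) / math.log(10)

for chi2 in (838.9, 5869.2):
    z = math.sqrt(chi2)  # equivalent |z| for one degree of freedom
    print(f"chi2={chi2}: z~{z:.1f}, log10(p)~{log10_p_df1(chi2):.1f}")
```

For χ² = 5,869.2 this lands near log10(p) ≈ -1276, far below the smallest positive double (about 5e-324).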

The three-pair-class monotone series is the methodology check: if my mirror classification were wrong (say, if osv-Maven were actually shipping independent curation rather than re-stamping GHSA), its disagreement rate would diverge from the other pure-mirror buckets. Instead all four pure-mirror pairs sit at 99-100% agreement, the two mixed mirrors sit at 90-92%, and the lone independent pair sits at 70%. The gradient is the validation.

What this means for tools that consume "the public corpus"

If you maintain a scanner, an SBOM pipeline, or any automation that resolves "is this CVE fixed?" by querying multiple databases and looking for agreement, this changes what agreement means:

  1. GHSA × any pure-mirror OSV bucket = one witness, not two. OSV-Maven, OSV-npm, OSV-Packagist, OSV-NuGet are GHSA passthrough. Treat the union of those four with GHSA as a single source for fix-version corroboration purposes.

  2. OSV-PyPI and OSV-Go are partial second witnesses. They redistribute GHSA and an independent upstream (PyPA, Go vulndb respectively). The roughly 8% disagreement vs GHSA (8.02% across the two buckets) is the upstream signal leaking through. Worth flagging in pipelines that want a "second source" check.

  3. ghsa-reviewed × pypa is the only fully-independent pair available in the public corpus for fix-version comparison. They disagree on 29.6% of cases where both publish a fix. 13% are true contradictions: the fix sets are neither equal nor one a subset of the other, so both sources publish a fix but disagree on which version. The other 16.5% are completeness asymmetries, where one source tracks more backport branches than the other.
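The three outcomes in item 3 reduce to a set comparison. Here is a minimal sketch; the function name and label strings are hypothetical, and the version strings are placeholders.

```python
def classify_fix_sets(a, b):
    """Compare two sources' fix-version sets for one advisory-package cell."""
    a, b = set(a), set(b)
    if not a or not b:
        raise ValueError("both sources must publish at least one fix version")
    if a == b:
        return "AGREE"
    if a < b or b < a:
        # One side lists strictly more fixes, e.g. extra backport branches.
        return "COMPLETENESS_ASYMMETRY"
    # Neither equal nor subset: the sources name different fix versions.
    return "TRUE_CONTRADICTION"

print(classify_fix_sets({"2.0.0"}, {"2.0.0"}))           # AGREE
print(classify_fix_sets({"1.0.5", "2.0.0"}, {"2.0.0"}))  # COMPLETENESS_ASYMMETRY
print(classify_fix_sets({"2.0.0"}, {"2.0.3"}))           # TRUE_CONTRADICTION
```

"Any disagreement" in the tables corresponds to everything that is not AGREE; "true contradiction" is only the last branch.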

Concrete examples of the contradictions (from output/independence.json):

| CVE | Package | GHSA-reviewed says fixed in | PyPA says fixed in |
| --- | --- | --- | --- |
| CVE-2021-32297 | lief | 0.11.0 | 0.11.5 |
| CVE-2024-34528 | wordops | 3.21.0 | 3.21.3 |
| CVE-2023-39508 | apache-airflow | 2.6.0b1 (beta) | 2.6.0 (final) |

A user on lief 0.11.2 would be told "fixed, you're safe" by GHSA and "still vulnerable" by PyPA. The two sources are looking at the same advisory in the same ecosystem on the same package and reaching different conclusions about which patch release closes the bug.
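The lief case can be worked through in a few lines. This sketch uses a deliberately naive dotted-integer comparison, which is an assumption made for the example; it would not handle pre-release tags such as 2.6.0b1, where a PEP 440 aware parser like `packaging.version.Version` would be the safer choice.

```python
def parse(version):
    """Naive parser: '0.11.2' -> (0, 11, 2). Plain dotted integers only."""
    return tuple(int(part) for part in version.split("."))

def is_fixed(installed, fixed_in):
    """True if the installed version is at or past the claimed fix version."""
    return parse(installed) >= parse(fixed_in)

installed = "0.11.2"
claims = {"ghsa-reviewed": "0.11.0", "pypa": "0.11.5"}  # from the lief row
verdicts = {src: is_fixed(installed, fix) for src, fix in claims.items()}
conflict = len(set(verdicts.values())) > 1

print(verdicts)   # {'ghsa-reviewed': True, 'pypa': False}
print(conflict)   # True
```

A consumer that takes the first answer it finds gets either verdict depending on query order; surfacing `conflict` makes the disagreement visible instead.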

What this doesn't mean

A few things this finding isn't saying, that you might infer:

The methodological point is narrower: when researchers report "cross-source agreement" without specifying their join key and source-pair independence assumptions, they are almost certainly measuring redistribution fidelity. That's a different (and much less interesting) measurement than "do independent sources reach the same conclusion?"

Honest limitations

Documented in docs/methodology.md of the reconciliation repo, but the load-bearing ones:

Reproducing this

git clone https://github.com/benseverndev-oss/goldenmatch-vuln-attribution
cd goldenmatch-vuln-attribution
python sync_cloud.py          # pulls latest cloud-built parquet + JSONs
cat output/convergence_inversion.json | jq .per_class

Or, against the latest GitHub release directly (no clone, no auth):

curl -L -O https://github.com/benseverndev-oss/goldenmatch-vuln-attribution/releases/download/latest/convergence_inversion.json

Every number in this post is reproducible from the release. The pipeline that produces it runs on a GitHub Actions large-new-64GB runner in about 6 minutes including the 1.5 GB fetch.

Key takeaways

Where this fits in the series

This is post 1 of three on the vulnerability reconciliation work. Post 2 — "Only 7.5% of CVEs are package-representable" — formalizes which CVEs you can even express in package-version semantics, and why CISA KEV-with-ransomware drops to 4.4%. Post 3 — "SBOM scanning with three-state verdicts" — walks through check_affected.py, the operator-facing matcher that lets you ask "am I affected at version X?" of a CycloneDX SBOM and get back AFFECTED / NOT_AFFECTED / UNKNOWN with per-interval evidence.

Source code, the public release, the methodology doc, and the inter-rater κ for the qualitative bucketing: github.com/benseverndev-oss/goldenmatch-vuln-attribution.
