2026-05-15/Ben Severn

88% of actively-exploited CVEs aren't in any package ecosystem

Re-running the OSS vuln reconciliation at 6.1M records and 40 sources surfaces a structural blind spot in every package-level scanner.

Tags: entity-resolution, goldenmatch, security, vulnerabilities, cisa-kev

CISA's Known Exploited Vulnerabilities catalog is the closest thing the security industry has to ground truth on which CVEs attackers are actually using right now. It's a curated list, hand-maintained by US CISA, with 1,592 entries as of this week. If you read security news, you've read about most of them.

So a defensible question is: of those 1,592 actively-exploited CVEs, how many show up in the OSS vulnerability databases your package scanner queries?

The answer is 188 out of 1,592, or 11.8%. The other 1,404 (88%) have zero ecosystem coverage. Not "they're missing from one database" — missing from all of them. OSV, GHSA, PyPA, RustSec, Go vulndb, every ecosystem-specific feed: none of them carry these CVEs because the affected software isn't distributed through a managed package ecosystem in the first place.

This is a follow-up to the 4/10 post on reconciling 15 OSS vulnerability databases. That run was 869k records across 15 sources and surfaced an asymmetry between GHSA-reviewed and the rest. This run is 6.1M records across 40 sources, runs the full Golden Suite end-to-end on a 64 GB GitHub Actions runner, and produces a substantially stronger finding. The headline is the KEV / package-scanner blind spot above. The original post stands as the snapshot at the smaller scale.

What changed in the pipeline

The 4/10 version was a hand-rolled union-find over (vuln_id, aliases) edges. The pipeline is now four Suite stages wired end-to-end via GoldenPipe:

| Stage | Library call | What it does |
| --- | --- | --- |
| Check | goldencheck.scan() | Profiles records.parquet, emits DQ health grade + per-column nulls / types / outliers |
| Normalize | goldenflow.transform() | Strips + uppercases vuln_id and aliases so transitive joins survive case noise |
| Match | goldenmatch.build_clusters() | Union-find + cluster-quality scoring on the alias edge list |
| Orchestrate | goldenpipe.run() | Stitches Fetch / Extract / Check / Normalize / Match / Analyze with per-stage status |
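To make the Normalize row concrete, here is a minimal sketch of strip-and-uppercase normalization in plain Python. It is not goldenflow's implementation; the helper names (normalize_id, normalize_record) and the identifiers in the sample records are invented for illustration.

```python
def normalize_id(raw: str) -> str:
    """Strip surrounding whitespace and uppercase one identifier."""
    return raw.strip().upper()


def normalize_record(record: dict) -> dict:
    """Return a copy of the record with vuln_id and aliases normalized.

    Blank aliases and self-references are dropped so they do not become
    useless edges in the alias graph.
    """
    vuln_id = normalize_id(record["vuln_id"])
    aliases = {normalize_id(a) for a in record.get("aliases", []) if a.strip()}
    aliases.discard(vuln_id)
    return {**record, "vuln_id": vuln_id, "aliases": sorted(aliases)}


# Invented identifiers: two sources spell the same CVE differently.
records = [
    {"vuln_id": "ghsa-aaaa-bbbb-cccc", "aliases": [" cve-2099-0001", ""]},
    {"vuln_id": "CVE-2099-0001 ", "aliases": []},
]
normalized = [normalize_record(r) for r in records]
```

Without this step the alias " cve-2099-0001" and the record ID "CVE-2099-0001 " are different strings, so the two records would never land in the same cluster.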

The hand-rolled parent: dict[str, str] from the 4/10 post is gone. build_clusters does the same union-find with proper path compression, and the cluster-quality scoring catches cases the alias graph alone doesn't (transitive identifier chains across more than one hop, conflicting aliases from different sources for the same canonical vuln).
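For readers who have not seen the technique, a minimal sketch of union-find with path compression over an alias edge list follows. This is a generic textbook version, not the code inside goldenmatch.build_clusters(), and it leaves out cluster-quality scoring entirely. The class, function, and identifiers are illustrative.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression and union by size."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}
        self.size: dict[str, int] = {}

    def find(self, x: str) -> str:
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
            return x
        # First pass: walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a: str, b: str) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Attach the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]


def cluster_alias_edges(edges: list[tuple[str, str]]) -> list[list[str]]:
    """Group identifiers connected by (vuln_id, alias) edges."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    groups: dict[str, set[str]] = defaultdict(set)
    for node in list(uf.parent):
        groups[uf.find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Invented identifiers: a two-hop chain plus a record with no aliases.
edges = [
    ("GHSA-AAAA-BBBB-CCCC", "CVE-2099-0001"),
    ("CVE-2099-0001", "PYSEC-2099-1"),
    ("CVE-2099-0002", "CVE-2099-0002"),
]
clusters = cluster_alias_edges(edges)
```

The GHSA and PYSEC identifiers never appear on the same edge, but both end up in one cluster because they share the CVE as an intermediate hop.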

Three new sources joined the corpus:

Total ingested: 6,126,895 records across 40 sources. Canonical vulns after reconciliation: 847,475. Clusters with 2+ cross-database IDs: 358,170 (42% of canonical vulns reconcile to more than one source).

How the pipeline runs at this scale

The 4/10 pipeline completed in 90 seconds on a laptop. The 5/15 pipeline doesn't fit on a laptop without skipping the 556 MB CVE Project archive, and even with the skip the extract phase takes ~3 minutes against the working set.

The full run targets the large-new-64GB GitHub Actions runner (16 vCPU, 64 GB RAM, 600 GB SSD) and finishes in about 5 minutes including the fetch. The workflow lives at .github/workflows/full-pipeline.yml:

runs-on: large-new-64GB
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with: { python-version: '3.12' }
  - run: pip install -r requirements.txt
  - run: python run_pipeline.py
  - uses: actions/upload-artifact@v4
    with: { name: pipeline-outputs, path: output/ }

run_pipeline.py is the GoldenPipe orchestrator. It runs the six stages (Fetch / Extract / Check / Normalize / Match / Analyze) with per-stage status reporting and skips any stage whose outputs already exist on disk. The same script works on a laptop with SKIP_CVELIST=1 to skip the CVE Project archive and stay under 4 GB RAM.
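The skip-if-output-exists behavior can be sketched in a few lines. This is an assumption about the general shape, not the contents of run_pipeline.py or the goldenpipe API. The stage names, output filenames, and helper functions are hypothetical; only the SKIP_CVELIST variable comes from the text above.

```python
import tempfile
from pathlib import Path
from typing import Callable, Mapping

# (stage name, output filename, work function); all hypothetical.
Stage = tuple[str, str, Callable[[Path], None]]


def wants_cvelist(env: Mapping[str, str]) -> bool:
    """SKIP_CVELIST=1 opts out of the large CVE Project archive."""
    return env.get("SKIP_CVELIST") != "1"


def run_stages(stages: list[Stage], out_dir: Path) -> dict[str, str]:
    """Run stages in order, skipping any whose output file already exists."""
    status: dict[str, str] = {}
    for name, filename, work in stages:
        target = out_dir / filename
        if target.exists():
            status[name] = "skipped"
            continue
        work(target)
        status[name] = "ok"
    return status


def write_marker(target: Path) -> None:
    """Stand-in for real stage work: just create the output file."""
    target.write_text("done\n")


stages: list[Stage] = [
    ("extract", "records.parquet", write_marker),
    ("match", "clusters.json", write_marker),
]

with tempfile.TemporaryDirectory() as tmp:
    first_run = run_stages(stages, Path(tmp))
    second_run = run_stages(stages, Path(tmp))
```

On the first pass both stages report "ok"; on the second pass both report "skipped" because their outputs are already on disk. In a real run, os.environ would be passed to wants_cvelist.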

Finding 1: 88% of CISA-KEV CVEs are invisible to package scanners

| KEV slice | Count | % of KEV |
| --- | --- | --- |
| Total KEV-listed canonical vulns | 1,592 | 100% |
| With any ecosystem coverage (OSV / GHSA / PyPA / RustSec / Go) | 188 | 11.8% |
| In github-reviewed (the slice Dependabot surfaces) | 120 | 7.5% |
| Known to be used in ransomware campaigns | 321 | 20.2% |
| No ecosystem coverage anywhere | 1,404 | 88.2% |

The bugs being exploited in the wild today are overwhelmingly system software: Exchange Server, Cisco IOS, F5 BIG-IP, Fortinet appliances, VMware vCenter, Chrome / Firefox, Linux kernel, Windows components. None of those ship through pip, npm, cargo, or go get. A package scanner running against an SBOM for a containerized application cannot see them by construction — there's no manifest entry for "the Exchange Server my colleague is running."

This is a structural finding, not a coverage gap that can be closed by adding more sources. The OSS vulnerability databases are exactly what they say they are: vulnerabilities in packages. The CVEs that attackers reach for first are in infrastructure. Those are different categories of software with different distribution mechanisms, and no amount of package-scanner improvement bridges the gap.
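The coverage split reduces to a set-membership check per KEV cluster. The sketch below shows one way to compute it on toy data and then re-derives the published percentages from the published counts. The field names (in_kev, sources) and source labels are assumptions for illustration, not the schema of the pipeline's outputs.

```python
# Hypothetical labels for the ecosystem-level feeds.
ECOSYSTEM_SOURCES = {"osv", "ghsa", "pypa", "rustsec", "go-vulndb"}


def kev_coverage(clusters: list[dict]) -> tuple[int, int]:
    """Return (KEV-listed clusters, KEV-listed clusters with ecosystem coverage)."""
    kev = [c for c in clusters if c["in_kev"]]
    covered = [c for c in kev if ECOSYSTEM_SOURCES & set(c["sources"])]
    return len(kev), len(covered)


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Toy clusters with invented identifiers.
toy_clusters = [
    {"id": "CVE-2099-0001", "in_kev": True, "sources": ["cisa-kev", "nvd"]},
    {"id": "CVE-2099-0002", "in_kev": True, "sources": ["cisa-kev", "ghsa"]},
    {"id": "CVE-2099-0003", "in_kev": False, "sources": ["osv"]},
]
toy_total, toy_covered = kev_coverage(toy_clusters)

# Re-deriving the percentages from the counts reported in the table.
covered_pct = pct(188, 1_592)
uncovered_pct = pct(1_404, 1_592)
reviewed_pct = pct(120, 1_592)
ransomware_pct = pct(321, 1_592)
```

The covered and uncovered counts partition the catalog (188 + 1,404 = 1,592), which is why their percentages sum to 100.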

Full drill-down: output/kev_clusters.json — every KEV-listed cluster with its EPSS percentile, ecosystem coverage status, and ransomware-use flag.

Finding 2: EPSS flags 14,093 high-risk vulns that no package scanner can find

EPSS, FIRST.org's exploit prediction model, scores 326,035 CVEs by predicted probability of exploitation in the next 30 days. Of those, 30,019 sit at the 95th percentile or higher — the model's "imminent exploitation" calls.

Cross-referencing with the reconciled clusters:

| EPSS bucket | Canonical clusters | With ecosystem |
| --- | --- | --- |
| p99+ (top 1%) | 3,340 | small fraction |
| p95-p99 | 13,341 | small fraction |
| p90-p95 | 16,678 | |
| p50-p90 | 133,383 | |
| no EPSS score | 514,092 | |

14,093 clusters are p95+ EPSS, not in KEV (so not yet known to be exploited), and have no ecosystem coverage. These are the model's "this is about to get hit" calls that package-level tooling structurally cannot act on. KEV is the lagging indicator (we know it's being exploited because someone reported it). EPSS p95+ is the leading indicator. Both classes are dominated by the same shape: system software outside the ecosystems.
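The three-way filter behind that count (p95+ EPSS, not in KEV, no ecosystem coverage) can be sketched as below. The Cluster type, its field names, and the sample data are invented for illustration; percentiles are assumed to be stored as fractions between 0 and 1.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cluster:
    cluster_id: str
    epss_percentile: Optional[float]  # 0.0 to 1.0, None when EPSS has no score
    in_kev: bool
    ecosystems: frozenset


def epss_bucket(percentile: Optional[float]) -> str:
    """Map a percentile to the bucket labels used in the table above."""
    if percentile is None:
        return "no EPSS score"
    if percentile >= 0.99:
        return "p99+"
    if percentile >= 0.95:
        return "p95-p99"
    if percentile >= 0.90:
        return "p90-p95"
    if percentile >= 0.50:
        return "p50-p90"
    return "below p50"  # not listed in the table above


def leading_indicator_blind_spot(clusters: list) -> list:
    """Clusters at p95+ EPSS that are not in KEV and have no ecosystem."""
    return [
        c
        for c in clusters
        if c.epss_percentile is not None
        and c.epss_percentile >= 0.95
        and not c.in_kev
        and not c.ecosystems
    ]


# Invented sample data.
sample = [
    Cluster("c1", 0.995, False, frozenset()),
    Cluster("c2", 0.96, True, frozenset()),
    Cluster("c3", 0.97, False, frozenset({"npm"})),
    Cluster("c4", None, False, frozenset()),
    Cluster("c5", 0.92, False, frozenset()),
]
bucket_counts = Counter(epss_bucket(c.epss_percentile) for c in sample)
blind_spot_ids = [c.cluster_id for c in leading_indicator_blind_spot(sample)]
```

Only c1 survives all three conditions: c2 is already in KEV, c3 has ecosystem coverage, c4 has no score, and c5 sits below the p95 cutoff.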

Finding 3: GitHub-reviewed coverage of the full OSS universe is 5.2%

After folding in EPSS, KEV, CVE Project bulk, and 20+ extra OSV ecosystems, the full OSS vulnerability universe expands to 584,148 canonical clusters. Only 30,394 (5.2%) are in the github-reviewed set Dependabot surfaces.

That number went down from 9.1% in the 4/10 run, not because GHSA-reviewed coverage shrank but because the denominator grew faster than the numerator. Roughly the same curated set (about 28k then, 30,394 now) is divided by a much bigger universe. The 9.1% figure from 4/10 was correct for a 312k-vuln denominator; this run's 5.2% is correct for a 584k-vuln denominator. Both are correct snapshots; the latter is the more honest one because its denominator includes the system-software CVEs that actually need addressing.
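The denominator effect is easy to check with the figures quoted above. In this sketch the 4/10 numerator is back-computed from the rounded 9.1% and 312k values, so treat it as approximate rather than a reported count.

```python
def coverage_pct(reviewed: int, universe: int) -> float:
    """Share of the universe that is in the reviewed set, as a percentage."""
    return round(100 * reviewed / universe, 1)


# 5/15 run: counts as reported in this post.
reviewed_0515 = 30_394
universe_0515 = 584_148
pct_0515 = coverage_pct(reviewed_0515, universe_0515)

# 4/10 run: only the rounded percentage and denominator are quoted,
# so the numerator here is an approximation.
universe_0410 = 312_000
reviewed_0410_approx = round(0.091 * universe_0410)

numerator_growth = reviewed_0515 / reviewed_0410_approx
denominator_growth = universe_0515 / universe_0410
```

The reviewed set grows by roughly 7% while the universe grows by roughly 87%, which is what pulls the percentage down.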

Finding 4: ecosystem coverage is dramatically asymmetric

| Ecosystem | Canonical vulns |
| --- | --- |
| npm | 218,646 |
| Debian (4 releases combined) | ~165,000 |
| Ubuntu (10 release pockets combined) | ~210,000 |
| MinimOS | 40,117 |
| Chainguard | 39,363 |
| Wolfi | 20,494 |
| Linux kernel | 17,698 |
| PyPI | 16,604 |
| Maven | 6,565 |
| Bitnami container images | 6,242 |
| Packagist (PHP) | 6,237 |
| Mageia | 5,911 |
| Go modules | 4,030 |
| Android | 3,163 |
| RubyGems | 2,027 |
| NuGet | 1,711 |
| crates.io | 1,575 |

npm has 13× more tracked vulnerabilities than PyPI and 128× more than NuGet. The distro and container-base ecosystems (Ubuntu, Debian, Chainguard, Wolfi, MinimOS) dominate at the volume level, but most of those rows are rebuilds of the same upstream CVE. That's exactly the kind of disagreement-clustering the reconciliation pipeline collapses: a single Linux kernel CVE shows up as distinct rows under Ubuntu, Debian, Wolfi, and Chainguard, which then resolve to one canonical cluster.
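A small sketch of why per-ecosystem row counts overstate the number of distinct vulnerabilities, followed by a check of the two ratios quoted above. The rows and advisory identifiers are invented, and grouping by a single upstream CVE column is a simplification of the alias-graph clustering the pipeline actually performs.

```python
from collections import Counter

# (ecosystem, advisory_id, upstream_cve); all identifiers are invented.
rows = [
    ("Ubuntu", "UBUNTU-CVE-2099-0001", "CVE-2099-0001"),
    ("Debian", "DEBIAN-CVE-2099-0001", "CVE-2099-0001"),
    ("Wolfi", "WOLFI-2099-0001", "CVE-2099-0001"),
    ("Chainguard", "CGA-AAAA-BBBB-CCCC", "CVE-2099-0001"),
    ("npm", "GHSA-AAAA-BBBB-CCCC", "CVE-2099-0002"),
]

# Every ecosystem counts its own row for the same kernel bug.
rows_per_ecosystem = Counter(ecosystem for ecosystem, _, _ in rows)

# After reconciliation the four distro rows collapse into one cluster.
canonical_vulns = {cve for _, _, cve in rows}

# Ratios from the table above.
npm_vs_pypi = round(218_646 / 16_604)
npm_vs_nuget = round(218_646 / 1_711)
```

Five rows collapse to two canonical vulnerabilities here; the same effect at scale is why the distro totals are not comparable to the language-ecosystem totals.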

Finding 5: famous system-level vulns still have zero ecosystem coverage

This finding from the 4/10 post generalizes cleanly to the new run. Some household-name vulns reconcile to affected packages:

| Vuln | Ecosystems | Affected packages |
| --- | --- | --- |
| Log4Shell (CVE-2021-44228) | Maven | 5 log4j-derivative packages |
| Spring4Shell (CVE-2022-22965) | Maven | 5 Spring packages |
| Kubernetes API server privilege escalation (CVE-2018-1002105) | Go | github.com/kubernetes/kubernetes |

Others (Heartbleed, Shellshock, ProxyShell) still exist as passthrough CVE IDs with no ecosystem and no affected packages. OpenSSL, bash, and Exchange Server haven't moved into managed package ecosystems in the intervening month. Finding 1 above is the same observation generalized from a few famous examples to a 1,404-vuln structural blind spot.

What you should actually do with this

If you ship a containerized application and rely on a package scanner for vulnerability detection, the data above says you're seeing about 12% of the CVEs that are actively being exploited in the wild. The other 88% are in your base image, your kernel, your networking stack, and the management appliances in front of your application. Three observations that follow:

  1. A package scanner is necessary but not sufficient. Run it. Update it. But pair it with image-scanning that catches OS-level CVEs and infrastructure inventorying that catches the appliances. The KEV CVE that hits you is more likely to be in one of those layers than in your requirements.txt.
  2. Don't tune your alerting against EPSS alone. EPSS p95+ is a useful signal but 50% of its top bucket is structurally invisible to package tooling. Tuning against EPSS without filtering for ecosystem coverage produces alerts your scanner can't act on. Tune EPSS against the slice your tooling actually sees.
  3. Trust passthrough CVE IDs less. The 4/10 post observed that GHSA-unreviewed is 297k of GHSA's 350k entries (about 85%): most of the database is passthrough NVD with no GitHub team review. The 5/15 numbers are similar. A "found in GHSA" signal where the underlying entry is unreviewed is a different quality of evidence than a curated GHSA-reviewed vuln, and a lot of tooling collapses both into one severity score.
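One way to act on observations 2 and 3 together is to route each finding by what tooling can actually see it and how well-evidenced it is. The sketch below is a hypothetical triage rule, not part of the Golden Suite; the route names, the 0.95 threshold default, and the parameter names are all assumptions.

```python
from typing import Optional


def route_alert(
    epss_percentile: Optional[float],
    in_kev: bool,
    ecosystems: frozenset,
    ghsa_reviewed: bool,
    threshold: float = 0.95,
) -> str:
    """Decide which queue a finding belongs in.

    Alert only on KEV membership or EPSS at or above the threshold, send
    findings with no ecosystem to the image/infrastructure queue, and keep
    reviewed and unreviewed package findings apart.
    """
    high_epss = epss_percentile is not None and epss_percentile >= threshold
    if not in_kev and not high_epss:
        return "no-alert"
    if not ecosystems:
        return "image-or-infra-inventory"
    if ghsa_reviewed:
        return "package-scanner:curated"
    return "package-scanner:unreviewed"


# Invented example findings.
appliance_bug = route_alert(0.99, True, frozenset(), False)
curated_npm_bug = route_alert(0.97, False, frozenset({"npm"}), True)
passthrough_bug = route_alert(0.96, False, frozenset({"PyPI"}), False)
quiet_bug = route_alert(0.40, False, frozenset({"npm"}), True)
```

Keeping the curated and unreviewed routes separate preserves the evidence-quality distinction that a single severity score would flatten.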

What this proves about the ER pipeline

The 4/10 post was the first time I ran this shape of pipeline on a non-blockchain dataset. It worked, but the pipeline was a hand-rolled union-find inside analyze.py — GoldenMatch was conceptual rather than load-bearing.

This run is the load-bearing version. The full Suite runs end-to-end: goldencheck.scan() profiles the 6M-row parquet and surfaces DQ findings before any clustering happens; goldenflow.transform() normalizes vuln_id and aliases; goldenmatch.build_clusters() does the union-find with proper path compression and cluster-quality scoring; goldenpipe.run() orchestrates the whole thing with per-stage status reporting. Total wall-clock on the 64 GB runner: ~5 minutes for 6.1M rows.

For the wallet-attribution companion post (13M rows), the same library now uses its DuckDB backend to run end-to-end in-database instead of materializing clusters in Python. Different scale, different shape, same Suite.

Reproducibility footer

Previous post in this thread: Reconciling 15 OSS Vulnerability Databases (2026-04-10, the smaller-scale snapshot). Companion run on a different domain: Wallet Attribution at Scale. Install GoldenMatch: pip install goldenmatch. Try the playground: bensevern.dev/playground.