2026-04-13/Ben Severn

GoldenCheck Now Runs in TypeScript: Zero-Config Data Validation at the Edge

GoldenCheck's 14 profilers, drift detection, and confidence scoring ship on npm with an edge-safe core. DQBench 88.40, now in your browser, Workers, and Node.js.

Tags: goldencheck, typescript, data-quality, open-source, npm, edge-functions

Most data validation tools start with a blank YAML file and ask you to write rules. How many nulls are acceptable in email? What's the valid range for age? Is status an enum? You don't know yet — you're staring at a CSV you've never seen before.

GoldenCheck flips the model. Hand it a file, and it discovers what's wrong — no rules, no config, no schema. Ten column-level profilers and four cross-column profilers run automatically, each producing findings with severity levels and confidence scores. You review what it found, pin the rules you care about, and export a goldencheck.yml for CI enforcement. Rules come from data, not guesswork.

On Python, GoldenCheck scores 88.40 on DQBench with zero config and no hand-written rules. For context, Pandera scores 32.51 and Great Expectations scores 21.68, both with best-effort hand-written rules. It isn't close.

Today GoldenCheck ships on npm as a full TypeScript port. This post covers what it detects, how the port is structured, and what edge-safe data validation unlocks.

What GoldenCheck Catches

Column-Level Profilers (10 checks)

| Profiler | What It Flags |
| --- | --- |
| Type Inference | String columns that are mostly numeric; numeric columns losing leading zeros |
| Nullability | Required columns (0 nulls), optional columns, entirely-null columns |
| Uniqueness | Primary key candidates, near-unique columns with sneaky duplicates |
| Format Detection | Emails, phones, URLs; flags non-matching values in classified columns |
| Range & Distribution | Min/max violations, outliers beyond 3 standard deviations |
| Cardinality | Low-cardinality enum candidates (<20 unique values) |
| Pattern Consistency | Mixed structural patterns, e.g. `XXX-DDDD` vs `DDDDDDDDDD` in the same column |
| Encoding Detection | Mojibake, zero-width Unicode, smart quotes, control characters |
| Sequence Detection | Gaps in auto-increment IDs, broken integer sequences |
| Drift Detection | Temporal distribution shifts within a single column |
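To make the Pattern Consistency row concrete, here is a minimal sketch of how a structural signature such as `XXX-DDDD` could be derived and used to find values that break the dominant pattern. The helper names (`patternSignature`, `minorityPatternValues`) are hypothetical and are not part of GoldenCheck's API; the real profiler may work differently.

```typescript
// Hypothetical helper, not GoldenCheck's API.
// Collapse a value into a structural signature:
// letters become "X", digits become "D", everything else is kept as-is.
function patternSignature(value: string): string {
  return value.replace(/[A-Za-z0-9]/g, (ch) =>
    ch >= "0" && ch <= "9" ? "D" : "X"
  );
}

// Return the values whose signature differs from the most common signature.
function minorityPatternValues(values: string[]): string[] {
  const counts: Record<string, number> = {};
  for (const v of values) {
    const sig = patternSignature(v);
    counts[sig] = (counts[sig] || 0) + 1;
  }

  // Find the dominant signature in the column.
  let dominant = "";
  let best = -1;
  for (const sig of Object.keys(counts)) {
    if (counts[sig] > best) {
      best = counts[sig];
      dominant = sig;
    }
  }

  return values.filter((v) => patternSignature(v) !== dominant);
}

const codes = ["ABC-1234", "XYZ-9876", "0123456789", "QRS-0001"];
const oddOnes = minorityPatternValues(codes);
console.log(oddOnes);
```

Here three values share the signature `XXX-DDDD` and one has `DDDDDDDDDD`, so only the all-digit value is reported.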

Cross-Column Profilers (4 checks)

| Profiler | Example |
| --- | --- |
| Temporal Order | `signup_date` after `last_login` |
| Null Correlation | `address`, `city`, and `zip` always null together |
| Numeric Cross-Column | `claim_amount` exceeds `policy_max` |
| Age vs DOB | `age = 30` but `date_of_birth = 1980-01-01` |
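As an illustration of what a cross-column check involves, the sketch below implements an Age vs DOB comparison. It assumes a fixed reference date and a one-year tolerance; both are assumptions made for this example, and the function names (`ageOn`, `ageDobMismatches`) are hypothetical rather than GoldenCheck's own.

```typescript
interface PersonRow {
  age: number;
  date_of_birth: string; // ISO date, e.g. "1980-01-01"
}

// Whole years between dob and asOf, using UTC fields so the result
// does not depend on the local time zone.
function ageOn(dob: Date, asOf: Date): number {
  let age = asOf.getUTCFullYear() - dob.getUTCFullYear();
  const beforeBirthday =
    asOf.getUTCMonth() < dob.getUTCMonth() ||
    (asOf.getUTCMonth() === dob.getUTCMonth() &&
      asOf.getUTCDate() < dob.getUTCDate());
  if (beforeBirthday) age -= 1;
  return age;
}

// Return the indexes of rows whose stated age disagrees with the age
// implied by date_of_birth by more than `tolerance` years (assumed: 1).
function ageDobMismatches(
  rows: PersonRow[],
  asOf: Date,
  tolerance: number = 1
): number[] {
  const flagged: number[] = [];
  rows.forEach((row, index) => {
    const expected = ageOn(new Date(row.date_of_birth), asOf);
    if (Math.abs(row.age - expected) > tolerance) flagged.push(index);
  });
  return flagged;
}

const referenceDate = new Date("2026-04-13");
const people: PersonRow[] = [
  { age: 30, date_of_birth: "1980-01-01" }, // implied age is 46
  { age: 46, date_of_birth: "1980-01-01" },
  { age: 25, date_of_birth: "2000-12-31" }, // birthday not yet reached
];
const mismatches = ageDobMismatches(people, referenceDate);
console.log(mismatches);
```

The tolerance exists because an `age` column is often a snapshot taken some time before the scan, so an off-by-one difference is usually not an error.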

Every finding gets a confidence score from 0.0 to 1.0. If two profilers flag the same column, the confidence gets a corroboration boost. Low-confidence findings are automatically demoted to INFO severity — no noise in your error list.
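The sketch below shows one way the corroboration boost and INFO demotion could fit together. The boost size (0.1), the demotion threshold (0.5), and the `Finding` shape are all assumptions chosen for illustration; the post does not state GoldenCheck's actual values.

```typescript
type Severity = "ERROR" | "WARNING" | "INFO";

interface Finding {
  column: string;
  profiler: string;
  severity: Severity;
  confidence: number; // 0.0 to 1.0
}

// Both constants are illustrative assumptions, not GoldenCheck's real values.
const CORROBORATION_BOOST = 0.1;
const INFO_THRESHOLD = 0.5;

function applyConfidenceRules(findings: Finding[]): Finding[] {
  // Collect the distinct profilers that flagged each column.
  const profilersByColumn: Record<string, string[]> = {};
  for (const f of findings) {
    const seen =
      profilersByColumn[f.column] || (profilersByColumn[f.column] = []);
    if (seen.indexOf(f.profiler) === -1) seen.push(f.profiler);
  }

  return findings.map((f) => {
    // Boost when at least two different profilers agree on the column.
    const corroborated = profilersByColumn[f.column].length >= 2;
    const confidence = corroborated
      ? Math.min(1, f.confidence + CORROBORATION_BOOST)
      : f.confidence;
    // Demote anything still below the threshold to INFO.
    const severity: Severity =
      confidence < INFO_THRESHOLD ? "INFO" : f.severity;
    return { column: f.column, profiler: f.profiler, severity, confidence };
  });
}

const scored = applyConfidenceRules([
  { column: "email", profiler: "format", severity: "WARNING", confidence: 0.45 },
  { column: "email", profiler: "nullability", severity: "WARNING", confidence: 0.6 },
  { column: "age", profiler: "range", severity: "ERROR", confidence: 0.3 },
]);
console.log(scored);
```

In this example the two `email` findings corroborate each other and stay at WARNING, while the lone low-confidence `age` finding is demoted to INFO.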

The TypeScript Port

Edge-safe core, Node extras separate

The package ships with two entry points:

goldencheck/core — all 14 profilers, baseline, drift detection, confidence scoring, semantic classification. Zero Node.js APIs. Runs in browsers, Cloudflare Workers, Vercel Edge Functions.
goldencheck/node — file I/O (CSV, Parquet, Excel), CLI, MCP server. Requires Node.js.

// Edge-safe — works in browsers and Workers
import { scanData, TabularData } from "goldencheck/core";

const data = new TabularData([
  { email: "alice@example.com", age: 30, status: "active" },
  { email: "bad@",              age: -5, status: "active" },
  { email: null,                age: 25, status: "pending" },
  { email: "bob@company.org",   age: 200, status: "unknown" },
]);

const { findings, profile } = scanData(data);

for (const f of findings) {
  console.log(`[${f.severity}] ${f.column}: ${f.message}`);
}
// [ERROR]   email: 1 value does not match email format (row 2: "bad@")
// [WARNING] age: value -5 is below inferred minimum 0
// [WARNING] age: value 200 exceeds 3 standard deviations from the mean
// [WARNING] status: "unknown" not in inferred enum [active, pending]

No rules written. No config file. Four findings from four lines of data.

Node.js file scanning

import { readFile, scanData } from "goldencheck/node";

const data = readFile("customers.csv");
const { findings } = scanData(data);

const errors = findings.filter((f) => f.severity === "ERROR");
if (errors.length > 0) {
  console.error(`${errors.length} errors found`);
  process.exit(1);
}

Baseline and drift detection

GoldenCheck's baseline system captures what "healthy" data looks like using six independent techniques — statistical profiles, constraint mining, semantic types, correlations, pattern grammars, and confidence priors. Once you have a baseline, every subsequent scan detects drift across 13 dimensions.

import { createBaseline, serializeBaseline, readFile } from "goldencheck/node";
import { runDriftChecks, deserializeBaseline } from "goldencheck";
import { readFileSync, writeFileSync } from "node:fs";

// Learn from your reference data
const baseline = createBaseline(readFile("reference_data.csv"));
writeFileSync("baseline.json", serializeBaseline(baseline));

// Later: check production data for drift
const saved = deserializeBaseline(readFileSync("baseline.json", "utf-8"));
const driftFindings = runDriftChecks(readFile("production_export.csv"), saved);

for (const f of driftFindings) {
  console.log(`${f.check}: ${f.message}`);
}
// distribution_drift: age distribution has shifted (KS p=0.002)
// new_pattern: phone column has new pattern "DDDDDDDDDD" (was "DDD-DDD-DDDD")
// bound_violation: salary max 450000 exceeds historical max 280000

The 13 drift check types include distribution shift (KS-test), entropy drift, bound violations, Benford's Law deviation, functional dependency breaks, key uniqueness loss, temporal order drift, type drift, correlation breaks, new correlations, pattern drift, and new patterns.
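For readers unfamiliar with the KS-test, the following is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between two empirical distribution functions. It is a textbook version written for this post, not GoldenCheck's implementation, and the function names are hypothetical. It assumes both samples are non-empty and uses the standard large-sample critical value rather than an exact p-value.

```typescript
// Two-sample KS statistic: the maximum absolute difference between
// the empirical CDFs of samples a and b. Assumes both are non-empty.
function ksStatistic(a: number[], b: number[]): number {
  const x = a.slice().sort((p, q) => p - q);
  const y = b.slice().sort((p, q) => p - q);
  let i = 0;
  let j = 0;
  let d = 0;
  while (i < x.length && j < y.length) {
    const v = Math.min(x[i], y[j]);
    // Advance both samples past every value <= v so ties are handled together.
    while (i < x.length && x[i] <= v) i++;
    while (j < y.length && y[j] <= v) j++;
    d = Math.max(d, Math.abs(i / x.length - j / y.length));
  }
  return d;
}

// Large-sample critical value: c(alpha) * sqrt((n + m) / (n * m)),
// where c(alpha) = sqrt(-ln(alpha / 2) / 2).
function ksCriticalValue(n: number, m: number, alpha: number): number {
  const c = Math.sqrt(-Math.log(alpha / 2) / 2);
  return c * Math.sqrt((n + m) / (n * m));
}

// A shift is reported when the statistic exceeds the critical value.
function hasDistributionShift(a: number[], b: number[], alpha: number): boolean {
  return ksStatistic(a, b) > ksCriticalValue(a.length, b.length, alpha);
}

const overlap = ksStatistic([1, 2, 3, 4], [3, 4, 5, 6]);
console.log(overlap);
```

The critical-value approximation is only reliable for reasonably large samples, so very small reference or production sets should be treated with caution.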

Domain dictionaries

Domain packs add industry-specific semantic type detection:

import { scanData, readFile } from "goldencheck/node";

const { findings } = scanData(readFile("claims.csv"), { domain: "healthcare" });
// Now recognizes NPI, ICD codes, CPT codes, DRG, clinical notes, patient types

Three built-in domains: healthcare, finance, and ecommerce. Each adds 8-15 additional semantic types that the base classifier doesn't cover.
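As an example of what a domain-specific semantic type involves, here is a sketch of an NPI check. A US National Provider Identifier is ten digits whose last digit is a Luhn check digit computed over the number with the prefix `80840` prepended. The function names are hypothetical, and this shows only the public check-digit rule, not how GoldenCheck's healthcare pack classifies columns.

```typescript
// Standard Luhn checksum over a string of ASCII digits.
function luhnValid(digits: string): boolean {
  let sum = 0;
  let doubleNext = false; // the rightmost digit is not doubled
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48; // "0" has char code 48
    if (doubleNext) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleNext = !doubleNext;
  }
  return sum % 10 === 0;
}

// Hypothetical helper: true when the value is ten digits and passes
// the Luhn check with the "80840" prefix used for NPIs.
function looksLikeNpi(value: string): boolean {
  if (!/^\d{10}$/.test(value)) return false;
  return luhnValid("80840" + value);
}

console.log(looksLikeNpi("1234567893")); // passes the check digit rule
console.log(looksLikeNpi("1234567890")); // wrong check digit
```

A checksum like this is what lets a classifier separate real identifiers from arbitrary ten-digit numbers such as phone numbers.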

LLM boost

Optionally, for roughly $0.01 per scan, GoldenCheck sends representative data samples to Claude or GPT-4o-mini for semantic analysis. The LLM catches things profilers can't, like 12345 in a name column, or an email field that should be required but has 2% nulls.

import { scanData, callLlm, parseLlmResponse, mergeLlmFindings, buildSampleBlocks } from "goldencheck";

const result = scanData(data, { returnSample: true });
const blocks = buildSampleBlocks(result.sample, result.findings);
const { text } = await callLlm("anthropic", JSON.stringify(blocks));
const llmResponse = parseLlmResponse(text);
if (llmResponse) {
  const enhanced = mergeLlmFindings(result.findings, llmResponse);
}

LLM findings merge into the existing confidence system — they don't replace profiler results, they augment them. Budget control via GOLDENCHECK_LLM_BUDGET keeps costs predictable.

Why This Architecture Matters

Upload-time validation

When a user uploads a CSV to your app, you can validate it before it reaches your backend. Run scanData in an edge function, return findings to the client, and let the user fix issues before the data enters your pipeline.

// Next.js Edge API route
import { scanData, TabularData } from "goldencheck/core";

export const runtime = "edge";

export async function POST(req: Request) {
  const { rows } = await req.json();
  const { findings } = scanData(new TabularData(rows));

  const errors = findings.filter((f) => f.severity === "ERROR");
  return Response.json({
    valid: errors.length === 0,
    findings: findings.map((f) => ({
      column: f.column,
      severity: f.severity,
      message: f.message,
      confidence: f.confidence,
    })),
  });
}

No Python service to cold-start. No container to manage. Validation runs at the edge in milliseconds.

CI/CD data gates

Combine zero-config discovery with config-based enforcement:

# First time: discover issues and save rules
npx goldencheck data.csv --json > findings.json
# Review, edit goldencheck.yml

# CI: enforce rules on every push
npx goldencheck validate data.csv --config goldencheck.yml
# Exit code 1 if any ERROR-level findings

Browser-based data previews

The zero-dependency core means you can run all 14 profilers client-side. Upload a CSV in the browser, scan it locally, and show the user a health score and findings table — no network round-trip, no data leaving the client's machine.

DQBench: How It Compares

DQBench tests three tiers of data quality issues with increasing difficulty. GoldenCheck runs zero-config — no rules written. Competitors use their best-effort hand-written rules.

| Tool | Mode | T1 F1 | T2 F1 | T3 F1 | DQBench Score |
| --- | --- | --- | --- | --- | --- |
| GoldenCheck | zero-config | 94.1% | 90.9% | 83.0% | 88.40 |
| Pandera | best-effort rules | 36.4% | 38.1% | 25.0% | 32.51 |
| Soda Core | best-effort rules | 38.1% | 23.5% | 13.3% | 22.36 |
| Great Expectations | best-effort rules | 36.4% | 23.5% | 12.5% | 21.68 |

The gap is largest on Tier 3 — adversarial issues like mixed encodings, broken sequences, and correlated nulls. These are the issues that rule-based tools miss because nobody writes rules for problems they don't know exist.
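The post does not say how the three tier F1 values combine into the overall score. The published numbers are consistent with a weighting of 20% for Tier 1 and 40% each for Tiers 2 and 3; that weighting is an inference from the table, not a documented formula, and the sketch below only checks it against the reported scores.

```typescript
interface TierF1 {
  t1: number;
  t2: number;
  t3: number;
}

// Assumed weights, inferred from the published table rather than
// quoted from DQBench documentation.
const WEIGHTS: TierF1 = { t1: 0.2, t2: 0.4, t3: 0.4 };

function weightedScore(f1: TierF1): number {
  return f1.t1 * WEIGHTS.t1 + f1.t2 * WEIGHTS.t2 + f1.t3 * WEIGHTS.t3;
}

// Tier F1 values and scores as reported in the comparison table.
const published = [
  { tool: "GoldenCheck", f1: { t1: 94.1, t2: 90.9, t3: 83.0 }, score: 88.4 },
  { tool: "Pandera", f1: { t1: 36.4, t2: 38.1, t3: 25.0 }, score: 32.51 },
  { tool: "Soda Core", f1: { t1: 38.1, t2: 23.5, t3: 13.3 }, score: 22.36 },
  { tool: "Great Expectations", f1: { t1: 36.4, t2: 23.5, t3: 12.5 }, score: 21.68 },
];

for (const row of published) {
  const estimate = weightedScore(row.f1);
  console.log(row.tool, estimate.toFixed(2), "reported:", row.score.toFixed(2));
}
```

The estimates land within a few hundredths of the reported scores; the small differences are what you would expect if the F1 values in the table were rounded to one decimal place.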

What's Ported, What's Not

Fully ported: all 14 profilers, semantic classifier with 3 domain packs, baseline system (6 techniques), drift detection (13 checks), LLM boost, confidence scoring, auto-fix, config validation, health scoring, MCP server, JSON/CI reporters.

Not yet ported: the interactive TUI (Textual in Python, would need Ink or a React-based equivalent), database scanning, REST API server, watch mode, scheduled runs. These are Node-only features that will come in future releases.

The core scanning engine — the part that actually finds issues — is at full parity.

Install and Try It

npm install goldencheck

Scan a file:

npx goldencheck data.csv --no-tui

Or in code:

import { scanData, TabularData } from "goldencheck/core";

const { findings } = scanData(new TabularData(yourData));

The GitHub repo has the full benchmark suite, 296 Python tests, and the DQBench comparison. The Python version is still on PyPI (pip install goldencheck); both are actively maintained.

Key Takeaways

GoldenCheck discovers data quality issues from your data — no rules to write upfront
The TypeScript core has zero runtime dependencies and runs in Edge Functions, Workers, and browsers
14 profilers catch type mismatches, encoding issues, pattern inconsistencies, range violations, correlated nulls, temporal order breaks, and more
Baseline + drift detection (13 check types) lets you monitor data quality over time with statistical rigor
DQBench 88.40 zero-config vs. Great Expectations 21.68 with hand-written rules

Data validation shouldn't require you to know every possible issue before you scan. GoldenCheck finds the issues first, then lets you decide which ones matter. Now it does that in TypeScript too.
