groundcheck

The open, domain-aware reference verification standard.

Because a Reuters article and a Nature paper need different verification criteria.

Why this exists · Quick start · How it works · Domain scoring · API · Prior art · Contributing

The Problem

Most citation verification systems apply a single fixed formula to every source:

score = doi × 0.40 + title_search × 0.30 + url × 0.10 + ai × 0.20

This is broken for anything that isn't an academic paper. DOI and academic title search are irrelevant for a New York Times article, which means a live, credible Reuters story scores at most 0.23 against a 0.65 threshold and is always marked as removed.

News sources silently end up with zero references.

The Fix

Domain-aware scoring. Each source is classified into one of five domains first, then scored by the layers and weights appropriate for that domain.

ACADEMIC   →  doi(0.45) + title_search(0.30) + url(0.10) + ai(0.15)  ≥ 0.70
NEWS       →  url(0.35) + ai(0.65)                                    ≥ 0.50
GOVERNMENT →  url(0.40) + ai(0.60)                                    ≥ 0.55
EDUCATIONAL→  url(0.30) + title_search(0.10) + ai(0.60)              ≥ 0.50
GENERAL    →  url(0.30) + title_search(0.10) + ai(0.60)              ≥ 0.55

Concrete result: A live NYT article:

Old (fixed weights): 0.10×0.6 + 0.20×0.85 = 0.23 → REMOVED ❌
New (domain-aware NEWS): 0.35×0.6 + 0.65×0.85 = 0.76 → VERIFIED ✅

Paywalled article (403 response): AI alone scores 0.65 × 0.85 = 0.5525 > 0.50 → VERIFIED ✅

Quick Start

npm install groundcheck

import {
  classifyReference,
  computeDomainAwareScore,
  DOMAIN_CONFIGS,
} from 'groundcheck';

// Step 1: classify the reference
const domain = classifyReference({
  doi: null,
  url: 'https://www.nytimes.com/2024/01/climate.html',
  type: 'ARTICLE',
});
// → 'NEWS'

// Step 2: run your verification layers (URL check, AI eval, etc.)
const layerResults = [
  { layerId: 'url', passed: true, confidence: 0.6 },
  { layerId: 'ai',  passed: true, confidence: 0.85 },
];

// Step 3: compute domain-aware score
const { score, verdict } = computeDomainAwareScore(domain, layerResults);
// → { score: 0.7625, verdict: 'VERIFIED' }

// Optional: access domain config (AI instructions, URL patterns, etc.)
const config = DOMAIN_CONFIGS[domain];
console.log(config.aiInstruction);
// → "Verify this is from a credible news outlet..."

v2 (Bayesian): Use computeBayesianScore for a probabilistic posterior with per-layer explainability. See the API reference for computeBayesianScore.

How Bayesian Scoring Works (Plain English)

Not a stats person? Here is the intuition behind computeBayesianScore.

The core idea

Start with a gut feeling, a starting probability, then update it with evidence. Each verification check nudges your confidence up or down. The result is a single probability (for example "81% chance this reference is real"), not a weighted percentage.

Starting confidence (the prior)

Each domain starts with a different base probability before any checks run. These reflect how often AI-generated content hallucinates references in that domain:

Domain	Prior	Why
GOVERNMENT	82%	Official government sources are rarely fabricated
NEWS	75%	Established outlets are usually real; moderate hallucination risk
ACADEMIC	72%	Papers are generally genuine; fabrication exists but is less common
GENERAL	45%	Anonymous web content has high hallucination risk, so it gets a lower starting confidence

How each check updates your confidence

Every verification layer has two diagnostic properties:

Property	Plain English	What it means
Sensitivity	Hit rate	How often does this check pass for a real reference? High means it rarely misses real refs
Specificity	Fake-catcher rate	How often does this check fail for a fake reference? High means it rarely lets fakes through

A layer with high sensitivity AND high specificity is highly informative. For NEWS, the AI layer (sensitivity 0.82, specificity 0.80) carries far more signal than the URL check (sensitivity 0.55, specificity 0.85), because news articles are commonly paywalled, so a failed URL is weak evidence of fakeness.

Evidence accumulates

The algorithm keeps a running tally in log-odds, a representation where you can simply add and subtract evidence instead of multiplying probabilities. At the end, it converts back to a normal probability from 0% to 100%.

Example, a paywalled NYT article (NEWS domain):

Step	Evidence	Running probability
Prior	NEWS domain, moderate hallucination risk	75%
URL 403 (confidence = 0)	Paywalled; credible outlets often return 403	~61%
AI confirms credible outlet (confidence = 0.85)	Strong positive signal	~81%
Verdict	81% ≥ 65% Bayesian threshold	✅ VERIFIED

A broken URL from a known outlet barely disqualifies the reference. Strong AI confirmation brings the probability to 81%, which clears the 65% threshold for NEWS.

Why not just use v1 (weighted sum)?

v1 is simpler and faster. v2 adds three things:

A domain-calibrated starting estimate. The prior accounts for base rates of hallucination by content type.
Principled evidence combination. Bayes' theorem handles asymmetric layers gracefully (a weak layer barely moves the posterior; a strong layer moves it a lot).
Per-layer explainability. logOddsContributions shows exactly which check helped and which hurt, making failures debuggable.

For most references, v1 and v2 agree. The difference shows up in edge cases: a paywalled article from a credible outlet, or a reference with strong AI support but a broken URL.

→ See the full API docs for computeBayesianScore

Domain Scoring

Column guide: LR+ = sensitivity / (1 - specificity); LR- = (1 - sensitivity) / specificity. Higher LR+ means a confident pass is stronger evidence of a real reference; lower LR- means a confident fail is stronger evidence of a fake.

ACADEMIC

Peer-reviewed papers, preprints, books, technical reports

v1 threshold: ≥ 0.70 | v2 prior: 0.72 | v2 bayesianThreshold: ≥ 0.82

Layer	v1 Weight	Sensitivity	Specificity	LR+	LR-
`doi`	0.45	0.92	0.97	30.67	0.08
`title_search`	0.30	0.80	0.88	6.67	0.23
`url`	0.10	0.70	0.72	2.50	0.42
`ai`	0.15	0.78	0.82	4.33	0.27

Classified by: DOI present, arXiv/PubMed/Nature/IEEE URL, PAPER/BOOK type

NEWS

Established news outlets (NYT, Reuters, BBC, AP, Guardian, Bloomberg…)

v1 threshold: ≥ 0.50 | v2 prior: 0.75 | v2 bayesianThreshold: ≥ 0.65

Layer	v1 Weight	Sensitivity	Specificity	LR+	LR-
`url`	0.35	0.55	0.85	3.67	0.53
`ai`	0.65	0.82	0.80	4.10	0.23

Classified by: Reuters/NYT/BBC/AP/Guardian/Bloomberg/FT URL pattern, ARTICLE type · Lower v1 threshold because credible outlets often return 403/paywall

Paywall math: 0.65 × 0.85 = 0.5525 > 0.50. A credible outlet passes via AI even with a dead URL.

GOVERNMENT

Official government reports, legislation, statistics

v1 threshold: ≥ 0.55 | v2 prior: 0.82 | v2 bayesianThreshold: ≥ 0.72

Layer	v1 Weight	Sensitivity	Specificity	LR+	LR-
`url`	0.40	0.85	0.93	12.14	0.16
`ai`	0.60	0.80	0.84	5.00	0.24

Classified by: .gov, who.int, un.org, worldbank.org, oecd.org URL patterns

GENERAL

Wikipedia, blogs, videos, podcasts, and other web content

v1 threshold: ≥ 0.55 | v2 prior: 0.45 | v2 bayesianThreshold: ≥ 0.68

Layer	v1 Weight	Sensitivity	Specificity	LR+	LR-
`url`	0.30	0.65	0.70	2.17	0.50
`title_search`	0.10	0.30	0.75	1.20	0.93
`ai`	0.60	0.72	0.78	3.27	0.36

Classified by: Catch-all for anything not classified above

Classification Logic

classifyReference follows a strict priority order:

flowchart TD
    Start["Reference { doi, url, type }"] --> Q1{"DOI present?"}
    Q1 -->|yes| ACAD["ACADEMIC"]
    Q1 -->|no| Q2{"URL matches ACADEMIC patterns?"}
    Q2 -->|yes| ACAD
    Q2 -->|no| Q3{"URL matches NEWS patterns?"}
    Q3 -->|yes| NEWS["NEWS"]
    Q3 -->|no| Q4{"URL matches GOVERNMENT patterns?"}
    Q4 -->|yes| GOV["GOVERNMENT"]
    Q4 -->|no| Q5{"Type matches ACADEMIC types?"}
    Q5 -->|yes| ACAD
    Q5 -->|no| Q6{"ARTICLE type with matching URL?"}
    Q6 -->|yes| NEWS
    Q6 -->|no| GEN["GENERAL (fallback)"]

API Reference

`classifyReference(ref)`

Classify a reference into a content domain.

function classifyReference(ref: {
  doi?: string | null;
  url?: string | null;
  type?: string | null;
}): ContentDomain

`computeDomainAwareScore(domain, layerResults)` v1

Compute a weighted-sum score for a given domain.

function computeDomainAwareScore(
  domain: ContentDomain,
  layerResults: LayerResult[]
): { score: number; verdict: 'VERIFIED' | 'FAILED' }

score is between 0 and 1. verdict is 'VERIFIED' if score >= domain.threshold, 'FAILED' otherwise.

Layer results for layers not applicable to the domain are ignored.

`computeBayesianScore(domain, layerResults)` v2

Compute a Bayesian posterior probability using log-odds updating.

function computeBayesianScore(
  domain: ContentDomain,
  layerResults: LayerResult[]
): {
  posterior: number;              // P(reference is real given evidence), 0.0 to 1.0
  verdict: 'VERIFIED' | 'FAILED'; // posterior >= domain.bayesianThreshold
  logOddsContributions: Record<string, number>; // per-layer Δ log-odds (for transparency)
}

Algorithm:

prior_log_odds = ln(prior / (1 - prior))

For each applicable layer with confidence c ∈ [0, 1]:
  LR+ = sensitivity / (1 - specificity)   (how much a pass shifts toward "real")
  LR- = (1 - sensitivity) / specificity   (how much a fail shifts toward "fake")
  Δ   = c × ln(LR+) + (1-c) × ln(LR-)

posterior = sigmoid(prior_log_odds + Σ Δ)

Absent layers default to c = 0.5 (minimally informative). logOddsContributions exposes each layer's Δ for debugging and explainability.

Example:

const { posterior, verdict, logOddsContributions } = computeBayesianScore('NEWS', [
  { layerId: 'url', passed: false, confidence: 0 },  // 403 paywall
  { layerId: 'ai',  passed: true,  confidence: 0.85 },
]);
// posterior ≈ 0.81, verdict: 'VERIFIED'
// logOddsContributions: { url: -0.64, ai: +0.98 }

`DOMAIN_CONFIGS`

const DOMAIN_CONFIGS: Record<ContentDomain, DomainConfig>

Full domain configuration map. Each DomainConfig includes:

interface DomainConfig {
  domain: ContentDomain;
  label: string;                 // 'Academic' | 'News' | 'Government' | 'General'
  description: string;
  layers: BayesianLayerConfig[]; // applicable layers with weights + Bayesian params
  threshold: number;             // v1: minimum weighted score to VERIFY
  prior: number;                 // v2: P(reference is real given domain)
  bayesianThreshold: number;     // v2: minimum posterior probability to VERIFY
  aiInstruction: string;         // injected into AI evaluator prompt
  urlPatterns?: RegExp[];        // URL patterns for classification
  typePatterns?: string[];       // ReferenceType values for classification
}

interface BayesianLayerConfig extends LayerConfig {
  bayesian: {
    sensitivity: number; // P(pass given real), 0.0 to 1.0
    specificity: number; // P(fail given fake), 0.0 to 1.0
  };
}

Types

type ContentDomain = 'ACADEMIC' | 'NEWS' | 'GOVERNMENT' | 'EDUCATIONAL' | 'GENERAL';

type LayerId = 'doi' | 'title_search' | 'url' | 'ai';

interface LayerResult {
  layerId: LayerId;
  passed: boolean;
  confidence: number; // 0.0 to 1.0
}

interface LayerConfig {
  id: LayerId;
  weight: number;      // normalized weight, all layers in a domain sum to 1.0
  description: string;
}

Architecture

flowchart TD
    Input["Reference Input<br/>{ doi, url, type }"] --> Classify["classifyReference()"]
    Classify --> Domain["ContentDomain<br/>ACADEMIC / NEWS / GOVERNMENT / EDUCATIONAL / GENERAL"]
    Domain --> URL["Layer: URL<br/>(HEAD check)"]
    Domain --> AI["Layer: AI<br/>(LLM eval)"]
    Domain --> DOI["Layer: DOI / title_search"]
    URL --> Results["LayerResult[]<br/>{ layerId, passed, confidence }"]
    AI --> Results
    DOI --> Results
    Results --> V1["v1: weighted sum<br/>sum of weight × confidence, then score ≥ threshold"]
    Results --> V2["v2: Bayesian log-odds<br/>prior, per-layer update, then posterior ≥ threshold"]
    V1 --> Out1["Output<br/>{ score, verdict }"]
    V2 --> Out2["Output<br/>{ posterior, verdict, logOddsContributions }"]

src/
├── types.ts      ContentDomain, LayerId, LayerResult, DomainConfig, BayesianLayerConfig
├── domains.ts    DOMAIN_CONFIGS (the standard itself, including Bayesian params)
├── classify.ts   classifyReference()
├── score.ts      computeDomainAwareScore() [v1: weighted sum]
├── bayesian.ts   computeBayesianScore()    [v2: log-odds updating]
└── index.ts      public exports

The standard has zero runtime dependencies. Pure TypeScript that works in any JS environment.

Where this is used

This standard is application-agnostic. Any tool that cites web sources, including RAG pipelines, research assistants, search and answer engines, and content generators, can use it to verify references and attach a domain-aware trust badge (Academic, News, Government, Educational, or General) to every citation.

It is maintained as a standalone, dependency-free package by Andres Romero. Sotto is one consumer, vendoring it as a submodule so every reference it surfaces is scored by the logic here. When the standard improves via community PRs, any consumer benefits by updating its dependency.

Political Spectrum & Source Bias

The Problem

AI-generated content can inadvertently reflect a single political perspective when the source material fed into generation is ideologically one-sided. Output built entirely from sources rated "Left" by media-bias researchers will skew its framing, word choice, and which facts it emphasises, even if every cited reference passes verification.

Concrete example: A generated explainer on immigration policy sourced exclusively from outlets rated Left-Center produces accurate but one-sided content. Every URL resolves (✅ VERIFIED), yet a reader expecting balanced treatment is misled. Reference verification alone cannot catch this; it is orthogonal to the question of ideological balance.

Approach

Static media-bias lookup at content-extraction time, not at verification time.

The lookup runs once per source URL when content is first extracted, before generation runs. It annotates the extraction context with bias metadata. The generation prompt then receives conditional guidance, only when the topic is political, to seek balance or flag one-sidedness to the user.

This keeps bias detection cleanly separated from reference verification: the verification standard scores whether a reference is real; bias metadata informs whether the generation prompt should seek additional perspective.

How It Works

flowchart TD
    URLs["Source URLs<br/>(from content extraction)"] --> Extract["Domain extraction<br/>(strip protocol, path, query)"]
    Extract --> Lookup["MBFC dataset lookup<br/>{ bias, credibility, country }"]
    Lookup --> Detect{"Political topic?"}
    Detect -->|yes| Inject["Inject bias guidance<br/>into the generation prompt"]
    Detect -->|no| Skip["No bias guidance injected"]

Bias categories surfaced per source:

Value	Meaning
`left`	Far-left leaning
`left-center`	Center-left leaning
`center`	Least-biased / centrist
`right-center`	Center-right leaning
`right`	Far-right leaning
`conspiracy-pseudoscience`	Promotes conspiracy theories or pseudoscience
`satire`	Satire, content should not be treated as factual
`fake-news`	Known misinformation outlet

When all detected sources share the same non-center rating and the topic is political, the generation prompt is augmented with guidance to note the ideological lean to the reader and, where possible, incorporate contrasting framing.

Data Source

Dataset: drmikecrowe/mbfcext, a community-maintained mirror of Media Bias / Fact Check (MBFC) ratings, licensed MIT.

Size: 9,773 sources (as of dataset release)
Update cadence: Auto-updated daily from MBFC ratings via the upstream repository's CI
Fields used: domain, bias, credibility, country

No network call is made at generation time. The dataset is bundled as a static JSON lookup.

What This Does NOT Do

Does not reject sources. A source rated right or left is not excluded from the output. The verification standard continues to assess whether the reference is real.
Does not editorialize. The system does not label content "biased" to the end user unprompted. Guidance is injected into the generation prompt, not the generated output.
Does not apply to non-political topics. Technology tutorials, science explainers, and cooking guides all suppress bias guidance entirely when political topic detection returns negative.

Limitations & Transparency

Limitation	Detail
One framework among several	MBFC is widely cited but not the only media bias rating system. AllSides and Ad Fontes Media use different methodologies and sometimes reach different conclusions for the same outlet.
US-centric dataset	MBFC coverage is strongest for US English-language media. Non-US sources are rated but coverage is uneven; many regional outlets are absent from the dataset entirely.
Source-level ≠ article-level	A center-rated outlet can publish a one-sided op-ed. A left-rated outlet can publish a balanced investigative piece. The lookup reflects outlet-level ratings, not individual article analysis.
Static snapshot	The bundled dataset reflects ratings at a point in time. Outlets change ownership and editorial stance; the dataset may lag real-world shifts by weeks or months.
No confidence score	MBFC ratings are categorical, not probabilistic. The dataset does not expose reviewer agreement or confidence, so every rating is treated with equal weight regardless of how contested it may be.

Community contributions that extend coverage to non-US sources, integrate a second bias framework, or add article-level analysis are welcome. See CONTRIBUTING.md.

Prior Art & Theoretical Basis

The Bayesian log-odds algorithm is a well-established pattern in statistics and evidence-based medicine. Each domain layer's sensitivity/specificity pair functions as its likelihood ratio, and evidence accumulates additively in log space, a formulation due to Good (1950). The domain priors reflect empirical base rates of hallucinated references by content type, motivated by factuality benchmarks showing that hallucination rates differ significantly across content domains.

Methodology

Citation	Relevance
Good, I.J. (1950). Probability and the Weighing of Evidence. Charles Griffin.	Formalizes `initial_log_odds + weight_of_evidence = final_log_odds` where weight of evidence = log(likelihood ratio), the core formula used here
Wald, A. (1947). Sequential Analysis. Wiley.	Foundation of sequential likelihood-ratio testing; the theoretical precursor to log-odds evidence accumulation
Fagan, T.J. (1975). "Nomogram for Bayes's theorem." New England Journal of Medicine, 293, 257.	Graphical tool for applying Bayes' theorem via likelihood ratios to convert pre-test to post-test probability
Jaeschke, R., Guyatt, G.H., & Sackett, D.L. (1994). "Users' Guides to the Medical Literature III-B: How to Use an Article About a Diagnostic Test." JAMA, 271(9), 703 to 707.	Practical guide to interpreting diagnostic tests via sensitivity, specificity, and likelihood ratios
Good, I.J. (1985). "Weight of Evidence: A Brief Survey." In Bayesian Statistics 2, pp. 249 to 270.	Accessible summary of the weight-of-evidence framework by Good himself

Application domain

Citation	Relevance
Manakul, P., Liusie, A., & Gales, M.J.F. (2023). "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." EMNLP 2023. arXiv:2303.08896	Motivates hallucination detection in AI-generated content; demonstrates that consistency varies by content type
Min, S., et al. (2023). "FActScore: Fine-Grained Atomic Factuality Evaluation in Long-Form Text Generation." EMNLP 2023. arXiv:2305.14251	Fine-grained factuality evaluation showing hallucination rates differ significantly by domain, which motivates domain-specific priors

Calibration note: The sensitivity, specificity, and prior values in src/domains.ts are expert-set heuristics, not empirically calibrated against a labeled dataset. They reflect informed judgment about relative layer reliability. Empirically calibrating these against real vs. hallucinated reference data would be a high-value community contribution. See CONTRIBUTING.md for details.

Contributing

This standard is intentionally public. Weights, thresholds, URL patterns, and AI instructions are all community-improvable. If a credible source is being rejected, or a junk source is passing, open an issue or PR.

→ Read CONTRIBUTING.md

Key things you can improve:

URL patterns: add a news outlet, government agency, or academic publisher that's being misclassified
Weights: propose evidence-backed changes to layer weights or thresholds
AI instructions: improve the prompt guidance for each domain's AI evaluator
New domains: make the case for a new domain (for example SOCIAL_MEDIA or PREPRINT)

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.github		.github
src		src
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts

Folders and files

Latest commit

History

Repository files navigation

groundcheck

The Problem

The Fix

Quick Start

How Bayesian Scoring Works (Plain English)

The core idea

Starting confidence (the prior)

How each check updates your confidence

Evidence accumulates

Why not just use v1 (weighted sum)?

Domain Scoring

ACADEMIC

NEWS

GOVERNMENT

GENERAL

Classification Logic

API Reference

classifyReference(ref)

computeDomainAwareScore(domain, layerResults) v1

computeBayesianScore(domain, layerResults) v2

DOMAIN_CONFIGS

Types

Architecture

Where this is used

Political Spectrum & Source Bias

The Problem

Approach

How It Works

Data Source

What This Does NOT Do

Limitations & Transparency

Prior Art & Theoretical Basis

Contributing

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`classifyReference(ref)`

`computeDomainAwareScore(domain, layerResults)` v1

`computeBayesianScore(domain, layerResults)` v2

`DOMAIN_CONFIGS`

Packages