affromero/groundcheck

groundcheck

The open, domain-aware reference verification standard.

Because a Reuters article and a Nature paper need different verification criteria.

Why this exists · Quick start · How it works · Domain scoring · API · Prior art · Contributing


The Problem

Most citation verification systems apply a single fixed formula to every source:

score = doi × 0.40 + title_search × 0.30 + url × 0.10 + ai × 0.20

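This is broken for anything that isn't an academic paper. DOI and academic title search are irrelevant for a New York Times article, so only the url and ai terms can contribute. Even with perfect confidence on both, a live, credible Reuters story tops out at 0.10 + 0.20 = 0.30 against a 0.65 threshold (0.23 with the confidences used in the worked example below), so it is always marked as removed.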

News sources silently end up with zero references.

The Fix

Domain-aware scoring. Each source is classified into one of five domains first, then scored by the layers and weights appropriate for that domain.

ACADEMIC   →  doi(0.45) + title_search(0.30) + url(0.10) + ai(0.15)  ≥ 0.70
NEWS       →  url(0.35) + ai(0.65)                                    ≥ 0.50
GOVERNMENT →  url(0.40) + ai(0.60)                                    ≥ 0.55
EDUCATIONAL→  url(0.30) + title_search(0.10) + ai(0.60)              ≥ 0.50
GENERAL    →  url(0.30) + title_search(0.10) + ai(0.60)              ≥ 0.55

Concrete result for a live NYT article:

  • Old (fixed weights): 0.10×0.6 + 0.20×0.85 = 0.23 → REMOVED
  • New (domain-aware NEWS): 0.35×0.6 + 0.65×0.85 = 0.76 → VERIFIED

Paywalled article (403 response): AI alone scores 0.65 × 0.85 = 0.5525 > 0.50 → VERIFIED
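The arithmetic above can be reproduced with a few lines of TypeScript. This is a minimal standalone sketch, not the package's source: `weightedScore`, `FIXED_WEIGHTS`, and `NEWS_WEIGHTS` are names invented here for illustration, and the weights are copied from the formulas in this section.

```typescript
// Sketch only: re-implements the weighted sums shown above; not the package's source.
type LayerId = 'doi' | 'title_search' | 'url' | 'ai';
type Weights = Partial<Record<LayerId, number>>;

interface LayerResult {
  layerId: LayerId;
  confidence: number; // 0.0 to 1.0
}

// Fixed weights from "The Problem" and NEWS weights from "The Fix".
const FIXED_WEIGHTS: Weights = { doi: 0.4, title_search: 0.3, url: 0.1, ai: 0.2 };
const NEWS_WEIGHTS: Weights = { url: 0.35, ai: 0.65 };

// Sum weight × confidence; layers that have no weight in the table contribute nothing.
function weightedScore(weights: Weights, results: LayerResult[]): number {
  return results.reduce(
    (sum, r) => sum + (weights[r.layerId] ?? 0) * r.confidence,
    0,
  );
}

const liveArticle: LayerResult[] = [
  { layerId: 'url', confidence: 0.6 },
  { layerId: 'ai', confidence: 0.85 },
];
const paywalledArticle: LayerResult[] = [
  { layerId: 'url', confidence: 0 },
  { layerId: 'ai', confidence: 0.85 },
];

const fixedScore = weightedScore(FIXED_WEIGHTS, liveArticle);      // ≈ 0.23, below the 0.65 threshold
const newsScore = weightedScore(NEWS_WEIGHTS, liveArticle);        // ≈ 0.7625, above the 0.50 threshold
const paywallScore = weightedScore(NEWS_WEIGHTS, paywalledArticle); // ≈ 0.5525, above the 0.50 threshold
```

The same layer results produce opposite verdicts purely because of the weight table, which is the point of classifying before scoring.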


Quick Start

npm install groundcheck

import {
  classifyReference,
  computeDomainAwareScore,
  DOMAIN_CONFIGS,
} from 'groundcheck';

// Step 1: classify the reference
const domain = classifyReference({
  doi: null,
  url: 'https://www.nytimes.com/2024/01/climate.html',
  type: 'ARTICLE',
});
// → 'NEWS'

// Step 2: run your verification layers (URL check, AI eval, etc.)
const layerResults = [
  { layerId: 'url', passed: true, confidence: 0.6 },
  { layerId: 'ai',  passed: true, confidence: 0.85 },
];

// Step 3: compute domain-aware score
const { score, verdict } = computeDomainAwareScore(domain, layerResults);
// → { score: 0.7625, verdict: 'VERIFIED' }

// Optional: access domain config (AI instructions, URL patterns, etc.)
const config = DOMAIN_CONFIGS[domain];
console.log(config.aiInstruction);
// → "Verify this is from a credible news outlet..."

v2 (Bayesian): Use computeBayesianScore for a probabilistic posterior with per-layer explainability. See the API reference for computeBayesianScore.


How Bayesian Scoring Works (Plain English)

Not a stats person? Here is the intuition behind computeBayesianScore.

The core idea

Start with a gut feeling, a starting probability, then update it with evidence. Each verification check nudges your confidence up or down. The result is a single probability (for example "81% chance this reference is real"), not a weighted percentage.

Starting confidence (the prior)

Each domain starts with a different base probability before any checks run. These reflect how often AI-generated content hallucinates references in that domain:

| Domain | Prior | Why |
|---|---|---|
| GOVERNMENT | 82% | Official government sources are rarely fabricated |
| NEWS | 75% | Established outlets are usually real; moderate hallucination risk |
| ACADEMIC | 72% | Papers are generally genuine; fabrication exists but is less common |
| GENERAL | 45% | Anonymous web content has high hallucination risk, so it gets a lower starting confidence |

How each check updates your confidence

Every verification layer has two diagnostic properties:

| Property | Plain English | What it means |
|---|---|---|
| Sensitivity | Hit rate | How often does this check pass for a real reference? High means it rarely misses real refs |
| Specificity | Fake-catcher rate | How often does this check fail for a fake reference? High means it rarely lets fakes through |

A layer with high sensitivity AND high specificity is highly informative. For NEWS, the AI layer (sensitivity 0.82, specificity 0.80) is more informative than the URL check (sensitivity 0.55, specificity 0.85), especially when a check fails: news articles are commonly paywalled, so the URL check's low sensitivity makes a failed URL only weak evidence of fakeness.

Evidence accumulates

The algorithm keeps a running tally in log-odds, a representation where you can simply add and subtract evidence instead of multiplying probabilities. At the end, it converts back to a normal probability from 0% to 100%.

Example, a paywalled NYT article (NEWS domain):

| Step | Evidence | Running probability |
|---|---|---|
| Prior | NEWS domain, moderate hallucination risk | 75% |
| URL 403 (confidence = 0) | Paywalled; credible outlets often return 403 | ~61% |
| AI confirms credible outlet (confidence = 0.85) | Strong positive signal | ~81% |
| Verdict | 81% ≥ 65% Bayesian threshold | VERIFIED |

A broken URL from a known outlet only lowers the probability from 75% to about 61%; it does not disqualify the reference on its own. Strong AI confirmation then raises the probability to about 81%, which clears the 65% threshold for NEWS.
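For readers who want to check the running probabilities by hand, the steps can be written out as follows. The derivation uses the NEWS layer parameters listed under Domain Scoring (url: sensitivity 0.55, specificity 0.85; ai: sensitivity 0.82, specificity 0.80) and the update rule from the API reference; here σ(x) = 1 / (1 + e^(−x)) converts log-odds back to a probability.

```latex
\begin{align*}
\text{prior log-odds} &= \ln\frac{0.75}{1 - 0.75} = \ln 3 \approx 1.0986 \\
\Delta_{\text{url}} &= 0 \cdot \ln(\mathrm{LR}^{+}) + 1 \cdot \ln(\mathrm{LR}^{-})
  = \ln\frac{1 - 0.55}{0.85} \approx -0.6360 \\
\sigma(1.0986 - 0.6360) &= \sigma(0.4626) \approx 0.61 \\
\Delta_{\text{ai}} &= 0.85\,\ln\frac{0.82}{1 - 0.80} + 0.15\,\ln\frac{1 - 0.82}{0.80}
  \approx 1.1993 - 0.2237 = 0.9756 \\
\sigma(0.4626 + 0.9756) &= \sigma(1.4382) \approx 0.81 \;\ge\; 0.65
\end{align*}
```

The two Δ values are the per-layer contributions that logOddsContributions reports (about −0.64 for url and +0.98 for ai).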

Why not just use v1 (weighted sum)?

v1 is simpler and faster. v2 adds three things:

  1. A domain-calibrated starting estimate. The prior accounts for base rates of hallucination by content type.
  2. Principled evidence combination. Bayes' theorem handles asymmetric layers gracefully (a weak layer barely moves the posterior; a strong layer moves it a lot).
  3. Per-layer explainability. logOddsContributions shows exactly which check helped and which hurt, making failures debuggable.

For most references, v1 and v2 agree. The difference shows up in edge cases: a paywalled article from a credible outlet, or a reference with strong AI support but a broken URL.

See the full API docs for computeBayesianScore


Domain Scoring

Column guide: LR+ = sensitivity / (1 - specificity); LR- = (1 - sensitivity) / specificity. Higher LR+ means a confident pass is stronger evidence of a real reference; lower LR- means a confident fail is stronger evidence of a fake.
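As a quick check on the LR columns in the tables below, the two ratios can be computed directly from a layer's sensitivity and specificity. This is a small illustrative sketch; `likelihoodRatios` and `round2` are helper names made up for this example and are not part of the package's API.

```typescript
// Sketch only: derives the LR+ / LR- columns from sensitivity and specificity.
interface Diagnostic {
  sensitivity: number; // P(pass given real)
  specificity: number; // P(fail given fake)
}

function likelihoodRatios({ sensitivity, specificity }: Diagnostic): {
  lrPlus: number;
  lrMinus: number;
} {
  return {
    lrPlus: sensitivity / (1 - specificity),
    lrMinus: (1 - sensitivity) / specificity,
  };
}

// Round to two decimals, matching the precision used in the tables.
const round2 = (x: number): number => Math.round(x * 100) / 100;

// ACADEMIC doi layer: sensitivity 0.92, specificity 0.97.
const academicDoi = likelihoodRatios({ sensitivity: 0.92, specificity: 0.97 });
// round2(academicDoi.lrPlus) is 30.67, round2(academicDoi.lrMinus) is 0.08

// NEWS url layer: sensitivity 0.55, specificity 0.85.
const newsUrl = likelihoodRatios({ sensitivity: 0.55, specificity: 0.85 });
// round2(newsUrl.lrPlus) is 3.67, round2(newsUrl.lrMinus) is 0.53
```

An LR- close to 1 (such as 0.93 for the GENERAL title_search layer) means a failed check says almost nothing, while an LR- near 0 (0.08 for the ACADEMIC doi layer) means a failed check is strong evidence of a fake.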

ACADEMIC

Peer-reviewed papers, preprints, books, technical reports

v1 threshold: ≥ 0.70 | v2 prior: 0.72 | v2 bayesianThreshold: ≥ 0.82

| Layer | v1 Weight | Sensitivity | Specificity | LR+ | LR- |
|---|---|---|---|---|---|
| doi | 0.45 | 0.92 | 0.97 | 30.67 | 0.08 |
| title_search | 0.30 | 0.80 | 0.88 | 6.67 | 0.23 |
| url | 0.10 | 0.70 | 0.72 | 2.50 | 0.42 |
| ai | 0.15 | 0.78 | 0.82 | 4.33 | 0.27 |

Classified by: DOI present, arXiv/PubMed/Nature/IEEE URL, PAPER/BOOK type


NEWS

Established news outlets (NYT, Reuters, BBC, AP, Guardian, Bloomberg…)

v1 threshold: ≥ 0.50 | v2 prior: 0.75 | v2 bayesianThreshold: ≥ 0.65

| Layer | v1 Weight | Sensitivity | Specificity | LR+ | LR- |
|---|---|---|---|---|---|
| url | 0.35 | 0.55 | 0.85 | 3.67 | 0.53 |
| ai | 0.65 | 0.82 | 0.80 | 4.10 | 0.23 |

Classified by: Reuters/NYT/BBC/AP/Guardian/Bloomberg/FT URL pattern, ARTICLE type · Lower v1 threshold because credible outlets often return 403/paywall

Paywall math: 0.65 × 0.85 = 0.5525 > 0.50. A credible outlet passes via AI even with a dead URL.


GOVERNMENT

Official government reports, legislation, statistics

v1 threshold: ≥ 0.55 | v2 prior: 0.82 | v2 bayesianThreshold: ≥ 0.72

| Layer | v1 Weight | Sensitivity | Specificity | LR+ | LR- |
|---|---|---|---|---|---|
| url | 0.40 | 0.85 | 0.93 | 12.14 | 0.16 |
| ai | 0.60 | 0.80 | 0.84 | 5.00 | 0.24 |

Classified by: .gov, who.int, un.org, worldbank.org, oecd.org URL patterns


GENERAL

Wikipedia, blogs, videos, podcasts, and other web content

v1 threshold: ≥ 0.55 | v2 prior: 0.45 | v2 bayesianThreshold: ≥ 0.68

| Layer | v1 Weight | Sensitivity | Specificity | LR+ | LR- |
|---|---|---|---|---|---|
| url | 0.30 | 0.65 | 0.70 | 2.17 | 0.50 |
| title_search | 0.10 | 0.30 | 0.75 | 1.20 | 0.93 |
| ai | 0.60 | 0.72 | 0.78 | 3.27 | 0.36 |

Classified by: Catch-all for anything not classified above


Classification Logic

classifyReference follows a strict priority order:

flowchart TD
    Start["Reference { doi, url, type }"] --> Q1{"DOI present?"}
    Q1 -->|yes| ACAD["ACADEMIC"]
    Q1 -->|no| Q2{"URL matches ACADEMIC patterns?"}
    Q2 -->|yes| ACAD
    Q2 -->|no| Q3{"URL matches NEWS patterns?"}
    Q3 -->|yes| NEWS["NEWS"]
    Q3 -->|no| Q4{"URL matches GOVERNMENT patterns?"}
    Q4 -->|yes| GOV["GOVERNMENT"]
    Q4 -->|no| Q5{"Type matches ACADEMIC types?"}
    Q5 -->|yes| ACAD
    Q5 -->|no| Q6{"ARTICLE type with matching URL?"}
    Q6 -->|yes| NEWS
    Q6 -->|no| GEN["GENERAL (fallback)"]
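The priority order in the flowchart can be sketched as a chain of early returns. This is an illustrative approximation, not the package's `classifyReference`: the function name `sketchClassify`, the `hostOf` helper, and the regular expressions are assumptions built from the "Classified by" notes in the Domain Scoring section, and the real urlPatterns in DOMAIN_CONFIGS may differ. The flowchart has no EDUCATIONAL branch, so none is included here. The final "ARTICLE type with matching URL" check is also left out, because the flowchart does not say which URL patterns it matches; if it refers to the NEWS patterns, omitting it does not change the result.

```typescript
// Sketch only: hypothetical patterns, not the package's real urlPatterns.
type SketchDomain = 'ACADEMIC' | 'NEWS' | 'GOVERNMENT' | 'GENERAL';

interface Ref {
  doi?: string | null;
  url?: string | null;
  type?: string | null;
}

const ACADEMIC_HOSTS = /(^|\.)(arxiv\.org|pubmed\.ncbi\.nlm\.nih\.gov|nature\.com|ieee\.org)$/;
const NEWS_HOSTS =
  /(^|\.)(reuters\.com|nytimes\.com|bbc\.com|bbc\.co\.uk|apnews\.com|theguardian\.com|bloomberg\.com|ft\.com)$/;
const GOVERNMENT_HOSTS = /\.gov$|(^|\.)(who\.int|un\.org|worldbank\.org|oecd\.org)$/;
const ACADEMIC_TYPES = ['PAPER', 'BOOK'];

// Strip the protocol, then cut at the first path, query, or fragment delimiter.
function hostOf(url: string): string {
  const withoutProtocol = url.replace(/^[a-z][a-z0-9+.-]*:\/\//i, '');
  return (withoutProtocol.split(/[/?#]/)[0] ?? '').toLowerCase();
}

function sketchClassify(ref: Ref): SketchDomain {
  // 1. A DOI always wins.
  if (ref.doi) return 'ACADEMIC';

  // 2-4. URL patterns, checked in priority order: academic, news, government.
  const host = ref.url ? hostOf(ref.url) : '';
  if (ACADEMIC_HOSTS.test(host)) return 'ACADEMIC';
  if (NEWS_HOSTS.test(host)) return 'NEWS';
  if (GOVERNMENT_HOSTS.test(host)) return 'GOVERNMENT';

  // 5. Fall back to the declared reference type.
  if (ref.type && ACADEMIC_TYPES.includes(ref.type.toUpperCase())) return 'ACADEMIC';

  // 6. Catch-all.
  return 'GENERAL';
}

const nyt = sketchClassify({ doi: null, url: 'https://www.nytimes.com/2024/01/climate.html', type: 'ARTICLE' });
// nyt is 'NEWS'
```

Order matters: a PubMed URL ends in .gov, but because the academic patterns are tested first it is classified as ACADEMIC rather than GOVERNMENT.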

API Reference

classifyReference(ref)

Classify a reference into a content domain.

function classifyReference(ref: {
  doi?: string | null;
  url?: string | null;
  type?: string | null;
}): ContentDomain

computeDomainAwareScore(domain, layerResults) v1

Compute a weighted-sum score for a given domain.

function computeDomainAwareScore(
  domain: ContentDomain,
  layerResults: LayerResult[]
): { score: number; verdict: 'VERIFIED' | 'FAILED' }

score is between 0 and 1. verdict is 'VERIFIED' if score >= domain.threshold, 'FAILED' otherwise.

Layer results for layers not applicable to the domain are ignored.


computeBayesianScore(domain, layerResults) v2

Compute a Bayesian posterior probability using log-odds updating.

function computeBayesianScore(
  domain: ContentDomain,
  layerResults: LayerResult[]
): {
  posterior: number;              // P(reference is real given evidence), 0.0 to 1.0
  verdict: 'VERIFIED' | 'FAILED'; // posterior >= domain.bayesianThreshold
  logOddsContributions: Record<string, number>; // per-layer Δ log-odds (for transparency)
}

Algorithm:

prior_log_odds = ln(prior / (1 - prior))

For each applicable layer with confidence c ∈ [0, 1]:
  LR+ = sensitivity / (1 - specificity)   (how much a pass shifts toward "real")
  LR- = (1 - sensitivity) / specificity   (how much a fail shifts toward "fake")
  Δ   = c × ln(LR+) + (1-c) × ln(LR-)

posterior = sigmoid(prior_log_odds + Σ Δ)

Absent layers default to c = 0.5 (minimally informative). logOddsContributions exposes each layer's Δ for debugging and explainability.

Example:

const { posterior, verdict, logOddsContributions } = computeBayesianScore('NEWS', [
  { layerId: 'url', passed: false, confidence: 0 },  // 403 paywall
  { layerId: 'ai',  passed: true,  confidence: 0.85 },
]);
// posterior ≈ 0.81, verdict: 'VERIFIED'
// logOddsContributions: { url: -0.64, ai: +0.98 }
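
To make the algorithm concrete, here is a minimal standalone sketch of the log-odds update that reproduces the numbers in the example above. It is not the package's implementation: `sketchBayesianScore`, `DomainParams`, and `NEWS_PARAMS` are names invented for illustration, and the NEWS parameters are copied from the Domain Scoring section.

```typescript
// Standalone sketch of the log-odds update; not the package's source code.
type LayerId = 'doi' | 'title_search' | 'url' | 'ai';

interface LayerParams {
  id: LayerId;
  sensitivity: number; // P(pass given real)
  specificity: number; // P(fail given fake)
}

interface DomainParams {
  prior: number;
  bayesianThreshold: number;
  layers: LayerParams[];
}

interface LayerResult {
  layerId: LayerId;
  passed: boolean;
  confidence: number; // 0.0 to 1.0
}

interface BayesianResult {
  posterior: number;
  verdict: 'VERIFIED' | 'FAILED';
  logOddsContributions: Record<string, number>;
}

// NEWS parameters as listed in the Domain Scoring section.
const NEWS_PARAMS: DomainParams = {
  prior: 0.75,
  bayesianThreshold: 0.65,
  layers: [
    { id: 'url', sensitivity: 0.55, specificity: 0.85 },
    { id: 'ai', sensitivity: 0.82, specificity: 0.8 },
  ],
};

const sigmoid = (x: number): number => 1 / (1 + Math.exp(-x));

function sketchBayesianScore(params: DomainParams, results: LayerResult[]): BayesianResult {
  // Start from the domain prior, expressed as log-odds.
  let logOdds = Math.log(params.prior / (1 - params.prior));
  const logOddsContributions: Record<string, number> = {};

  for (const layer of params.layers) {
    const result = results.find((r) => r.layerId === layer.id);
    const c = result ? result.confidence : 0.5; // absent layers default to c = 0.5
    const lrPlus = layer.sensitivity / (1 - layer.specificity);
    const lrMinus = (1 - layer.sensitivity) / layer.specificity;
    // Confidence interpolates between the "pass" and "fail" evidence weights.
    const delta = c * Math.log(lrPlus) + (1 - c) * Math.log(lrMinus);
    logOddsContributions[layer.id] = delta;
    logOdds += delta;
  }

  const posterior = sigmoid(logOdds);
  const verdict = posterior >= params.bayesianThreshold ? 'VERIFIED' : 'FAILED';
  return { posterior, verdict, logOddsContributions };
}

const paywalledNews = sketchBayesianScore(NEWS_PARAMS, [
  { layerId: 'url', passed: false, confidence: 0 }, // 403 paywall
  { layerId: 'ai', passed: true, confidence: 0.85 },
]);
// paywalledNews.posterior ≈ 0.81, verdict 'VERIFIED', contributions ≈ { url: -0.64, ai: 0.98 }
```

One consequence of the formula worth knowing: at c = 0.5 a layer contributes 0.5 × (ln LR+ + ln LR-), which is small but only exactly zero when LR+ × LR- = 1, so an absent layer can still nudge the posterior slightly.
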

DOMAIN_CONFIGS

const DOMAIN_CONFIGS: Record<ContentDomain, DomainConfig>

Full domain configuration map. Each DomainConfig includes:

interface DomainConfig {
  domain: ContentDomain;
  label: string;                 // 'Academic' | 'News' | 'Government' | 'Educational' | 'General'
  description: string;
  layers: BayesianLayerConfig[]; // applicable layers with weights + Bayesian params
  threshold: number;             // v1: minimum weighted score to VERIFY
  prior: number;                 // v2: P(reference is real given domain)
  bayesianThreshold: number;     // v2: minimum posterior probability to VERIFY
  aiInstruction: string;         // injected into AI evaluator prompt
  urlPatterns?: RegExp[];        // URL patterns for classification
  typePatterns?: string[];       // ReferenceType values for classification
}

interface BayesianLayerConfig extends LayerConfig {
  bayesian: {
    sensitivity: number; // P(pass given real), 0.0 to 1.0
    specificity: number; // P(fail given fake), 0.0 to 1.0
  };
}

Types

type ContentDomain = 'ACADEMIC' | 'NEWS' | 'GOVERNMENT' | 'EDUCATIONAL' | 'GENERAL';

type LayerId = 'doi' | 'title_search' | 'url' | 'ai';

interface LayerResult {
  layerId: LayerId;
  passed: boolean;
  confidence: number; // 0.0 to 1.0
}

interface LayerConfig {
  id: LayerId;
  weight: number;      // normalized weight, all layers in a domain sum to 1.0
  description: string;
}

Architecture

flowchart TD
    Input["Reference Input<br/>{ doi, url, type }"] --> Classify["classifyReference()"]
    Classify --> Domain["ContentDomain<br/>ACADEMIC / NEWS / GOVERNMENT / EDUCATIONAL / GENERAL"]
    Domain --> URL["Layer: URL<br/>(HEAD check)"]
    Domain --> AI["Layer: AI<br/>(LLM eval)"]
    Domain --> DOI["Layer: DOI / title_search"]
    URL --> Results["LayerResult[]<br/>{ layerId, passed, confidence }"]
    AI --> Results
    DOI --> Results
    Results --> V1["v1: weighted sum<br/>sum of weight × confidence, then score ≥ threshold"]
    Results --> V2["v2: Bayesian log-odds<br/>prior, per-layer update, then posterior ≥ threshold"]
    V1 --> Out1["Output<br/>{ score, verdict }"]
    V2 --> Out2["Output<br/>{ posterior, verdict, logOddsContributions }"]
Source layout:
src/
├── types.ts      ContentDomain, LayerId, LayerResult, DomainConfig, BayesianLayerConfig
├── domains.ts    DOMAIN_CONFIGS (the standard itself, including Bayesian params)
├── classify.ts   classifyReference()
├── score.ts      computeDomainAwareScore() [v1: weighted sum]
├── bayesian.ts   computeBayesianScore()    [v2: log-odds updating]
└── index.ts      public exports

The standard has zero runtime dependencies. Pure TypeScript that works in any JS environment.


Where this is used

This standard is application-agnostic. Any tool that cites web sources, including RAG pipelines, research assistants, search and answer engines, and content generators, can use it to verify references and attach a domain-aware trust badge (Academic, News, Government, Educational, or General) to every citation.

It is maintained as a standalone, dependency-free package by Andres Romero. Sotto is one consumer, vendoring it as a submodule so every reference it surfaces is scored by the logic here. When the standard improves via community PRs, any consumer benefits by updating its dependency.


Political Spectrum & Source Bias

The Problem

AI-generated content can inadvertently reflect a single political perspective when the source material fed into generation is ideologically one-sided. Output built entirely from sources rated "Left" by media-bias researchers will skew its framing, word choice, and which facts it emphasises, even if every cited reference passes verification.

Concrete example: A generated explainer on immigration policy sourced exclusively from outlets rated Left-Center produces accurate but one-sided content. Every URL resolves (✅ VERIFIED), yet a reader expecting balanced treatment is misled. Reference verification alone cannot catch this; it is orthogonal to the question of ideological balance.

Approach

Static media-bias lookup at content-extraction time, not at verification time.

The lookup runs once per source URL when content is first extracted, before generation runs. It annotates the extraction context with bias metadata. The generation prompt then receives conditional guidance, only when the topic is political, to seek balance or flag one-sidedness to the user.

This keeps bias detection cleanly separated from reference verification: the verification standard scores whether a reference is real; bias metadata informs whether the generation prompt should seek additional perspective.

How It Works

flowchart TD
    URLs["Source URLs<br/>(from content extraction)"] --> Extract["Domain extraction<br/>(strip protocol, path, query)"]
    Extract --> Lookup["MBFC dataset lookup<br/>{ bias, credibility, country }"]
    Lookup --> Detect{"Political topic?"}
    Detect -->|yes| Inject["Inject bias guidance<br/>into the generation prompt"]
    Detect -->|no| Skip["No bias guidance injected"]

Bias categories surfaced per source:

| Value | Meaning |
|---|---|
| left | Far-left leaning |
| left-center | Center-left leaning |
| center | Least-biased / centrist |
| right-center | Center-right leaning |
| right | Far-right leaning |
| conspiracy-pseudoscience | Promotes conspiracy theories or pseudoscience |
| satire | Satire; content should not be treated as factual |
| fake-news | Known misinformation outlet |

When all detected sources share the same non-center rating and the topic is political, the generation prompt is augmented with guidance to note the ideological lean to the reader and, where possible, incorporate contrasting framing.
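
The rule above can be sketched in a few lines. Everything here is hypothetical and for illustration only: `SAMPLE_RATINGS` is a made-up stand-in for the bundled MBFC lookup (the domains use the reserved .test TLD), `extractDomain` and `needsBalanceGuidance` are invented names, stripping a leading "www." is an assumption about how lookup keys are normalized, and "detected" is read here as "found in the dataset".

```typescript
// Sketch only: hypothetical data and helper names, not the actual lookup code.
type Bias =
  | 'left' | 'left-center' | 'center' | 'right-center' | 'right'
  | 'conspiracy-pseudoscience' | 'satire' | 'fake-news';

// Made-up stand-in for the bundled static JSON lookup.
const SAMPLE_RATINGS: Record<string, Bias> = {
  'example-left.test': 'left',
  'another-left.test': 'left',
  'example-center.test': 'center',
};

// Strip protocol, path, and query; dropping "www." is an assumption of this sketch.
function extractDomain(url: string): string {
  const withoutProtocol = url.replace(/^[a-z][a-z0-9+.-]*:\/\//i, '');
  const host = (withoutProtocol.split(/[/?#]/)[0] ?? '').toLowerCase();
  return host.replace(/^www\./, '');
}

function lookupBias(url: string): Bias | undefined {
  const domain = extractDomain(url);
  return Object.prototype.hasOwnProperty.call(SAMPLE_RATINGS, domain)
    ? SAMPLE_RATINGS[domain]
    : undefined;
}

// True when the topic is political and every rated source shares one non-center rating.
function needsBalanceGuidance(sourceUrls: string[], isPoliticalTopic: boolean): boolean {
  if (!isPoliticalTopic) return false;
  const ratings = sourceUrls
    .map(lookupBias)
    .filter((b): b is Bias => b !== undefined);
  if (ratings.length === 0) return false;
  const first = ratings[0];
  return first !== 'center' && ratings.every((r) => r === first);
}

const oneSided = needsBalanceGuidance(
  ['https://www.example-left.test/a?x=1', 'http://another-left.test/b'],
  true,
);
// oneSided is true: both sources are rated 'left' and the topic is political
```

Because the check only returns a boolean for the prompt builder, it never removes a source or changes a verification verdict, which matches the separation described in the Approach section.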

Data Source

Dataset: drmikecrowe/mbfcext, a community-maintained mirror of Media Bias / Fact Check (MBFC) ratings, licensed MIT.

  • Size: 9,773 sources (as of dataset release)
  • Update cadence: Auto-updated daily from MBFC ratings via the upstream repository's CI
  • Fields used: domain, bias, credibility, country

No network call is made at generation time. The dataset is bundled as a static JSON lookup.

What This Does NOT Do

  • Does not reject sources. A source rated right or left is not excluded from the output. The verification standard continues to assess whether the reference is real.
  • Does not editorialize. The system does not label content "biased" to the end user unprompted. Guidance is injected into the generation prompt, not the generated output.
  • Does not apply to non-political topics. Technology tutorials, science explainers, and cooking guides all suppress bias guidance entirely when political topic detection returns negative.

Limitations & Transparency

| Limitation | Detail |
|---|---|
| One framework among several | MBFC is widely cited but not the only media bias rating system. AllSides and Ad Fontes Media use different methodologies and sometimes reach different conclusions for the same outlet. |
| US-centric dataset | MBFC coverage is strongest for US English-language media. Non-US sources are rated but coverage is uneven; many regional outlets are absent from the dataset entirely. |
| Source-level ≠ article-level | A center-rated outlet can publish a one-sided op-ed. A left-rated outlet can publish a balanced investigative piece. The lookup reflects outlet-level ratings, not individual article analysis. |
| Static snapshot | The bundled dataset reflects ratings at a point in time. Outlets change ownership and editorial stance; the dataset may lag real-world shifts by weeks or months. |
| No confidence score | MBFC ratings are categorical, not probabilistic. The dataset does not expose reviewer agreement or confidence, so every rating is treated with equal weight regardless of how contested it may be. |

Community contributions that extend coverage to non-US sources, integrate a second bias framework, or add article-level analysis are welcome. See CONTRIBUTING.md.


Prior Art & Theoretical Basis

The Bayesian log-odds algorithm is a well-established pattern in statistics and evidence-based medicine. Each domain layer's sensitivity/specificity pair determines its likelihood ratios (LR+ and LR-), and evidence accumulates additively in log space, a formulation due to Good (1950). The domain priors are meant to reflect base rates of hallucinated references by content type, motivated by factuality benchmarks showing that hallucination rates differ significantly across content domains.

Methodology

| Citation | Relevance |
|---|---|
| Good, I.J. (1950). Probability and the Weighing of Evidence. Charles Griffin. | Formalizes initial_log_odds + weight_of_evidence = final_log_odds, where weight of evidence = log(likelihood ratio); the core formula used here |
| Wald, A. (1947). Sequential Analysis. Wiley. | Foundation of sequential likelihood-ratio testing; the theoretical precursor to log-odds evidence accumulation |
| Fagan, T.J. (1975). "Nomogram for Bayes's theorem." New England Journal of Medicine, 293, 257. | Graphical tool for applying Bayes' theorem via likelihood ratios to convert pre-test to post-test probability |
| Jaeschke, R., Guyatt, G.H., & Sackett, D.L. (1994). "Users' Guides to the Medical Literature III-B: How to Use an Article About a Diagnostic Test." JAMA, 271(9), 703-707. | Practical guide to interpreting diagnostic tests via sensitivity, specificity, and likelihood ratios |
| Good, I.J. (1985). "Weight of Evidence: A Brief Survey." In Bayesian Statistics 2, pp. 249-270. | Accessible summary of the weight-of-evidence framework by Good himself |

Application domain

| Citation | Relevance |
|---|---|
| Manakul, P., Liusie, A., & Gales, M.J.F. (2023). "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." EMNLP 2023. arXiv:2303.08896 | Motivates hallucination detection in AI-generated content; demonstrates that consistency varies by content type |
| Min, S., et al. (2023). "FActScore: Fine-Grained Atomic Factuality Evaluation in Long-Form Text Generation." EMNLP 2023. arXiv:2305.14251 | Fine-grained factuality evaluation showing hallucination rates differ significantly by domain, which motivates domain-specific priors |

Calibration note: The sensitivity, specificity, and prior values in src/domains.ts are expert-set heuristics, not empirically calibrated against a labeled dataset. They reflect informed judgment about relative layer reliability. Empirically calibrating these against real vs. hallucinated reference data would be a high-value community contribution. See CONTRIBUTING.md for details.
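
As a starting point for such a contribution, sensitivity and specificity for a single layer can be estimated from a labeled sample by counting outcomes. The sketch below is only an illustration of that counting step: `LabeledOutcome`, `estimateDiagnostics`, and the toy sample are all hypothetical (the sample is synthetic, not measured data), and a real calibration would need a properly collected dataset and some treatment of the layer's continuous confidence value.

```typescript
// Sketch only: estimates one layer's sensitivity and specificity from labeled outcomes.
interface LabeledOutcome {
  isReal: boolean; // ground truth: is the reference genuine?
  passed: boolean; // did the layer's check pass?
}

function estimateDiagnostics(sample: LabeledOutcome[]): {
  sensitivity: number;
  specificity: number;
} {
  let truePositives = 0;  // real reference, check passed
  let falseNegatives = 0; // real reference, check failed
  let trueNegatives = 0;  // fake reference, check failed
  let falsePositives = 0; // fake reference, check passed

  for (const outcome of sample) {
    if (outcome.isReal) {
      if (outcome.passed) truePositives++;
      else falseNegatives++;
    } else {
      if (outcome.passed) falsePositives++;
      else trueNegatives++;
    }
  }

  // Note: yields NaN when the sample contains no real (or no fake) references.
  return {
    sensitivity: truePositives / (truePositives + falseNegatives),
    specificity: trueNegatives / (trueNegatives + falsePositives),
  };
}

// Synthetic toy sample: 10 real references (9 pass) and 5 fake references (4 fail).
const toySample: LabeledOutcome[] = [
  ...Array.from({ length: 9 }, () => ({ isReal: true, passed: true })),
  { isReal: true, passed: false },
  ...Array.from({ length: 4 }, () => ({ isReal: false, passed: false })),
  { isReal: false, passed: true },
];

const estimated = estimateDiagnostics(toySample);
// estimated.sensitivity is 0.9 and estimated.specificity is 0.8 for this toy sample
```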


Contributing

This standard is intentionally public. Weights, thresholds, URL patterns, and AI instructions are all community-improvable. If a credible source is being rejected, or a junk source is passing, open an issue or PR.

Read CONTRIBUTING.md

Key things you can improve:

  • URL patterns: add a news outlet, government agency, or academic publisher that's being misclassified
  • Weights: propose evidence-backed changes to layer weights or thresholds
  • AI instructions: improve the prompt guidance for each domain's AI evaluator
  • New domains: make the case for a new domain (for example SOCIAL_MEDIA or PREPRINT)

License

MIT. Copyright © 2024 Andres Romero
