The open, domain-aware reference verification standard.
Because a Reuters article and a Nature paper need different verification criteria.
Why this exists · Quick start · How it works · Domain scoring · API · Prior art · Contributing
Most citation verification systems apply a single fixed formula to every source:
score = doi × 0.40 + title_search × 0.30 + url × 0.10 + ai × 0.20
This is broken for anything that isn't an academic paper. DOI and academic title search are irrelevant for a New York Times article, so a news source can earn points only from the url and ai layers. Even with perfect results on both it tops out at 0.30, and a live, credible Reuters story typically scores about 0.23 against a 0.65 threshold, so it is always marked as removed.
News sources silently end up with zero references.
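To make the ceiling concrete, here is a minimal sketch of the fixed formula quoted above. The function and type names (`fixedWeightScore`, `LayerScores`) are hypothetical, introduced only for this illustration; the weights are the ones in the formula.

```typescript
// Hypothetical reconstruction of the fixed-weight formula, for illustration only.
type LayerScores = { doi: number; title_search: number; url: number; ai: number };

function fixedWeightScore(s: LayerScores): number {
  return s.doi * 0.40 + s.title_search * 0.30 + s.url * 0.10 + s.ai * 0.20;
}

// A news article has no DOI and no academic title match, so those layers contribute 0.
const typicalNews = fixedWeightScore({ doi: 0, title_search: 0, url: 0.6, ai: 0.85 });
const bestCaseNews = fixedWeightScore({ doi: 0, title_search: 0, url: 1, ai: 1 });
// typicalNews ≈ 0.23 and bestCaseNews ≈ 0.30: both fall short of the 0.65 threshold
```

Under this formula 70% of the available weight is tied to layers that a news article cannot satisfy, which is why no URL or AI result can rescue it.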
Domain-aware scoring. Each source is classified into one of five domains first, then scored by the layers and weights appropriate for that domain.
ACADEMIC → doi(0.45) + title_search(0.30) + url(0.10) + ai(0.15) ≥ 0.70
NEWS → url(0.35) + ai(0.65) ≥ 0.50
GOVERNMENT → url(0.40) + ai(0.60) ≥ 0.55
EDUCATIONAL → url(0.30) + title_search(0.10) + ai(0.60) ≥ 0.50
GENERAL → url(0.30) + title_search(0.10) + ai(0.60) ≥ 0.55
Concrete result: A live NYT article:
- Old (fixed weights): 0.10×0.6 + 0.20×0.85 = 0.23 → REMOVED ❌
- New (domain-aware NEWS): 0.35×0.6 + 0.65×0.85 = 0.76 → VERIFIED ✅
Paywalled article (403 response): AI alone scores 0.65 × 0.85 = 0.5525 > 0.50 → VERIFIED ✅
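The arithmetic above can be sketched as a small weighted sum. This is an illustrative stand-in, not the package's implementation: `sketchWeightedScore`, `V1`, and the local types are hypothetical names, while the weights and thresholds are copied from the per-domain list above.

```typescript
type Domain = 'ACADEMIC' | 'NEWS' | 'GOVERNMENT' | 'EDUCATIONAL' | 'GENERAL';
type Layer = 'doi' | 'title_search' | 'url' | 'ai';
interface Result { layerId: Layer; confidence: number }

// Weights and thresholds as listed above (hypothetical container, for illustration).
const V1: Record<Domain, { weights: Partial<Record<Layer, number>>; threshold: number }> = {
  ACADEMIC: { weights: { doi: 0.45, title_search: 0.30, url: 0.10, ai: 0.15 }, threshold: 0.70 },
  NEWS: { weights: { url: 0.35, ai: 0.65 }, threshold: 0.50 },
  GOVERNMENT: { weights: { url: 0.40, ai: 0.60 }, threshold: 0.55 },
  EDUCATIONAL: { weights: { url: 0.30, title_search: 0.10, ai: 0.60 }, threshold: 0.50 },
  GENERAL: { weights: { url: 0.30, title_search: 0.10, ai: 0.60 }, threshold: 0.55 },
};

function sketchWeightedScore(domain: Domain, results: Result[]) {
  const { weights, threshold } = V1[domain];
  let score = 0;
  for (const r of results) {
    // Layers that the domain does not list carry weight 0 and are effectively ignored.
    score += (weights[r.layerId] ?? 0) * r.confidence;
  }
  return { score, verdict: score >= threshold ? 'VERIFIED' : 'FAILED' };
}

const liveArticle = sketchWeightedScore('NEWS', [
  { layerId: 'url', confidence: 0.6 },
  { layerId: 'ai', confidence: 0.85 },
]); // score ≈ 0.7625, verdict 'VERIFIED'

const paywalledArticle = sketchWeightedScore('NEWS', [
  { layerId: 'url', confidence: 0 },
  { layerId: 'ai', confidence: 0.85 },
]); // score ≈ 0.5525, verdict 'VERIFIED'
```

Because each domain's weights sum to 1.0, the score stays in the 0 to 1 range as long as every confidence does.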
npm install groundcheck
import {
classifyReference,
computeDomainAwareScore,
DOMAIN_CONFIGS,
} from 'groundcheck';
// Step 1: classify the reference
const domain = classifyReference({
doi: null,
url: 'https://www.nytimes.com/2024/01/climate.html',
type: 'ARTICLE',
});
// → 'NEWS'
// Step 2: run your verification layers (URL check, AI eval, etc.)
const layerResults = [
{ layerId: 'url', passed: true, confidence: 0.6 },
{ layerId: 'ai', passed: true, confidence: 0.85 },
];
// Step 3: compute domain-aware score
const { score, verdict } = computeDomainAwareScore(domain, layerResults);
// → { score: 0.7625, verdict: 'VERIFIED' }
// Optional: access domain config (AI instructions, URL patterns, etc.)
const config = DOMAIN_CONFIGS[domain];
console.log(config.aiInstruction);
// → "Verify this is from a credible news outlet..."
v2 (Bayesian): Use computeBayesianScore for a probabilistic posterior with per-layer explainability. See the API reference for computeBayesianScore.
Not a stats person? Here is the intuition behind computeBayesianScore.
Start with a gut feeling, a starting probability, then update it with evidence. Each verification check nudges your confidence up or down. The result is a single probability (for example "81% chance this reference is real"), not a weighted percentage.
Each domain starts with a different base probability that a reference is real, before any checks run. These priors are heuristic estimates: a higher prior means AI-generated content is judged less likely to hallucinate references in that domain:
| Domain | Prior | Why |
|---|---|---|
| GOVERNMENT | 82% | Official government sources are rarely fabricated |
| NEWS | 75% | Established outlets are usually real; moderate hallucination risk |
| ACADEMIC | 72% | Papers are generally genuine; fabrication exists but is less common |
| GENERAL | 45% | Anonymous web content has high hallucination risk, so it gets a lower starting confidence |
Every verification layer has two diagnostic properties:
| Property | Plain English | What it means |
|---|---|---|
| Sensitivity | Hit rate | How often does this check pass for a real reference? High means it rarely misses real refs |
| Specificity | Fake-catcher rate | How often does this check fail for a fake reference? High means it rarely lets fakes through |
A layer with both high sensitivity and high specificity is highly informative. For NEWS, the AI layer (sensitivity 0.82, specificity 0.80) carries more signal than the URL check (sensitivity 0.55, specificity 0.85), especially on failure: news articles are commonly paywalled, so a failed URL check is weak evidence of fakeness (LR- 0.53, against 0.23 for a failed AI check).
The algorithm keeps a running tally in log-odds, a representation where you can simply add and subtract evidence instead of multiplying probabilities. At the end, it converts back to a normal probability from 0% to 100%.
Example, a paywalled NYT article (NEWS domain):
| Step | Evidence | Running probability |
|---|---|---|
| Prior | NEWS domain, moderate hallucination risk | 75% |
| URL 403 (confidence = 0) | Paywalled; credible outlets often return 403 | ~61% |
| AI confirms credible outlet (confidence = 0.85) | Strong positive signal | ~81% |
| Verdict | 81% ≥ 65% Bayesian threshold | ✅ VERIFIED |
A broken URL from a known outlet lowers the probability only modestly (75% to about 61%). Strong AI confirmation then raises it to about 81%, which clears the 65% threshold for NEWS.
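The running tally in the table can be reproduced with two small conversion helpers. This is a sketch only: `toLogOdds` and `toProbability` are hypothetical names, and the per-layer shifts (-0.64 for the URL, +0.98 for the AI layer) are the rounded values shown in the computeBayesianScore example in the API section.

```typescript
// Convert between a probability and its log-odds representation.
const toLogOdds = (p: number): number => Math.log(p / (1 - p));
const toProbability = (logOdds: number): number => 1 / (1 + Math.exp(-logOdds));

const priorLogOdds = toLogOdds(0.75);   // NEWS prior, about +1.10
const afterUrl = priorLogOdds + -0.64;  // URL returned 403: evidence is subtracted
const afterAi = afterUrl + 0.98;        // AI confirms the outlet: evidence is added

const steps = [priorLogOdds, afterUrl, afterAi].map(toProbability);
// steps ≈ [0.75, 0.61, 0.81]
```

Working in log-odds is what lets each check be a plain addition; the conversion back to a probability happens once, at the end.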
v1 is simpler and faster. v2 adds three things:
- A domain-calibrated starting estimate. The prior accounts for base rates of hallucination by content type.
- Principled evidence combination. Bayes' theorem handles asymmetric layers gracefully (a weak layer barely moves the posterior; a strong layer moves it a lot).
- Per-layer explainability. logOddsContributions shows exactly which check helped and which hurt, making failures debuggable.
For most references, v1 and v2 agree. The difference shows up in edge cases: a paywalled article from a credible outlet, or a reference with strong AI support but a broken URL.
→ See the full API docs for computeBayesianScore
Column guide: LR+ = sensitivity / (1 - specificity); LR- = (1 - sensitivity) / specificity. Higher LR+ means a confident pass is stronger evidence of a real reference; lower LR- means a confident fail is stronger evidence of a fake.
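As a quick check on the column guide, the two ratios can be computed directly from a layer's sensitivity and specificity. The helper name `likelihoodRatios` is hypothetical; the input values are taken from the domain tables in this section.

```typescript
interface Diagnostic { sensitivity: number; specificity: number }

// LR+ and LR- exactly as defined in the column guide.
function likelihoodRatios({ sensitivity, specificity }: Diagnostic) {
  return {
    lrPlus: sensitivity / (1 - specificity),
    lrMinus: (1 - sensitivity) / specificity,
  };
}

const academicDoi = likelihoodRatios({ sensitivity: 0.92, specificity: 0.97 });
// lrPlus ≈ 30.67, lrMinus ≈ 0.08: a strong signal in both directions

const generalTitle = likelihoodRatios({ sensitivity: 0.30, specificity: 0.75 });
// lrPlus ≈ 1.20, lrMinus ≈ 0.93: both close to 1, so this layer barely moves the posterior
```

A ratio near 1 means the result is almost equally likely for real and fake references, which is why such a layer contributes little evidence either way.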
ACADEMIC: Peer-reviewed papers, preprints, books, technical reports
v1 threshold: ≥ 0.70 | v2 prior: 0.72 | v2 bayesianThreshold: ≥ 0.82
| Layer | v1 Weight | Sensitivity | Specificity | LR+ | LR- |
|---|---|---|---|---|---|
| doi | 0.45 | 0.92 | 0.97 | 30.67 | 0.08 |
| title_search | 0.30 | 0.80 | 0.88 | 6.67 | 0.23 |
| url | 0.10 | 0.70 | 0.72 | 2.50 | 0.42 |
| ai | 0.15 | 0.78 | 0.82 | 4.33 | 0.27 |
Classified by: DOI present, arXiv/PubMed/Nature/IEEE URL, PAPER/BOOK type
NEWS: Established news outlets (NYT, Reuters, BBC, AP, Guardian, Bloomberg…)
v1 threshold: ≥ 0.50 | v2 prior: 0.75 | v2 bayesianThreshold: ≥ 0.65
| Layer | v1 Weight | Sensitivity | Specificity | LR+ | LR- |
|---|---|---|---|---|---|
| url | 0.35 | 0.55 | 0.85 | 3.67 | 0.53 |
| ai | 0.65 | 0.82 | 0.80 | 4.10 | 0.23 |
Classified by: Reuters/NYT/BBC/AP/Guardian/Bloomberg/FT URL pattern, ARTICLE type · Lower v1 threshold because credible outlets often return 403/paywall
Paywall math: 0.65 × 0.85 = 0.5525 > 0.50. A credible outlet passes via AI even with a dead URL.
GOVERNMENT: Official government reports, legislation, statistics
v1 threshold: ≥ 0.55 | v2 prior: 0.82 | v2 bayesianThreshold: ≥ 0.72
| Layer | v1 Weight | Sensitivity | Specificity | LR+ | LR- |
|---|---|---|---|---|---|
| url | 0.40 | 0.85 | 0.93 | 12.14 | 0.16 |
| ai | 0.60 | 0.80 | 0.84 | 5.00 | 0.24 |
Classified by: .gov, who.int, un.org, worldbank.org, oecd.org URL patterns
GENERAL: Wikipedia, blogs, videos, podcasts, and other web content
v1 threshold: ≥ 0.55 | v2 prior: 0.45 | v2 bayesianThreshold: ≥ 0.68
| Layer | v1 Weight | Sensitivity | Specificity | LR+ | LR- |
|---|---|---|---|---|---|
| url | 0.30 | 0.65 | 0.70 | 2.17 | 0.50 |
| title_search | 0.10 | 0.30 | 0.75 | 1.20 | 0.93 |
| ai | 0.60 | 0.72 | 0.78 | 3.27 | 0.36 |
Classified by: Catch-all for anything not classified above
classifyReference follows a strict priority order:
flowchart TD
Start["Reference { doi, url, type }"] --> Q1{"DOI present?"}
Q1 -->|yes| ACAD["ACADEMIC"]
Q1 -->|no| Q2{"URL matches ACADEMIC patterns?"}
Q2 -->|yes| ACAD
Q2 -->|no| Q3{"URL matches NEWS patterns?"}
Q3 -->|yes| NEWS["NEWS"]
Q3 -->|no| Q4{"URL matches GOVERNMENT patterns?"}
Q4 -->|yes| GOV["GOVERNMENT"]
Q4 -->|no| Q5{"Type matches ACADEMIC types?"}
Q5 -->|yes| ACAD
Q5 -->|no| Q6{"ARTICLE type with matching URL?"}
Q6 -->|yes| NEWS
Q6 -->|no| GEN["GENERAL (fallback)"]
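The priority order in the flowchart can be sketched as a chain of early returns. Everything named here is an assumption for illustration: `sketchClassify` is not the package's classifyReference, the regular expressions are abbreviated stand-ins for the real URL patterns, and Q6 is read as "type is ARTICLE and a URL is present", which is only one possible interpretation of the flowchart.

```typescript
type Domain = 'ACADEMIC' | 'NEWS' | 'GOVERNMENT' | 'GENERAL';
interface Ref { doi?: string | null; url?: string | null; type?: string | null }

// Hypothetical, abbreviated patterns; the real lists live in the domain configs.
const ACADEMIC_URLS = [/arxiv\.org/i, /nature\.com/i];
const NEWS_URLS = [/nytimes\.com/i, /reuters\.com/i, /bbc\.(com|co\.uk)/i];
const GOVERNMENT_URLS = [/\.gov(\/|$)/i, /who\.int/i];
const ACADEMIC_TYPES = ['PAPER', 'BOOK'];

const matchesAny = (url: string | null | undefined, patterns: RegExp[]): boolean =>
  !!url && patterns.some((p) => p.test(url));

function sketchClassify(ref: Ref): Domain {
  if (ref.doi) return 'ACADEMIC';                                   // Q1: DOI wins outright
  if (matchesAny(ref.url, ACADEMIC_URLS)) return 'ACADEMIC';        // Q2
  if (matchesAny(ref.url, NEWS_URLS)) return 'NEWS';                // Q3
  if (matchesAny(ref.url, GOVERNMENT_URLS)) return 'GOVERNMENT';    // Q4
  if (ref.type && ACADEMIC_TYPES.indexOf(ref.type) !== -1) return 'ACADEMIC'; // Q5
  if (ref.type === 'ARTICLE' && ref.url) return 'NEWS';             // Q6 (assumed reading)
  return 'GENERAL';                                                 // fallback
}

const fromDoi = sketchClassify({ doi: '10.1000/example', url: 'https://www.nytimes.com/x' });
// → 'ACADEMIC', because the DOI check runs before any URL pattern
```

Ordering matters: URL patterns are checked before the declared type, so a reference typed PAPER but hosted on a news domain would land in NEWS under this sketch.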
Classify a reference into a content domain.
function classifyReference(ref: {
doi?: string | null;
url?: string | null;
type?: string | null;
}): ContentDomain
Compute a weighted-sum score for a given domain.
function computeDomainAwareScore(
domain: ContentDomain,
layerResults: LayerResult[]
): { score: number; verdict: 'VERIFIED' | 'FAILED' }
score is between 0 and 1. verdict is 'VERIFIED' if score >= domain.threshold, 'FAILED' otherwise.
Layer results for layers not applicable to the domain are ignored.
Compute a Bayesian posterior probability using log-odds updating.
function computeBayesianScore(
domain: ContentDomain,
layerResults: LayerResult[]
): {
posterior: number; // P(reference is real given evidence), 0.0 to 1.0
verdict: 'VERIFIED' | 'FAILED'; // posterior >= domain.bayesianThreshold
logOddsContributions: Record<string, number>; // per-layer Δ log-odds (for transparency)
}
Algorithm:
prior_log_odds = ln(prior / (1 - prior))
For each applicable layer with confidence c ∈ [0, 1]:
LR+ = sensitivity / (1 - specificity) (how much a pass shifts toward "real")
LR- = (1 - sensitivity) / specificity (how much a fail shifts toward "fake")
Δ = c × ln(LR+) + (1-c) × ln(LR-)
posterior = sigmoid(prior_log_odds + Σ Δ)
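Absent layers default to c = 0.5 (minimally informative). At c = 0.5 a layer contributes the average of ln(LR+) and ln(LR-), which is generally not exactly zero. logOddsContributions exposes each layer's Δ for debugging and explainability.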
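The algorithm above can be sketched for a single domain as follows. This is an illustration under stated assumptions, not the package source: `sketchBayesianScore`, `NEWS_LAYERS`, and the local interfaces are hypothetical names, and only the NEWS parameters from the domain table are hard-coded.

```typescript
interface LayerParams { sensitivity: number; specificity: number }
interface Evidence { layerId: string; confidence: number }

// NEWS parameters from the domain table; other domains would follow the same shape.
const NEWS_PRIOR = 0.75;
const NEWS_BAYESIAN_THRESHOLD = 0.65;
const NEWS_LAYERS: Record<string, LayerParams> = {
  url: { sensitivity: 0.55, specificity: 0.85 },
  ai: { sensitivity: 0.82, specificity: 0.80 },
};

function sketchBayesianScore(evidence: Evidence[]) {
  let logOdds = Math.log(NEWS_PRIOR / (1 - NEWS_PRIOR));
  const logOddsContributions: Record<string, number> = {};
  for (const layerId of Object.keys(NEWS_LAYERS)) {
    const { sensitivity, specificity } = NEWS_LAYERS[layerId];
    const found = evidence.filter((e) => e.layerId === layerId)[0];
    const c = found ? found.confidence : 0.5; // absent layers default to c = 0.5
    const delta =
      c * Math.log(sensitivity / (1 - specificity)) +       // c × ln(LR+)
      (1 - c) * Math.log((1 - sensitivity) / specificity);  // (1 - c) × ln(LR-)
    logOddsContributions[layerId] = delta;
    logOdds += delta;
  }
  const posterior = 1 / (1 + Math.exp(-logOdds)); // sigmoid
  const verdict = posterior >= NEWS_BAYESIAN_THRESHOLD ? 'VERIFIED' : 'FAILED';
  return { posterior, verdict, logOddsContributions };
}

const aiOnly = sketchBayesianScore([{ layerId: 'ai', confidence: 0.85 }]);
// The missing url layer falls back to c = 0.5 and contributes
// 0.5 × (ln LR+ + ln LR-) ≈ +0.33 for the NEWS url parameters.
```

Iterating over the domain's configured layers, rather than over the supplied results, is what makes the c = 0.5 default apply to layers that were never run.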
Example:
const { posterior, verdict, logOddsContributions } = computeBayesianScore('NEWS', [
{ layerId: 'url', passed: false, confidence: 0 }, // 403 paywall
{ layerId: 'ai', passed: true, confidence: 0.85 },
]);
// posterior ≈ 0.81, verdict: 'VERIFIED'
// logOddsContributions: { url: -0.64, ai: +0.98 }
const DOMAIN_CONFIGS: Record<ContentDomain, DomainConfig>
Full domain configuration map. Each DomainConfig includes:
interface DomainConfig {
domain: ContentDomain;
label: string; // 'Academic' | 'News' | 'Government' | 'Educational' | 'General'
description: string;
layers: BayesianLayerConfig[]; // applicable layers with weights + Bayesian params
threshold: number; // v1: minimum weighted score to VERIFY
prior: number; // v2: P(reference is real given domain)
bayesianThreshold: number; // v2: minimum posterior probability to VERIFY
aiInstruction: string; // injected into AI evaluator prompt
urlPatterns?: RegExp[]; // URL patterns for classification
typePatterns?: string[]; // ReferenceType values for classification
}
interface BayesianLayerConfig extends LayerConfig {
bayesian: {
sensitivity: number; // P(pass given real), 0.0 to 1.0
specificity: number; // P(fail given fake), 0.0 to 1.0
};
}
type ContentDomain = 'ACADEMIC' | 'NEWS' | 'GOVERNMENT' | 'EDUCATIONAL' | 'GENERAL';
type LayerId = 'doi' | 'title_search' | 'url' | 'ai';
interface LayerResult {
layerId: LayerId;
passed: boolean;
confidence: number; // 0.0 to 1.0
}
interface LayerConfig {
id: LayerId;
weight: number; // normalized weight, all layers in a domain sum to 1.0
description: string;
}
flowchart TD
Input["Reference Input<br/>{ doi, url, type }"] --> Classify["classifyReference()"]
Classify --> Domain["ContentDomain<br/>ACADEMIC / NEWS / GOVERNMENT / EDUCATIONAL / GENERAL"]
Domain --> URL["Layer: URL<br/>(HEAD check)"]
Domain --> AI["Layer: AI<br/>(LLM eval)"]
Domain --> DOI["Layer: DOI / title_search"]
URL --> Results["LayerResult[]<br/>{ layerId, passed, confidence }"]
AI --> Results
DOI --> Results
Results --> V1["v1: weighted sum<br/>sum of weight × confidence, then score ≥ threshold"]
Results --> V2["v2: Bayesian log-odds<br/>prior, per-layer update, then posterior ≥ threshold"]
V1 --> Out1["Output<br/>{ score, verdict }"]
V2 --> Out2["Output<br/>{ posterior, verdict, logOddsContributions }"]
src/
├── types.ts ContentDomain, LayerId, LayerResult, DomainConfig, BayesianLayerConfig
├── domains.ts DOMAIN_CONFIGS (the standard itself, including Bayesian params)
├── classify.ts classifyReference()
├── score.ts computeDomainAwareScore() [v1: weighted sum]
├── bayesian.ts computeBayesianScore() [v2: log-odds updating]
└── index.ts public exports
The standard has zero runtime dependencies. Pure TypeScript that works in any JS environment.
This standard is application-agnostic. Any tool that cites web sources, including RAG pipelines, research assistants, search and answer engines, and content generators, can use it to verify references and attach a domain-aware trust badge (Academic, News, Government, Educational, or General) to every citation.
It is maintained as a standalone, dependency-free package by Andres Romero. Sotto is one consumer, vendoring it as a submodule so every reference it surfaces is scored by the logic here. When the standard improves via community PRs, any consumer benefits by updating its dependency.
AI-generated content can inadvertently reflect a single political perspective when the source material fed into generation is ideologically one-sided. Output built entirely from sources rated "Left" by media-bias researchers will skew its framing, word choice, and which facts it emphasises, even if every cited reference passes verification.
Concrete example: A generated explainer on immigration policy sourced exclusively from outlets rated Left-Center produces accurate but one-sided content. Every URL resolves (✅ VERIFIED), yet a reader expecting balanced treatment is misled. Reference verification alone cannot catch this; it is orthogonal to the question of ideological balance.
Static media-bias lookup at content-extraction time, not at verification time.
The lookup runs once per source URL when content is first extracted, before generation runs. It annotates the extraction context with bias metadata. The generation prompt then receives conditional guidance, only when the topic is political, to seek balance or flag one-sidedness to the user.
This keeps bias detection cleanly separated from reference verification: the verification standard scores whether a reference is real; bias metadata informs whether the generation prompt should seek additional perspective.
flowchart TD
URLs["Source URLs<br/>(from content extraction)"] --> Extract["Domain extraction<br/>(strip protocol, path, query)"]
Extract --> Lookup["MBFC dataset lookup<br/>{ bias, credibility, country }"]
Lookup --> Detect{"Political topic?"}
Detect -->|yes| Inject["Inject bias guidance<br/>into the generation prompt"]
Detect -->|no| Skip["No bias guidance injected"]
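The first two stages of the flow (domain extraction, then a static lookup) might look like the sketch below. All names are hypothetical, the two dataset entries are invented placeholders rather than real MBFC ratings, and stripping a leading "www." is an assumption about how domains are normalised.

```typescript
interface BiasRecord { bias: string; credibility: string; country: string }

// Hypothetical stand-in for the bundled dataset; these entries are invented for illustration.
const SAMPLE_DATASET: Record<string, BiasRecord> = {
  'news.example.com': { bias: 'left-center', credibility: 'high', country: 'US' },
  'daily.example.org': { bias: 'right', credibility: 'mixed', country: 'US' },
};

// Reduce a source URL to a bare hostname: protocol, path, and query are dropped.
function extractDomain(sourceUrl: string): string | null {
  try {
    return new URL(sourceUrl).hostname.toLowerCase().replace(/^www\./, '');
  } catch {
    return null; // not a parseable absolute URL
  }
}

// Pure in-memory lookup: no network call is involved.
function lookupBias(sourceUrl: string): BiasRecord | null {
  const domain = extractDomain(sourceUrl);
  if (domain === null) return null;
  return Object.prototype.hasOwnProperty.call(SAMPLE_DATASET, domain)
    ? SAMPLE_DATASET[domain]
    : null;
}

const rated = lookupBias('https://www.news.example.com/politics/story?id=1');
// rated is the 'news.example.com' record; unknown or malformed URLs yield null
```

Returning null for unknown domains keeps the lookup from guessing a rating for outlets the dataset does not cover.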
Bias categories surfaced per source:
| Value | Meaning |
|---|---|
| left | Far-left leaning |
| left-center | Center-left leaning |
| center | Least-biased / centrist |
| right-center | Center-right leaning |
| right | Far-right leaning |
| conspiracy-pseudoscience | Promotes conspiracy theories or pseudoscience |
| satire | Satire, content should not be treated as factual |
| fake-news | Known misinformation outlet |
When all detected sources share the same non-center rating and the topic is political, the generation prompt is augmented with guidance to note the ideological lean to the reader and, where possible, incorporate contrasting framing.
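The rule in the previous paragraph reduces to a small predicate. This is a sketch with hypothetical names (`Bias`, `shouldInjectBalanceGuidance`); treating an empty list of ratings as "no guidance" is an assumption, since the text does not cover that case.

```typescript
type Bias =
  | 'left' | 'left-center' | 'center' | 'right-center' | 'right'
  | 'conspiracy-pseudoscience' | 'satire' | 'fake-news';

// True only when the topic is political AND every rated source shares one non-center rating.
function shouldInjectBalanceGuidance(ratings: Bias[], isPoliticalTopic: boolean): boolean {
  if (!isPoliticalTopic || ratings.length === 0) return false; // empty case is an assumption
  const first = ratings[0];
  return first !== 'center' && ratings.every((r) => r === first);
}

const uniformLean = shouldInjectBalanceGuidance(['left-center', 'left-center'], true);  // true
const mixedLean = shouldInjectBalanceGuidance(['left-center', 'right-center'], true);   // false
const nonPolitical = shouldInjectBalanceGuidance(['left', 'left'], false);              // false
```

Read literally, the rule requires an identical rating across sources, so a mix of left and left-center does not trigger guidance in this sketch.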
Dataset: drmikecrowe/mbfcext, a community-maintained mirror of Media Bias / Fact Check (MBFC) ratings, licensed MIT.
- Size: 9,773 sources (as of dataset release)
- Update cadence: Auto-updated daily from MBFC ratings via the upstream repository's CI
- Fields used: domain, bias, credibility, country
No network call is made at generation time. The dataset is bundled as a static JSON lookup.
- Does not reject sources. A source rated right or left is not excluded from the output. The verification standard continues to assess whether the reference is real.
- Does not editorialize. The system does not label content "biased" to the end user unprompted. Guidance is injected into the generation prompt, not the generated output.
- Does not apply to non-political topics. For technology tutorials, science explainers, and cooking guides, bias guidance is suppressed entirely when political topic detection returns negative.
| Limitation | Detail |
|---|---|
| One framework among several | MBFC is widely cited but not the only media bias rating system. AllSides and Ad Fontes Media use different methodologies and sometimes reach different conclusions for the same outlet. |
| US-centric dataset | MBFC coverage is strongest for US English-language media. Non-US sources are rated but coverage is uneven; many regional outlets are absent from the dataset entirely. |
| Source-level ≠ article-level | A center-rated outlet can publish a one-sided op-ed. A left-rated outlet can publish a balanced investigative piece. The lookup reflects outlet-level ratings, not individual article analysis. |
| Static snapshot | The bundled dataset reflects ratings at a point in time. Outlets change ownership and editorial stance; the dataset may lag real-world shifts by weeks or months. |
| No confidence score | MBFC ratings are categorical, not probabilistic. The dataset does not expose reviewer agreement or confidence, so every rating is treated with equal weight regardless of how contested it may be. |
Community contributions that extend coverage to non-US sources, integrate a second bias framework, or add article-level analysis are welcome. See CONTRIBUTING.md.
The Bayesian log-odds algorithm is a well-established pattern in statistics and evidence-based medicine. Each layer's sensitivity/specificity pair determines its likelihood ratios, and evidence accumulates additively in log space, a formulation due to Good (1950). The domain priors are meant to approximate base rates of hallucinated references by content type, motivated by factuality benchmarks showing that hallucination rates differ significantly across content domains.
Methodology
| Citation | Relevance |
|---|---|
| Good, I.J. (1950). Probability and the Weighing of Evidence. Charles Griffin. | Formalizes initial_log_odds + weight_of_evidence = final_log_odds where weight of evidence = log(likelihood ratio), the core formula used here |
| Wald, A. (1947). Sequential Analysis. Wiley. | Foundation of sequential likelihood-ratio testing; the theoretical precursor to log-odds evidence accumulation |
| Fagan, T.J. (1975). "Nomogram for Bayes's theorem." New England Journal of Medicine, 293, 257. | Graphical tool for applying Bayes' theorem via likelihood ratios to convert pre-test to post-test probability |
| Jaeschke, R., Guyatt, G.H., & Sackett, D.L. (1994). "Users' Guides to the Medical Literature III-B: How to Use an Article About a Diagnostic Test." JAMA, 271(9), 703-707. | Practical guide to interpreting diagnostic tests via sensitivity, specificity, and likelihood ratios |
| Good, I.J. (1985). "Weight of Evidence: A Brief Survey." In Bayesian Statistics 2, pp. 249-270. | Accessible summary of the weight-of-evidence framework by Good himself |
Application domain
| Citation | Relevance |
|---|---|
| Manakul, P., Liusie, A., & Gales, M.J.F. (2023). "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." EMNLP 2023. arXiv:2303.08896 | Motivates hallucination detection in AI-generated content; demonstrates that consistency varies by content type |
| Min, S., et al. (2023). "FActScore: Fine-Grained Atomic Factuality Evaluation in Long-Form Text Generation." EMNLP 2023. arXiv:2305.14251 | Fine-grained factuality evaluation showing hallucination rates differ significantly by domain, which motivates domain-specific priors |
Calibration note: The sensitivity, specificity, and prior values in src/domains.ts are expert-set heuristics, not empirically calibrated against a labeled dataset. They reflect informed judgment about relative layer reliability. Empirically calibrating these against real vs. hallucinated reference data would be a high-value community contribution. See CONTRIBUTING.md for details.
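As a starting point for such a calibration, a layer's sensitivity and specificity can be estimated from labeled outcomes using the definitions given earlier (pass rate on real references, fail rate on fake ones). The names `LabeledOutcome` and `estimateDiagnostics` and the sample counts are hypothetical; a real calibration would also need enough data per domain and some treatment of uncertainty.

```typescript
// One labeled observation for a single layer: was the reference real, and did the check pass?
interface LabeledOutcome { isReal: boolean; passed: boolean }

function estimateDiagnostics(samples: LabeledOutcome[]) {
  let truePos = 0, falseNeg = 0, trueNeg = 0, falsePos = 0;
  for (const s of samples) {
    if (s.isReal) {
      if (s.passed) truePos++; else falseNeg++;
    } else {
      if (s.passed) falsePos++; else trueNeg++;
    }
  }
  return {
    sensitivity: truePos / (truePos + falseNeg), // P(pass given real); NaN if no real samples
    specificity: trueNeg / (trueNeg + falsePos), // P(fail given fake); NaN if no fake samples
  };
}

// Invented toy data: 4 real references (3 pass) and 5 fake references (4 fail).
const toySamples: LabeledOutcome[] = [
  { isReal: true, passed: true }, { isReal: true, passed: true },
  { isReal: true, passed: true }, { isReal: true, passed: false },
  { isReal: false, passed: false }, { isReal: false, passed: false },
  { isReal: false, passed: false }, { isReal: false, passed: false },
  { isReal: false, passed: true },
];
const estimated = estimateDiagnostics(toySamples);
// estimated.sensitivity = 0.75, estimated.specificity = 0.8
```

The resulting pair could then replace a layer's heuristic values in a domain config, with the prior estimated separately as the share of real references in the labeled set.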
This standard is intentionally public. Weights, thresholds, URL patterns, and AI instructions are all community-improvable. If a credible source is being rejected, or a junk source is passing, open an issue or PR.
Key things you can improve:
- URL patterns: add a news outlet, government agency, or academic publisher that's being misclassified
- Weights: propose evidence-backed changes to layer weights or thresholds
- AI instructions: improve the prompt guidance for each domain's AI evaluator
- New domains: make the case for a new domain (for example SOCIAL_MEDIA or PREPRINT)
MIT. Copyright © 2024 Andres Romero