A writing-time assistant that extracts scientific claims from manuscripts, grounds them against cited evidence, and compiles executable test capsules for verification.
Paper: Executable Claims: A Writing-Time Framework for Automated Claim Verification via Evidence Capsules (AAAI 2026 Workshop on AI for Research)
Executable Claims operates as a real-time layer over the scientific writing process. As you write, it:
- Mines claims from manuscript text using LLM-based extraction with regex-backed validation
- Retrieves evidence from the cited literature via arXiv, Semantic Scholar, and embedded PDFs (hybrid BM25 + dense retrieval)
- Checks entailment through joint textual and numeric verification (unit-aware via pint, default ±5% tolerance)
- Builds capsules -- lightweight, executable pytest scripts and Jupyter notebooks that encode each claim as a runnable test
- Surfaces counterevidence to reduce confirmation bias by flagging contradictions in the literature
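The unit-aware numeric check can be sketched roughly as follows. The real implementation relies on pint for unit normalization; here a tiny hand-rolled conversion table stands in so the sketch is self-contained, and the function name, table, and values are all illustrative.

```python
# Minimal stand-in for the unit-aware numeric check described above.
# The real system uses pint; this table keeps the sketch dependency-free.
UNITS = {
    "m":  ("length", 1.0),
    "km": ("length", 1000.0),
    "s":  ("time", 1.0),
    "ms": ("time", 1e-3),
    "%":  ("ratio", 0.01),
}

def numeric_match(claimed, evidence, tolerance=0.05):
    """True if two (value, unit) pairs agree within a relative tolerance
    after converting both to base units; False on incompatible dimensions."""
    (cv, cu), (ev, eu) = claimed, evidence
    cdim, cf = UNITS[cu]
    edim, ef = UNITS[eu]
    if cdim != edim:
        return False  # e.g. comparing metres against seconds
    c, e = cv * cf, ev * ef
    return abs(c - e) <= tolerance * abs(e)

print(numeric_match((1.2, "km"), (1190, "m")))  # True: within 5%
print(numeric_match((1.2, "km"), (3.0, "s")))   # False: incompatible units
```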
```
Manuscript Text --> Claim Miner (GPT-4o / Claude)
                          |
                          v
            Evidence Retriever (arXiv + S2 + BibTeX + PDF)
                          |
                          v
            Entailment Checker (textual + numeric w/ unit normalization)
                          |
                          v
            Capsule Builder (pytest + Jupyter)
                          |
                          v
            Results: verified / unsupported / contradicted
```
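For a sense of what the Capsule Builder emits, here is a hypothetical capsule; the claim text, numbers, and variable names are made up for illustration, not real builder output. A real capsule is a pytest script, so the function is named so pytest would collect it, but it also runs standalone.

```python
# Hypothetical evidence capsule; all values are illustrative.
# A generated capsule pins the claimed value, the evidence value it was
# grounded against, and the tolerance used at verification time.
CLAIMED_ACC = 95.3    # % accuracy stated in the manuscript
EVIDENCE_ACC = 95.1   # % accuracy extracted from the cited source
TOLERANCE = 0.05      # default relative tolerance (±5%)

def test_claim_accuracy_within_tolerance():
    """Claim: 'Our model achieves 95.3% accuracy on ImageNet.'"""
    assert abs(CLAIMED_ACC - EVIDENCE_ACC) <= TOLERANCE * abs(EVIDENCE_ACC)

if __name__ == "__main__":
    test_claim_accuracy_within_tolerance()
    print("capsule: verified")
```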
```
executable-claims/
├── backend/                  # FastAPI service (claim mining, retrieval, entailment, capsule generation)
│   ├── core/                 # claim_miner, pdf_parser, evidence_retriever, entailment_checker, capsule_builder
│   ├── services/             # LLM (OpenAI + Anthropic), arXiv, Semantic Scholar, retrieval
│   ├── api/                  # REST endpoints (/analyze, /claims, /capsules, /health)
│   └── config.py             # Environment-based configuration
│
├── frontend/                 # React + TypeScript + Vite + Tailwind demo UI
│   └── src/                  # Demo page (paste manuscript), Results page (annotated claims)
│
├── paper_scripts/            # Experiment & analysis scripts
│   ├── compute_statistics.py              # Bootstrap CIs, paired tests, effect sizes
│   ├── model_comparison.py                # GPT-4o / GPT-5 / Claude Sonnet / Opus comparison
│   ├── llm_calibration.py                 # LLM-as-judge vs human (Cohen's kappa)
│   ├── tolerance_sensitivity.py           # Numeric tolerance sweep (1%-20%)
│   ├── enhanced_baselines.py              # Cross-encoder, BM25+RM3, oracle baselines
│   ├── generate_figures_enhanced.py       # Publication-quality figures with CIs
│   ├── expanded_entailment_calibration.py # Real API calibration (N=220)
│   ├── test_third_party_capsules.py       # Capsule generation on third-party arXiv papers
│   ├── reviewer_verification.py           # Automated result consistency checks
│   └── ...                                # + 12 more scripts
│
├── research_paper/           # Experimental data & figures (paper source excluded)
│   ├── data/                 # 16 JSON + CSV result files (ablation, baselines, calibration, etc.)
│   ├── figures/              # 12 publication-quality PNG figures
│   └── REPRODUCIBILITY.md    # Reproducibility guide
│
├── benchmark/                # Evaluation framework
│   ├── annotations/          # Gold-standard annotation schema
│   └── scripts/              # Groundedness@k, Span F1 evaluation
│
├── capsules/                 # Evidence capsule artifacts
│   ├── examples/             # Example pytest capsule
│   └── templates/            # Capsule boilerplate template
│
├── inkvell-sdk/              # TypeScript SDK for editor integration
├── docker/                   # Dockerfile + docker-compose
├── scripts/                  # setup.sh, test_system.py
└── docs/                     # API.md, INTEGRATION.md
```
Prerequisites:

- Python 3.11+
- Node.js 18+
- API keys: OpenAI, Anthropic (optional), Semantic Scholar (optional)
Backend:

```
cd backend
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then fill in your API keys
python main.py         # http://localhost:8000
```

Frontend:

```
cd frontend
npm install
npm run dev            # http://localhost:5173
```

Or run everything with Docker:

```
cd docker
docker-compose up      # backend :8000, frontend :5173
```

Smoke test:

```
curl http://localhost:8000/api/v1/health
python scripts/test_system.py
```

Evaluated on a 500-claim benchmark across 67 papers spanning ML, biomedicine, and physics:
| Metric | Score |
|---|---|
| Groundedness@5 (retrieval) | 89.2% |
| Entailment Accuracy | 87.3% |
| Numeric Pass Rate (within ±5%) | 91.7% |
| Capsule Execution Success | 94.2% |
| LLM-Human Agreement (Cohen's kappa) | 0.72 |
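The agreement row is Cohen's kappa, i.e. chance-corrected agreement between the LLM judge and human annotators. A worked sketch of the computation, with made-up 2x2 confusion counts (not the paper's data):

```python
# Cohen's kappa for a 2x2 agreement table between two raters.
# Counts below are illustrative, not the study's actual confusion matrix.
def cohens_kappa(a, b, c, d):
    """a = both say 'verified', d = both say 'not verified',
    b and c = the two kinds of disagreement."""
    n = a + b + c + d
    po = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

print(round(cohens_kappa(80, 10, 10, 100), 3))  # 0.798
```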
See research_paper/data/ for full experimental data and paper_scripts/ to reproduce all results.
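compute_statistics.py reports bootstrap confidence intervals; the general recipe looks roughly like the percentile bootstrap below. The per-paper accuracies, resample count, and function name are illustrative, not taken from the script.

```python
# Percentile bootstrap CI for a statistic of a 1-D sample (sketch).
import random

def bootstrap_ci(data, stat=lambda xs: sum(xs) / len(xs),
                 n_resamples=10_000, alpha=0.05, seed=0):
    """Resample with replacement, recompute the statistic each time,
    and take the alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    stats = sorted(stat([rng.choice(data) for _ in data])
                   for _ in range(n_resamples))
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# e.g. per-paper entailment accuracies (illustrative numbers)
accs = [0.85, 0.91, 0.88, 0.84, 0.90, 0.87, 0.89, 0.86]
lo, hi = bootstrap_ci(accs)
print(f"mean={sum(accs) / len(accs):.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```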
`POST /api/v1/analyze` -- Submit manuscript text for claim extraction and verification

```json
{
  "text": "Our model achieves 95.3% accuracy on ImageNet...",
  "extract_counterevidence": true,
  "generate_capsules": true
}
```

`GET /api/v1/analysis/{id}` -- Retrieve results (claims, evidence, capsules, stats)

`GET /api/v1/capsule/{claim_id}/download?format=python` -- Download capsule as .py or .ipynb
Full API reference: docs/API.md
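A minimal stdlib-only client for the analyze endpoint might look like this. The request body mirrors the example above; the response handling is an assumption -- docs/API.md is authoritative.

```python
# Hypothetical client sketch for POST /api/v1/analyze (stdlib only).
import json
from urllib.request import Request, urlopen

BASE = "http://localhost:8000/api/v1"

payload = {
    "text": "Our model achieves 95.3% accuracy on ImageNet...",
    "extract_counterevidence": True,
    "generate_capsules": True,
}

req = Request(f"{BASE}/analyze",
              data=json.dumps(payload).encode("utf-8"),
              headers={"Content-Type": "application/json"},
              method="POST")

def analyze():
    """Send the request and return the parsed JSON response
    (requires a running backend; response shape is assumed)."""
    with urlopen(req) as resp:
        return json.load(resp)

print(req.method, req.full_url)  # POST http://localhost:8000/api/v1/analyze
```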
To reproduce the paper's results:

```
cd paper_scripts
pip install -r requirements.txt

# Run all statistics
python compute_statistics.py

# Generate all figures
python generate_figures_enhanced.py

# Model comparison (requires API keys)
python model_comparison.py

# Verify result consistency
python verify_all_results.py
```

MIT -- see LICENSE.