A writing-time assistant that extracts scientific claims from manuscripts, grounds them against cited evidence, and compiles executable test capsules for verification.
Paper: Executable Claims: A Writing-Time Framework for Automated Claim Verification via Evidence Capsules (AAAI 2026 Workshop on AI for Research)
Executable Claims operates as a real-time layer over the scientific writing process. As you write, it:
- Mines claims from manuscript text using LLM-based extraction with regex-backed validation
- Retrieves evidence from the cited literature via arXiv, Semantic Scholar, and embedded PDFs (hybrid BM25 + dense retrieval)
- Checks entailment through joint textual and numeric verification (unit-aware via pint, default ±5% tolerance)
- Builds capsules -- lightweight, executable pytest scripts and Jupyter notebooks that encode each claim as a runnable test
- Surfaces counterevidence to reduce confirmation bias by flagging contradictions in the literature
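The unit-aware numeric check can be sketched roughly as follows. The real implementation relies on pint for unit normalization; here a tiny hand-rolled conversion table stands in so the sketch is self-contained, and the function name, table, and values are all illustrative.

```python
# Minimal stand-in for the unit-aware numeric check described above.
# The real system uses pint; this table keeps the sketch dependency-free.
UNITS = {
    "m":  ("length", 1.0),
    "km": ("length", 1000.0),
    "s":  ("time", 1.0),
    "ms": ("time", 1e-3),
    "%":  ("ratio", 0.01),
}

def numeric_match(claimed, evidence, tolerance=0.05):
    """True if two (value, unit) pairs agree within a relative tolerance
    after converting both to base units; False on incompatible dimensions."""
    (cv, cu), (ev, eu) = claimed, evidence
    cdim, cf = UNITS[cu]
    edim, ef = UNITS[eu]
    if cdim != edim:
        return False  # e.g. comparing metres against seconds
    c, e = cv * cf, ev * ef
    return abs(c - e) <= tolerance * abs(e)

print(numeric_match((1.2, "km"), (1190, "m")))  # True: within 5%
print(numeric_match((1.2, "km"), (3.0, "s")))   # False: incompatible units
```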
```
Manuscript Text --> Claim Miner (GPT-4o / Claude)
                          |
                          v
            Evidence Retriever (arXiv + S2 + BibTeX + PDF)
                          |
                          v
            Entailment Checker (textual + numeric w/ unit normalization)
                          |
                          v
            Capsule Builder (pytest + Jupyter)
                          |
                          v
            Results: verified / unsupported / contradicted
```
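For a sense of what the Capsule Builder emits, here is a hypothetical capsule; the claim text, numbers, and variable names are made up for illustration, not real builder output. A real capsule is a pytest script, so the function is named so pytest would collect it, but it also runs standalone.

```python
# Hypothetical evidence capsule; all values are illustrative.
# A generated capsule pins the claimed value, the evidence value it was
# grounded against, and the tolerance used at verification time.
CLAIMED_ACC = 95.3    # % accuracy stated in the manuscript
EVIDENCE_ACC = 95.1   # % accuracy extracted from the cited source
TOLERANCE = 0.05      # default relative tolerance (±5%)

def test_claim_accuracy_within_tolerance():
    """Claim: 'Our model achieves 95.3% accuracy on ImageNet.'"""
    assert abs(CLAIMED_ACC - EVIDENCE_ACC) <= TOLERANCE * abs(EVIDENCE_ACC)

if __name__ == "__main__":
    test_claim_accuracy_within_tolerance()
    print("capsule: verified")
```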
```
executable-claims/
├── backend/                  # FastAPI service (claim mining, retrieval, entailment, capsule generation)
│   ├── core/                 # claim_miner, pdf_parser, evidence_retriever, entailment_checker, capsule_builder
│   ├── services/             # LLM (OpenAI + Anthropic), arXiv, Semantic Scholar, retrieval
│   ├── api/                  # REST endpoints (/analyze, /claims, /capsules, /health)
│   └── config.py             # Environment-based configuration
│
├── frontend/                 # React + TypeScript + Vite + Tailwind demo UI
│   └── src/                  # Demo page (paste manuscript), Results page (annotated claims)
│
├── paper_scripts/            # Experiment & analysis scripts
│   ├── compute_statistics.py              # Bootstrap CIs, paired tests, effect sizes
│   ├── model_comparison.py                # GPT-4o / GPT-5 / Claude Sonnet / Opus comparison
│   ├── llm_calibration.py                 # LLM-as-judge vs human (Cohen's kappa)
│   ├── tolerance_sensitivity.py           # Numeric tolerance sweep (1%-20%)
│   ├── enhanced_baselines.py              # Cross-encoder, BM25+RM3, oracle baselines
│   ├── generate_figures_enhanced.py       # Publication-quality figures with CIs
│   ├── expanded_entailment_calibration.py # Real API calibration (N=220)
│   ├── test_third_party_capsules.py       # Capsule generation on third-party arXiv papers
│   ├── reviewer_verification.py           # Automated result consistency checks
│   └── ...                                # + 12 more scripts
│
├── research_paper/           # Experimental data & figures (paper source excluded)
│   ├── data/                 # 16 JSON + CSV result files (ablation, baselines, calibration, etc.)
│   ├── figures/              # 12 publication-quality PNG figures
│   └── REPRODUCIBILITY.md    # Reproducibility guide
│
├── benchmark/                # Evaluation framework
│   ├── annotations/          # Gold-standard annotation schema
│   └── scripts/              # Groundedness@k, Span F1 evaluation
│
├── capsules/                 # Evidence capsule artifacts
│   ├── examples/             # Example pytest capsule
│   └── templates/            # Capsule boilerplate template
│
├── inkvell-sdk/              # TypeScript SDK for editor integration
├── docker/                   # Dockerfile + docker-compose
├── scripts/                  # setup.sh, test_system.py
└── docs/                     # API.md, INTEGRATION.md
```
Prerequisites:

- Python 3.11+
- Node.js 18+
- API keys: OpenAI, Anthropic (optional), Semantic Scholar (optional)
Backend:

```
cd backend
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then fill in your API keys
python main.py         # http://localhost:8000
```

Frontend:

```
cd frontend
npm install
npm run dev            # http://localhost:5173
```

Or run everything with Docker:

```
cd docker
docker-compose up      # backend :8000, frontend :5173
```

Smoke test:

```
curl http://localhost:8000/api/v1/health
python scripts/test_system.py
```

Evaluated on a 500-claim benchmark across 67 papers spanning ML, biomedicine, and physics:
| Metric | Score |
|---|---|
| Groundedness@5 (retrieval) | 89.2% |
| Entailment Accuracy | 87.3% |
| Numeric Pass Rate (within ±5%) | 91.7% |
| Capsule Execution Success | 94.2% |
| LLM-Human Agreement (Cohen's kappa) | 0.72 |
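The agreement row is Cohen's kappa, i.e. chance-corrected agreement between the LLM judge and human annotators. A worked sketch of the computation, with made-up 2x2 confusion counts (not the paper's data):

```python
# Cohen's kappa for a 2x2 agreement table between two raters.
# Counts below are illustrative, not the study's actual confusion matrix.
def cohens_kappa(a, b, c, d):
    """a = both say 'verified', d = both say 'not verified',
    b and c = the two kinds of disagreement."""
    n = a + b + c + d
    po = (a + d) / n                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # chance agreement
    return (po - pe) / (1 - pe)

print(round(cohens_kappa(80, 10, 10, 100), 3))  # 0.798
```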
See research_paper/data/ for full experimental data and paper_scripts/ to reproduce all results.
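compute_statistics.py reports bootstrap confidence intervals; the general recipe looks roughly like the percentile bootstrap below. The per-paper accuracies, resample count, and function name are illustrative, not taken from the script.

```python
# Percentile bootstrap CI for a statistic of a 1-D sample (sketch).
import random

def bootstrap_ci(data, stat=lambda xs: sum(xs) / len(xs),
                 n_resamples=10_000, alpha=0.05, seed=0):
    """Resample with replacement, recompute the statistic each time,
    and take the alpha/2 and 1-alpha/2 percentiles."""
    rng = random.Random(seed)
    stats = sorted(stat([rng.choice(data) for _ in data])
                   for _ in range(n_resamples))
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# e.g. per-paper entailment accuracies (illustrative numbers)
accs = [0.85, 0.91, 0.88, 0.84, 0.90, 0.87, 0.89, 0.86]
lo, hi = bootstrap_ci(accs)
print(f"mean={sum(accs) / len(accs):.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```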
`POST /api/v1/analyze` -- Submit manuscript text for claim extraction and verification

```json
{
  "text": "Our model achieves 95.3% accuracy on ImageNet...",
  "extract_counterevidence": true,
  "generate_capsules": true
}
```

`GET /api/v1/analysis/{id}` -- Retrieve results (claims, evidence, capsules, stats)

`GET /api/v1/capsule/{claim_id}/download?format=python` -- Download capsule as .py or .ipynb
Full API reference: docs/API.md
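A minimal stdlib-only client for the analyze endpoint might look like this. The request body mirrors the example above; the response handling is an assumption -- docs/API.md is authoritative.

```python
# Hypothetical client sketch for POST /api/v1/analyze (stdlib only).
import json
from urllib.request import Request, urlopen

BASE = "http://localhost:8000/api/v1"

payload = {
    "text": "Our model achieves 95.3% accuracy on ImageNet...",
    "extract_counterevidence": True,
    "generate_capsules": True,
}

req = Request(f"{BASE}/analyze",
              data=json.dumps(payload).encode("utf-8"),
              headers={"Content-Type": "application/json"},
              method="POST")

def analyze():
    """Send the request and return the parsed JSON response
    (requires a running backend; response shape is assumed)."""
    with urlopen(req) as resp:
        return json.load(resp)

print(req.method, req.full_url)  # POST http://localhost:8000/api/v1/analyze
```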
To reproduce the paper's results:

```
cd paper_scripts
pip install -r requirements.txt

# Run all statistics
python compute_statistics.py

# Generate all figures
python generate_figures_enhanced.py

# Model comparison (requires API keys)
python model_comparison.py

# Verify result consistency
python verify_all_results.py
```

MIT -- see LICENSE.