aayambansal/ExecutableClaims
Executable Claims

A writing-time assistant that extracts scientific claims from manuscripts, grounds them against cited evidence, and compiles executable test capsules for verification.

Paper: Executable Claims: A Writing-Time Framework for Automated Claim Verification via Evidence Capsules (AAAI 2026 Workshop on AI for Research)


Overview

Executable Claims operates as a real-time layer over the scientific writing process. As you write, it:

  1. Mines claims from manuscript text using LLM-based extraction with regex-backed validation
  2. Retrieves evidence from the cited literature via arXiv, Semantic Scholar, and embedded PDFs (hybrid BM25 + dense retrieval)
  3. Checks entailment through joint textual and numeric verification (unit-aware via pint, default +/-5% tolerance)
  4. Builds capsules -- lightweight, executable pytest scripts and Jupyter notebooks that encode each claim as a runnable test
  5. Surfaces counterevidence to reduce confirmation bias by flagging contradictions in the literature
Manuscript Text --> Claim Miner (GPT-4o / Claude)
                       |
                       v
                Evidence Retriever (arXiv + S2 + BibTeX + PDF)
                       |
                       v
                Entailment Checker (textual + numeric w/ unit normalization)
                       |
                       v
                Capsule Builder (pytest + Jupyter)
                       |
                       v
                Results: verified / unsupported / contradicted
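The numeric half of step 3 comes down to unit normalization plus a relative-tolerance comparison. The project does this with pint; the sketch below substitutes a tiny hand-rolled conversion table so it runs with no dependencies, and `check_numeric` is an illustrative name, not the project's API:

```python
# Dependency-free sketch of the step-3 numeric check. The real system uses
# pint for unit handling; a small conversion table stands in for it here.
UNIT_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "microseconds": 1e-6}

def to_base(value: float, unit: str) -> float:
    """Normalize a time quantity to seconds."""
    return value * UNIT_TO_SECONDS[unit]

def check_numeric(claimed, claimed_unit, evidence, evidence_unit, rel_tol=0.05):
    """True if the claimed value matches the cited evidence within rel_tol
    (default +/-5%), after converting both to a common base unit."""
    c = to_base(claimed, claimed_unit)
    e = to_base(evidence, evidence_unit)
    return abs(c - e) <= rel_tol * abs(e)

# A claim of 1.2 ms against cited 1150 microseconds passes at +/-5%:
print(check_numeric(1.2, "ms", 1150, "microseconds"))  # True
```

With unit normalization, a claim and its evidence can disagree on units yet still verify, which is exactly the case a plain string or float comparison would miss.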

Repository Structure

executable-claims/
├── backend/                # FastAPI service (claim mining, retrieval, entailment, capsule generation)
│   ├── core/               #   claim_miner, pdf_parser, evidence_retriever, entailment_checker, capsule_builder
│   ├── services/           #   LLM (OpenAI + Anthropic), arXiv, Semantic Scholar, retrieval
│   ├── api/                #   REST endpoints (/analyze, /claims, /capsules, /health)
│   └── config.py           #   Environment-based configuration
│
├── frontend/               # React + TypeScript + Vite + Tailwind demo UI
│   └── src/                #   Demo page (paste manuscript), Results page (annotated claims)
│
├── paper_scripts/          # Experiment & analysis scripts
│   ├── compute_statistics.py               # Bootstrap CIs, paired tests, effect sizes
│   ├── model_comparison.py                 # GPT-4o / GPT-5 / Claude Sonnet / Opus comparison
│   ├── llm_calibration.py                  # LLM-as-judge vs human (Cohen's kappa)
│   ├── tolerance_sensitivity.py            # Numeric tolerance sweep (1%-20%)
│   ├── enhanced_baselines.py               # Cross-encoder, BM25+RM3, oracle baselines
│   ├── generate_figures_enhanced.py        # Publication-quality figures with CIs
│   ├── expanded_entailment_calibration.py  # Real API calibration (N=220)
│   ├── test_third_party_capsules.py        # Capsule generation on third-party arXiv papers
│   ├── reviewer_verification.py            # Automated result consistency checks
│   └── ...                                 # + 12 more scripts
│
├── research_paper/         # Experimental data & figures (paper source excluded)
│   ├── data/               #   16 JSON + CSV result files (ablation, baselines, calibration, etc.)
│   ├── figures/            #   12 publication-quality PNG figures
│   └── REPRODUCIBILITY.md  #   Reproducibility guide
│
├── benchmark/              # Evaluation framework
│   ├── annotations/        #   Gold-standard annotation schema
│   └── scripts/            #   Groundedness@k, Span F1 evaluation
│
├── capsules/               # Evidence capsule artifacts
│   ├── examples/           #   Example pytest capsule
│   └── templates/          #   Capsule boilerplate template
│
├── inkvell-sdk/            # TypeScript SDK for editor integration
├── docker/                 # Dockerfile + docker-compose
├── scripts/                # setup.sh, test_system.py
└── docs/                   # API.md, INTEGRATION.md
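A generated capsule (see capsules/examples/ for the real artifacts) is just a pytest file that pins the claimed number against the value extracted from the cited evidence. The sketch below is hypothetical -- the claim text, numbers, and test name are invented for illustration:

```python
# Hypothetical evidence capsule for the claim:
#   "Our model achieves 95.3% accuracy on ImageNet"
# Runnable standalone or via pytest; values here are invented examples.
import math

CLAIMED_ACCURACY = 95.3   # number stated in the manuscript
EVIDENCE_ACCURACY = 95.1  # number extracted from the cited source
TOLERANCE = 0.05          # the default +/-5% relative tolerance

def test_claim_accuracy_within_tolerance():
    """Capsule test: the claimed number must match the cited evidence."""
    assert math.isclose(CLAIMED_ACCURACY, EVIDENCE_ACCURACY, rel_tol=TOLERANCE)
```

Because each capsule is plain pytest, re-verifying a manuscript's claims is just a matter of re-running the test suite.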

Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • API keys: OpenAI, Anthropic (optional), Semantic Scholar (optional)

Backend

cd backend
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # then fill in your API keys
python main.py         # http://localhost:8000

Frontend

cd frontend
npm install
npm run dev            # http://localhost:5173

Docker

cd docker
docker-compose up      # backend :8000, frontend :5173

Verify

curl http://localhost:8000/api/v1/health
python scripts/test_system.py

Key Results

Evaluated on a 500-claim benchmark across 67 papers spanning ML, biomedicine, and physics:

Metric                                 Score
Groundedness@5 (retrieval)             89.2%
Entailment Accuracy                    87.3%
Numeric Pass Rate (within +/-5%)       91.7%
Capsule Execution Success              94.2%
LLM-Human Agreement (Cohen's kappa)     0.72

See research_paper/data/ for full experimental data and paper_scripts/ to reproduce all results.


API

POST /api/v1/analyze -- Submit manuscript text for claim extraction and verification

{
  "text": "Our model achieves 95.3% accuracy on ImageNet...",
  "extract_counterevidence": true,
  "generate_capsules": true
}
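From Python, the call can be sketched with the standard library alone. The endpoint and request fields come from this README; the helper names are illustrative, the response schema is in docs/API.md, and the backend from the Quick Start must already be running:

```python
# Sketch of calling POST /api/v1/analyze with only the standard library.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # backend address from the Quick Start

def build_payload(text: str) -> bytes:
    """Encode the request body documented above."""
    return json.dumps({
        "text": text,
        "extract_counterevidence": True,
        "generate_capsules": True,
    }).encode()

def analyze(text: str) -> dict:
    """POST manuscript text to the backend and return the parsed JSON."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/v1/analyze",
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Inspect the encoded request body without needing a live server:
print(json.loads(build_payload("Our model achieves 95.3% accuracy...")))
```

Calling analyze() then returns the analysis id, which the GET endpoints below consume.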

GET /api/v1/analysis/{id} -- Retrieve results (claims, evidence, capsules, stats)

GET /api/v1/capsule/{claim_id}/download?format=python -- Download capsule as .py or .ipynb

Full API reference: docs/API.md


Reproducing Experiments

cd paper_scripts
pip install -r requirements.txt

# Run all statistics
python compute_statistics.py

# Generate all figures
python generate_figures_enhanced.py

# Model comparison (requires API keys)
python model_comparison.py

# Verify result consistency
python verify_all_results.py

License

MIT -- see LICENSE.
