Draft-and-Prune

A neuro-symbolic tool that combines LLMs with symbolic solvers for logical reasoning using multi-path ensemble with sketched planning, code generation, and path pruning.

Paper (Arxiv): Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

Related work (ICLR 2026: Agents in the wild workshop): Agentified Assessment of Logical Reasoning Agents: an assessor agent runs the benchmark loop so evaluation stays reproducible through symbolic failures; the system under test only needs a standard A2A interface.

Benchmarks we released: We fixed and republished two widely used logical-reasoning benchmark — FOLIO and AR-LSAT.

Leaderboard: A live leaderboard for comparing frontier models on formal logical reasoning benchmarks such as AR-LSAT and ProverQA Hard under a unified zero-shot adaptive-agent protocol: https://zyni2001.github.io/draft-and-prune/

Results preview (AR-LSAT)

Accuracy vs. number of sampled paths (Ours vs. CoT-SC) on the cleaned AR-LSAT split, for GPT-5.2 (left) and Gemini-3-flash (right).

Repository path (for clones): assets/AR-LSAT_dp_vs_cotsc_gpt52_gemini3flash_social.mp4.

Quick Start

git clone https://github.com/zyni2001/draft-and-prune.git
cd draft-and-prune
pip install -r requirements.txt

# Run experiment with a provider-specific template
python main.py configs/azure-openai/config_azure_openai.yaml

The old root-level config_gpt.yaml was an Azure OpenAI template. Provider-specific templates now live under configs/.

Project Structure

├── main.py                     # Main entry point
├── config.py                   # Configuration parsing
├── reasoners.py                # Reasoning logic (CoT, two-step, one-step)
├── answer_extractors.py        # Dataset-specific answer extraction for CoT runs
├── call_api.py                 # API clients (GPT-4, Gemini)
├── data_loaders.py             # Dataset loading
├── path_level_analysis.py      # Path-level analysis
├── analysis_pruning_and_ensemble_simulation.py  # Majority voting simulation
│
├── configs/
│   ├── azure-openai/           # Azure OpenAI templates
│   ├── openai-compatible/      # OpenAI-compatible templates
│   ├── gemini/                 # Gemini templates
│   └── anthropic/              # Anthropic templates
│
├── prompts-all-3-shot-aligned/ # Prompt templates
│   └── {Dataset}-prompts-{method}/  # CoT, one-step, two-step-partition
│
├── data/                       # Datasets (AR-LSAT, ProofWriter, ProntoQA, LogicalDeduction)
└── results/                    # Experiment outputs

Analysis Workflow

Step 1: Run Experiment
  python main.py configs/azure-openai/config_azure_openai.yaml
  → results/{experiment}/summary/*.json

Step 2: Path-Level Analysis
  python path_level_analysis.py <results_dir> --expected-paths 20
  → path_level_analysis.csv

Step 3: Majority Voting Simulation
  python analysis_pruning_and_ensemble_simulation.py path_level_analysis.csv --pruning-mode both
  → simulation results with accuracy, exec_rate, exec_acc, hit_rate for k=1..20

Path-Level Analysis

# Basic usage
python path_level_analysis.py <results_dir> --expected-paths 20

# With CoT backup
python path_level_analysis.py <results_dir> --expected-paths 20 \
  --cot-summary path/to/cot_summary.txt --cot-backup

Options:

--expected-paths N: Paths per sample (default: 1)
--output-format: csv, xlsx, or both (default: csv)
--cot-summary: Add cot_correctness column from CoT results
--cot-backup: Use CoT as backup for syntax errors

Output: path_level_analysis.csv with path correctness, pruning results, and statistics.

Majority Voting Simulation

# Without pruning
python analysis_pruning_and_ensemble_simulation.py input.csv --pruning-mode no_pruning

# With pruning + CoT fallback
python analysis_pruning_and_ensemble_simulation.py input.csv --pruning-mode both --cot-backup

Options:

--pruning-mode: no_pruning, existence, uniqueness, both, or all
--cot-backup: Use CoT result when no valid symbolic output
--max-paths: Maximum k value (default: 20)
--num-simulations: Iterations per k (default: 10)

Output Metrics:

Metric	Formula	Description
`accuracy`	correct / total	Overall accuracy
`exec_rate`	exec_samples / total	Fraction with valid outputs
`exec_acc`	exec_correct / exec_samples	Accuracy among executable
`hit_rate`	hit_samples / total	Fraction with ≥1 correct path

Supported Datasets

Dataset	Source
AR-LSAT (original)	HuggingFace
AR-LSAT — cleaned 229-sample split	HuggingFace
FOLIO — cleaned (FOLIO-Refined)	HuggingFace
ProofWriter	HuggingFace
ProntoQA	HuggingFace
LogicalDeduction	HuggingFace

Config Templates

Azure OpenAI: configs/azure-openai/config_azure_openai.yaml
OpenAI-compatible: configs/openai-compatible/config_openai_compatible.yaml
Gemini: configs/gemini/config_gemini.yaml
Anthropic: configs/anthropic/config_anthropic.yaml

Citation

If you use this repository, please cite:

@article{ni2026draft,
  title={Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning},
  author={Ni, Zhiyu and Liang, Zheng and Song, Liangcheng and Cao, Chenrui and Zhang, Xian and Sangiovanni-Vincentelli, Alberto and Nuzzo, Pierluigi},
  journal={arXiv preprint arXiv:2603.17233},
  year={2026}
}

If you use the FOLIO-Refined dataset or the agentified assessment framework, please also cite:

@article{ni2026agentified,
  title={Agentified Assessment of Logical Reasoning Agents},
  author={Ni, Zhiyu and Xiao, Yifeng and Liang, Zheng},
  journal={arXiv preprint arXiv:2603.02788},
  year={2026}
}

License

[Include your license information here]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Draft-and-Prune

Results preview (AR-LSAT)

Quick Start

Project Structure

Analysis Workflow

Path-Level Analysis

Majority Voting Simulation

Supported Datasets

Config Templates

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 156 Commits
assets		assets
configs		configs
data		data
prompts-all-3-shot-aligned		prompts-all-3-shot-aligned
.gitignore		.gitignore
README.md		README.md
analysis_pruning_and_ensemble_simulation.py		analysis_pruning_and_ensemble_simulation.py
answer_extractors.py		answer_extractors.py
call_api.py		call_api.py
config.py		config.py
data_loaders.py		data_loaders.py
main.py		main.py
path_level_analysis.py		path_level_analysis.py
reasoners.py		reasoners.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Draft-and-Prune

Results preview (AR-LSAT)

Quick Start

Project Structure

Analysis Workflow

Path-Level Analysis

Majority Voting Simulation

Supported Datasets

Config Templates

Citation

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages