A neuro-symbolic tool that combines LLMs with symbolic solvers for logical reasoning using multi-path ensemble with sketched planning, code generation, and path pruning.
Paper (Arxiv): Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning
Related work (ICLR 2026: Agents in the wild workshop): Agentified Assessment of Logical Reasoning Agents: an assessor agent runs the benchmark loop so evaluation stays reproducible through symbolic failures; the system under test only needs a standard A2A interface.
Benchmarks we released: We fixed and republished two widely used logical-reasoning benchmark — FOLIO and AR-LSAT.
Leaderboard: A live leaderboard for comparing frontier models on formal logical reasoning benchmarks such as AR-LSAT and ProverQA Hard under a unified zero-shot adaptive-agent protocol: https://zyni2001.github.io/draft-and-prune/
Accuracy vs. number of sampled paths (Ours vs. CoT-SC) on the cleaned AR-LSAT split, for GPT-5.2 (left) and Gemini-3-flash (right).
Repository path (for clones): assets/AR-LSAT_dp_vs_cotsc_gpt52_gemini3flash_social.mp4.
git clone https://github.com/zyni2001/draft-and-prune.git
cd draft-and-prune
pip install -r requirements.txt
# Run experiment with a provider-specific template
python main.py configs/azure-openai/config_azure_openai.yamlThe old root-level config_gpt.yaml was an Azure OpenAI template. Provider-specific templates now live under configs/.
├── main.py # Main entry point
├── config.py # Configuration parsing
├── reasoners.py # Reasoning logic (CoT, two-step, one-step)
├── answer_extractors.py # Dataset-specific answer extraction for CoT runs
├── call_api.py # API clients (GPT-4, Gemini)
├── data_loaders.py # Dataset loading
├── path_level_analysis.py # Path-level analysis
├── analysis_pruning_and_ensemble_simulation.py # Majority voting simulation
│
├── configs/
│ ├── azure-openai/ # Azure OpenAI templates
│ ├── openai-compatible/ # OpenAI-compatible templates
│ ├── gemini/ # Gemini templates
│ └── anthropic/ # Anthropic templates
│
├── prompts-all-3-shot-aligned/ # Prompt templates
│ └── {Dataset}-prompts-{method}/ # CoT, one-step, two-step-partition
│
├── data/ # Datasets (AR-LSAT, ProofWriter, ProntoQA, LogicalDeduction)
└── results/ # Experiment outputs
Step 1: Run Experiment
python main.py configs/azure-openai/config_azure_openai.yaml
→ results/{experiment}/summary/*.json
Step 2: Path-Level Analysis
python path_level_analysis.py <results_dir> --expected-paths 20
→ path_level_analysis.csv
Step 3: Majority Voting Simulation
python analysis_pruning_and_ensemble_simulation.py path_level_analysis.csv --pruning-mode both
→ simulation results with accuracy, exec_rate, exec_acc, hit_rate for k=1..20
# Basic usage
python path_level_analysis.py <results_dir> --expected-paths 20
# With CoT backup
python path_level_analysis.py <results_dir> --expected-paths 20 \
--cot-summary path/to/cot_summary.txt --cot-backupOptions:
--expected-paths N: Paths per sample (default: 1)--output-format: csv, xlsx, or both (default: csv)--cot-summary: Addcot_correctnesscolumn from CoT results--cot-backup: Use CoT as backup for syntax errors
Output: path_level_analysis.csv with path correctness, pruning results, and statistics.
# Without pruning
python analysis_pruning_and_ensemble_simulation.py input.csv --pruning-mode no_pruning
# With pruning + CoT fallback
python analysis_pruning_and_ensemble_simulation.py input.csv --pruning-mode both --cot-backupOptions:
--pruning-mode:no_pruning,existence,uniqueness,both, orall--cot-backup: Use CoT result when no valid symbolic output--max-paths: Maximum k value (default: 20)--num-simulations: Iterations per k (default: 10)
Output Metrics:
| Metric | Formula | Description |
|---|---|---|
accuracy |
correct / total | Overall accuracy |
exec_rate |
exec_samples / total | Fraction with valid outputs |
exec_acc |
exec_correct / exec_samples | Accuracy among executable |
hit_rate |
hit_samples / total | Fraction with ≥1 correct path |
| Dataset | Source |
|---|---|
| AR-LSAT (original) | HuggingFace |
| AR-LSAT — cleaned 229-sample split | HuggingFace |
| FOLIO — cleaned (FOLIO-Refined) | HuggingFace |
| ProofWriter | HuggingFace |
| ProntoQA | HuggingFace |
| LogicalDeduction | HuggingFace |
- Azure OpenAI:
configs/azure-openai/config_azure_openai.yaml - OpenAI-compatible:
configs/openai-compatible/config_openai_compatible.yaml - Gemini:
configs/gemini/config_gemini.yaml - Anthropic:
configs/anthropic/config_anthropic.yaml
If you use this repository, please cite:
@article{ni2026draft,
title={Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning},
author={Ni, Zhiyu and Liang, Zheng and Song, Liangcheng and Cao, Chenrui and Zhang, Xian and Sangiovanni-Vincentelli, Alberto and Nuzzo, Pierluigi},
journal={arXiv preprint arXiv:2603.17233},
year={2026}
}If you use the FOLIO-Refined dataset or the agentified assessment framework, please also cite:
@article{ni2026agentified,
title={Agentified Assessment of Logical Reasoning Agents},
author={Ni, Zhiyu and Xiao, Yifeng and Liang, Zheng},
journal={arXiv preprint arXiv:2603.02788},
year={2026}
}[Include your license information here]