Skip to content

zyni2001/draft-and-prune

Repository files navigation

Draft-and-Prune

A neuro-symbolic tool that combines LLMs with symbolic solvers for logical reasoning using multi-path ensemble with sketched planning, code generation, and path pruning.

Paper (Arxiv): Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

Related work (ICLR 2026: Agents in the wild workshop): Agentified Assessment of Logical Reasoning Agents: an assessor agent runs the benchmark loop so evaluation stays reproducible through symbolic failures; the system under test only needs a standard A2A interface.

Benchmarks we released: We fixed and republished two widely used logical-reasoning benchmark — FOLIO and AR-LSAT.

Leaderboard: A live leaderboard for comparing frontier models on formal logical reasoning benchmarks such as AR-LSAT and ProverQA Hard under a unified zero-shot adaptive-agent protocol: https://zyni2001.github.io/draft-and-prune/

Results preview (AR-LSAT)

Accuracy vs. number of sampled paths (Ours vs. CoT-SC) on the cleaned AR-LSAT split, for GPT-5.2 (left) and Gemini-3-flash (right).

Repository path (for clones): assets/AR-LSAT_dp_vs_cotsc_gpt52_gemini3flash_social.mp4.

Quick Start

git clone https://github.com/zyni2001/draft-and-prune.git
cd draft-and-prune
pip install -r requirements.txt

# Run experiment with a provider-specific template
python main.py configs/azure-openai/config_azure_openai.yaml

The old root-level config_gpt.yaml was an Azure OpenAI template. Provider-specific templates now live under configs/.

Project Structure

├── main.py                     # Main entry point
├── config.py                   # Configuration parsing
├── reasoners.py                # Reasoning logic (CoT, two-step, one-step)
├── answer_extractors.py        # Dataset-specific answer extraction for CoT runs
├── call_api.py                 # API clients (GPT-4, Gemini)
├── data_loaders.py             # Dataset loading
├── path_level_analysis.py      # Path-level analysis
├── analysis_pruning_and_ensemble_simulation.py  # Majority voting simulation
│
├── configs/
│   ├── azure-openai/           # Azure OpenAI templates
│   ├── openai-compatible/      # OpenAI-compatible templates
│   ├── gemini/                 # Gemini templates
│   └── anthropic/              # Anthropic templates
│
├── prompts-all-3-shot-aligned/ # Prompt templates
│   └── {Dataset}-prompts-{method}/  # CoT, one-step, two-step-partition
│
├── data/                       # Datasets (AR-LSAT, ProofWriter, ProntoQA, LogicalDeduction)
└── results/                    # Experiment outputs

Analysis Workflow

Step 1: Run Experiment
  python main.py configs/azure-openai/config_azure_openai.yaml
  → results/{experiment}/summary/*.json

Step 2: Path-Level Analysis
  python path_level_analysis.py <results_dir> --expected-paths 20
  → path_level_analysis.csv

Step 3: Majority Voting Simulation
  python analysis_pruning_and_ensemble_simulation.py path_level_analysis.csv --pruning-mode both
  → simulation results with accuracy, exec_rate, exec_acc, hit_rate for k=1..20

Path-Level Analysis

# Basic usage
python path_level_analysis.py <results_dir> --expected-paths 20

# With CoT backup
python path_level_analysis.py <results_dir> --expected-paths 20 \
  --cot-summary path/to/cot_summary.txt --cot-backup

Options:

  • --expected-paths N: Paths per sample (default: 1)
  • --output-format: csv, xlsx, or both (default: csv)
  • --cot-summary: Add cot_correctness column from CoT results
  • --cot-backup: Use CoT as backup for syntax errors

Output: path_level_analysis.csv with path correctness, pruning results, and statistics.


Majority Voting Simulation

# Without pruning
python analysis_pruning_and_ensemble_simulation.py input.csv --pruning-mode no_pruning

# With pruning + CoT fallback
python analysis_pruning_and_ensemble_simulation.py input.csv --pruning-mode both --cot-backup

Options:

  • --pruning-mode: no_pruning, existence, uniqueness, both, or all
  • --cot-backup: Use CoT result when no valid symbolic output
  • --max-paths: Maximum k value (default: 20)
  • --num-simulations: Iterations per k (default: 10)

Output Metrics:

Metric Formula Description
accuracy correct / total Overall accuracy
exec_rate exec_samples / total Fraction with valid outputs
exec_acc exec_correct / exec_samples Accuracy among executable
hit_rate hit_samples / total Fraction with ≥1 correct path

Supported Datasets

Dataset Source
AR-LSAT (original) HuggingFace
AR-LSAT — cleaned 229-sample split HuggingFace
FOLIO — cleaned (FOLIO-Refined) HuggingFace
ProofWriter HuggingFace
ProntoQA HuggingFace
LogicalDeduction HuggingFace

Config Templates

  • Azure OpenAI: configs/azure-openai/config_azure_openai.yaml
  • OpenAI-compatible: configs/openai-compatible/config_openai_compatible.yaml
  • Gemini: configs/gemini/config_gemini.yaml
  • Anthropic: configs/anthropic/config_anthropic.yaml

Citation

If you use this repository, please cite:

@article{ni2026draft,
  title={Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning},
  author={Ni, Zhiyu and Liang, Zheng and Song, Liangcheng and Cao, Chenrui and Zhang, Xian and Sangiovanni-Vincentelli, Alberto and Nuzzo, Pierluigi},
  journal={arXiv preprint arXiv:2603.17233},
  year={2026}
}

If you use the FOLIO-Refined dataset or the agentified assessment framework, please also cite:

@article{ni2026agentified,
  title={Agentified Assessment of Logical Reasoning Agents},
  author={Ni, Zhiyu and Xiao, Yifeng and Liang, Zheng},
  journal={arXiv preprint arXiv:2603.02788},
  year={2026}
}

License

[Include your license information here]

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages