A neurosymbolic benchmark runner for FOLIO, a first-order logic reasoning dataset. Uses a PlanExecute + Z3 architecture in which an LLM plans the logical formalization and the Z3 theorem prover deterministically checks entailment.
FOLIO is a challenging logical reasoning benchmark from Yale. Each problem gives a set of natural language premises and a conclusion, and the task is to classify the conclusion as True, False, or Uncertain based on the premises alone.
Example problem:
Premises:
- All customers who bought a laptop also bought a charger.
- Some students are customers.
- No students bought a charger.
Conclusion: No students are customers who bought a laptop.
Label: True
The benchmark tests whether systems can perform multi-step deductive reasoning — something LLMs notoriously struggle with when relying on pattern matching alone.
LLMs are powerful pattern matchers, but logical reasoning demands soundness — every inference step must be valid, not just plausible. On FOLIO, even GPT-4 tops out at 61–69% with standard prompting techniques (zero-shot, chain-of-thought, self-consistency). Scaling the model isn't enough.
This project takes a different approach: pair a smaller open-source model with a formal theorem prover. The LLM translates natural language into first-order logic; Z3 handles the actual reasoning. The result:
| Setup | Accuracy | Split |
|---|---|---|
| GPT-4 zero-shot | 61.3% | Test |
| GPT-4 + Chain-of-Thought | 68.9% | Test |
| GPT-4 + Logic-LM (neurosymbolic) | 78.1% | Test |
| gpt-oss-120b + Z3 (this project) | 89.2% | Validation |
GPT-4 numbers are on the test split (226 problems); ours is on validation (203 problems). See detailed comparison below.
Even accounting for the split difference, the gap is large. A 120B-parameter open-source model with symbolic grounding outperforms GPT-4 across every prompting strategy — because the theorem prover guarantees that if the formalization is correct, the answer is correct. The LLM only needs to get the translation right, not the reasoning.
Model: gpt-oss-120b — a 120B-parameter open-source model served via Fireworks AI.
| Metric | Value |
|---|---|
| Overall Accuracy | 89.2% (181/203) |
| Plan Success Rate | 100.0% |
| Elapsed Time | 699.4s (~11.7 min) |
Per-Label Accuracy:
| Label | Correct / Total | Accuracy |
|---|---|---|
| Uncertain | 67 / 69 | 97.10% |
| True | 63 / 72 | 87.50% |
| False | 51 / 62 | 82.26% |
Confusion Matrix:
| Ground Truth \ Predicted | True | False | Uncertain |
|---|---|---|---|
| True | 63 | 0 | 9 |
| False | 4 | 51 | 7 |
| Uncertain | 2 | 0 | 67 |
Error pattern: The model exhibits conservative bias — 16 of 22 errors involve predicting "Uncertain" when a definitive answer (True/False) was correct.
- Two-pass normalization: An LLM preprocessing step rewrites premises to resolve vocabulary inconsistencies, coreferences, and ambiguous implication direction. Uses a two-pass strategy — solve first with original premises, retry with normalized premises only when the solver returns Uncertain — to capture normalization gains without regressing previously-correct answers.
- Decomposition examples: 14 worked examples teach the LLM common FOL translation patterns including universal/existential quantification, binary predicates, disjunctions, scope restrictions, implicit type assertions, modal hedging, distinct entities, subtype bridges, exclusive-or, neither/nor vs disjunctive negation disambiguation, and numerical comparison encoding.
- Bridge repair guards: Prevents the speculative subtype bridge mechanism from generating reverse-direction or sibling bridges that cause false entailments.
- Disjointness repair: Auto-generates mutual exclusivity constraints for Or-disjunctions.
- Premise consistency check: Detects contradictory premises to prevent ex falso quodlibet.
Test-split results (226 problems) — cross-split comparison, directional only:
| Method | Model | Accuracy |
|---|---|---|
| Human Experts | — | ~96% |
| Logic-LM | GPT-4 | 78.1% |
| DetermLR | GPT-4 | 77.5% |
| LINC | GPT-4 | 73.1% |
| Tree-of-Thought | GPT-4 | 70.0% |
| CoT + Self-Consistency | GPT-4 | 69.5% |
| Chain-of-Thought | GPT-4 | 68.9% |
| Flan-T5-Large (fine-tuned) | — | 65.9% |
| 8-shot | GPT-4 | 64.2% |
| Zero-shot | GPT-4 | 61.3% |
| 8-shot | GPT-3.5-Turbo | 58.3% |
Validation-split results (203–204 problems) — direct comparison:
| Method | Model | Accuracy | Source |
|---|---|---|---|
| PlanExecute + Z3 | gpt-oss-120b | 89.2% | Ours |
| CoT | Gemma 3 12B | 77.45% | Robustness of Neurosymbolic Reasoners (2025) |
| CoT | Qwen 2.5 32B | 75.49% | " |
| NSCoT | Qwen 2.5 32B | 71.08% | " |
| CoT | Qwen 2.5 7B | 70.59% | " |
| CoT | Llama 3.1 8B | 70.59% | " |
| LINC (neurosymbolic) | Qwen 2.5 32B | 68.14% | " |
On the same validation split, PlanExecute + Z3 outperforms both pure CoT and other neurosymbolic methods (LINC, NSCoT) by a wide margin.
- Python 3.12+
- uv — fast Python package manager
- opensymbolicai-core — the core PlanExecute framework (available on PyPI)
- An API key for at least one LLM provider
```shell
git clone https://github.com/OpenSymbolicAI/benchmark-py-folio.git
cd benchmark-py-folio
uv sync
```

Create a .env file in the project root with your API keys:
```
# At least one of these is required
FIREWORKS_API_KEY=your-key-here
GROQ_API_KEY=your-key-here
ANTHROPIC_API_KEY=your-key-here
```

```shell
# Run the benchmark with defaults (Fireworks, gpt-oss-120b, validation split)
uv run folio-bench

# Or via python
uv run python main.py
```

```
uv run folio-bench [OPTIONS]
```
```
Options:
  --model MODEL          Model name/ID
                         (default: accounts/fireworks/models/gpt-oss-120b)
  --provider PROVIDER    LLM provider: ollama | openai | anthropic | fireworks | groq
                         (default: fireworks)
  --split SPLIT          Dataset split: train | validation | test
                         (default: validation)
  --num N                Number of problems to evaluate (default: all)
  --seed SEED            Random seed for shuffling (default: 42)
  --max-plan-retries N   Max retries for invalid plans (default: 5)
  --parallel N           Number of parallel workers (default: 10)
  --no-shuffle           Do not shuffle the dataset
  --normalize            Enable two-pass premise normalization
```
```shell
# Run on full validation set with normalization (recommended)
uv run folio-bench --normalize

# Run 50 problems from the train split
uv run folio-bench --split train --num 50

# Use Anthropic Claude
uv run folio-bench --provider anthropic --model claude-sonnet-4-20250514

# Use a local Ollama model
uv run folio-bench --provider ollama --model llama3

# Run with 20 parallel workers
uv run folio-bench --parallel 20

# Deterministic order (no shuffle)
uv run folio-bench --no-shuffle
```

Results are saved to `logs/<timestamp>_<split>_<model>/`:
```
logs/20260216_014635_validation_gpt-oss-120b/
  summary.json       # Aggregate metrics, confusion matrix, config
  problem_0001.md    # Per-problem log with premises, conclusion, plan, result
  problem_0002.md
  ...
```
Contains overall accuracy, per-label breakdown, confusion matrix, plan success rate, elapsed time, and the full run configuration.
Each file contains:
- Problem premises and conclusion
- The LLM-generated Z3 plan (Python code)
- Predicted vs ground truth label
- Any errors encountered
```
User runs folio-bench
      │
      ▼
┌──────────────┐     ┌─────────────────────┐
│  Data Loader │────▶│  FOLIO Dataset (HF) │
│  (data.py)   │     │  tasksource/folio   │
└──────┬───────┘     └─────────────────────┘
       │
       ▼
┌──────────────┐
│    Runner    │  Parallel evaluation with ProcessPoolExecutor
│  (runner.py) │
└──────┬───────┘
       │ for each problem
       ▼
┌──────────────┐     ┌─────────┐
│  FOLIOAgent  │────▶│   LLM   │  Plans FOL translation
│  (agent.py)  │     └─────────┘
│              │
│  @primitive  │────▶  Z3 Solver  Executes plan deterministically
│   methods    │
└──────┬───────┘
       │
       ▼ Pass 1 result
       │
  True / False ──────▶ Return answer
       │
  Uncertain ──┐
              │ (if --normalize)
              ▼
┌──────────────────┐     ┌─────────┐
│    Normalizer    │────▶│   LLM   │  Rewrites premises
│  (preprocess.py) │     └─────────┘
└──────┬───────────┘
       │
       ▼ Pass 2: re-solve with normalized premises
       │
  True / False / Uncertain
```
The PlanExecute pattern separates concerns:
- Planning (LLM) — translate natural language premises and conclusion into Z3 primitive calls
- Execution (Z3) — deterministically evaluate entailment using a theorem prover
- Normalization (optional, LLM) — rewrites premises to resolve vocabulary inconsistencies, coreferences, and ambiguous implication direction before retrying
The two-pass strategy ensures normalization only helps, never hurts: the first pass solves with original premises, and normalization is only attempted when the solver returns Uncertain.
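The control flow amounts to a few lines. The sketch below uses stub functions in place of the real solver and normalizer (all names here are illustrative, not the project's API):

```python
from typing import Callable, Sequence

def two_pass_solve(
    premises: Sequence[str],
    conclusion: str,
    solve: Callable[[Sequence[str], str], str],       # returns "True", "False", or "Uncertain"
    normalize: Callable[[Sequence[str]], Sequence[str]],
) -> str:
    """Pass 1: solve with the original premises.
    Pass 2: only when Pass 1 is Uncertain, retry with normalized premises.
    A definitive Pass-1 answer is never overridden, so normalization cannot
    regress a previously-correct result."""
    verdict = solve(premises, conclusion)
    if verdict != "Uncertain":
        return verdict
    return solve(normalize(premises), conclusion)

# Stub demonstration: this fake solver only succeeds on the normalized wording.
def fake_solve(premises, conclusion):
    return "True" if "normalized" in premises[0] else "Uncertain"

result = two_pass_solve(["original premise"], "c", fake_solve,
                        lambda ps: ["normalized " + p for p in ps])
print(result)  # True
```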
Available Z3 primitives include sort/constant declaration, unary/binary predicates, logical connectives (and, or, not, implies, iff), quantifiers (forall, exists), and entailment checking.
```shell
# Run unit tests
uv run pytest

# With coverage
uv run pytest --cov=folio_bench
```