A neurosymbolic benchmark runner for FOLIO, a first-order logic reasoning dataset. Uses a PlanExecute + Z3 architecture in which an LLM plans the logical formalization and the Z3 theorem prover deterministically checks entailment.
FOLIO is a challenging logical reasoning benchmark from Yale. Each problem gives a set of natural language premises and a conclusion, and the task is to classify the conclusion as True, False, or Uncertain based on the premises alone.
Example problem:
Premises:
- All customers who bought a laptop also bought a charger.
- Some students are customers.
- No students bought a charger.
Conclusion: No students are customers who bought a laptop.
Label: True
The benchmark tests whether systems can perform multi-step deductive reasoning — something LLMs notoriously struggle with when relying on pattern matching alone.
LLMs are powerful pattern matchers, but logical reasoning demands soundness — every inference step must be valid, not just plausible. On FOLIO, even GPT-4 tops out at 61–69% with standard prompting techniques (zero-shot, chain-of-thought, self-consistency). Scaling the model isn't enough.
This project takes a different approach: pair a smaller open-source model with a formal theorem prover. The LLM translates natural language into first-order logic; Z3 handles the actual reasoning. The result:
| Setup | Accuracy | Split |
|---|---|---|
| GPT-4 zero-shot | 61.3% | Test |
| GPT-4 + Chain-of-Thought | 68.9% | Test |
| GPT-4 + Logic-LM (neurosymbolic) | 78.1% | Test |
| gpt-oss-120b + Z3 (this project) | 89.2% | Validation |
GPT-4 numbers are on the test split (226 problems); ours is on validation (203 problems). See detailed comparison below.
Even accounting for the split difference, the gap is large. A 120B-parameter open-source model with symbolic grounding outperforms GPT-4 across every prompting strategy — because the theorem prover guarantees that if the formalization is correct, the answer is correct. The LLM only needs to get the translation right, not the reasoning.
Model: gpt-oss-120b — a 120B-parameter open-source model served via Fireworks AI.
| Metric | Value |
|---|---|
| Overall Accuracy | 89.2% (181/203) |
| Plan Success Rate | 100.0% |
| Elapsed Time | 699.4s (~11.7 min) |
Per-Label Accuracy:
| Label | Correct / Total | Accuracy |
|---|---|---|
| Uncertain | 67 / 69 | 97.10% |
| True | 63 / 72 | 87.50% |
| False | 51 / 62 | 82.26% |
Confusion Matrix:
| Ground Truth \ Predicted | True | False | Uncertain |
|---|---|---|---|
| True | 63 | 0 | 9 |
| False | 4 | 51 | 7 |
| Uncertain | 2 | 0 | 67 |
Error pattern: The model exhibits conservative bias — 16 of 22 errors involve predicting "Uncertain" when a definitive answer (True/False) was correct.
- Two-pass normalization: An LLM preprocessing step rewrites premises to resolve vocabulary inconsistencies, coreferences, and ambiguous implication direction. Uses a two-pass strategy — solve first with original premises, retry with normalized premises only when the solver returns Uncertain — to capture normalization gains without regressing previously-correct answers.
- Decomposition examples: 14 worked examples teach the LLM common FOL translation patterns including universal/existential quantification, binary predicates, disjunctions, scope restrictions, implicit type assertions, modal hedging, distinct entities, subtype bridges, exclusive-or, neither/nor vs disjunctive negation disambiguation, and numerical comparison encoding.
- Bridge repair guards: Prevents the speculative subtype bridge mechanism from generating reverse-direction or sibling bridges that cause false entailments.
- Disjointness repair: Auto-generates mutual exclusivity constraints for Or-disjunctions.
- Premise consistency check: Detects contradictory premises to prevent ex falso quodlibet.
Test-split results (226 problems) — cross-split comparison, directional only:
| Method | Model | Accuracy |
|---|---|---|
| Human Experts | — | ~96% |
| Logic-LM | GPT-4 | 78.1% |
| DetermLR | GPT-4 | 77.5% |
| LINC | GPT-4 | 73.1% |
| Tree-of-Thought | GPT-4 | 70.0% |
| CoT + Self-Consistency | GPT-4 | 69.5% |
| Chain-of-Thought | GPT-4 | 68.9% |
| Flan-T5-Large (fine-tuned) | — | 65.9% |
| 8-shot | GPT-4 | 64.2% |
| Zero-shot | GPT-4 | 61.3% |
| 8-shot | GPT-3.5-Turbo | 58.3% |
Validation-split results (203–204 problems) — direct comparison:
| Method | Model | Accuracy | Source |
|---|---|---|---|
| PlanExecute + Z3 | gpt-oss-120b | 89.2% | Ours |
| CoT | Gemma 3 12B | 77.45% | Robustness of Neurosymbolic Reasoners (2025) |
| CoT | Qwen 2.5 32B | 75.49% | " |
| NSCoT | Qwen 2.5 32B | 71.08% | " |
| CoT | Qwen 2.5 7B | 70.59% | " |
| CoT | Llama 3.1 8B | 70.59% | " |
| LINC (neurosymbolic) | Qwen 2.5 32B | 68.14% | " |
On the same validation split, PlanExecute + Z3 outperforms both pure CoT and other neurosymbolic methods (LINC, NSCoT) by a wide margin.
- Python 3.12+
- uv — fast Python package manager
- opensymbolicai-core — the core PlanExecute framework (available on PyPI)
- An API key for at least one LLM provider
```shell
git clone https://github.com/OpenSymbolicAI/benchmark-py-folio.git
cd benchmark-py-folio
uv sync
```

Create a .env file in the project root with your API keys:
```
# At least one of these is required
FIREWORKS_API_KEY=your-key-here
GROQ_API_KEY=your-key-here
ANTHROPIC_API_KEY=your-key-here
```

```shell
# Run the benchmark with defaults (Fireworks, gpt-oss-120b, validation split)
uv run folio-bench

# Or via python
uv run python main.py
```

```
uv run folio-bench [OPTIONS]
```
```
Options:
  --model MODEL          Model name/ID
                         (default: accounts/fireworks/models/gpt-oss-120b)
  --provider PROVIDER    LLM provider: ollama | openai | anthropic | fireworks | groq
                         (default: fireworks)
  --split SPLIT          Dataset split: train | validation | test
                         (default: validation)
  --num N                Number of problems to evaluate (default: all)
  --seed SEED            Random seed for shuffling (default: 42)
  --max-plan-retries N   Max retries for invalid plans (default: 5)
  --parallel N           Number of parallel workers (default: 10)
  --no-shuffle           Do not shuffle the dataset
  --normalize            Enable two-pass premise normalization
```
```shell
# Run on full validation set with normalization (recommended)
uv run folio-bench --normalize

# Run 50 problems from the train split
uv run folio-bench --split train --num 50

# Use Anthropic Claude
uv run folio-bench --provider anthropic --model claude-sonnet-4-20250514

# Use a local Ollama model
uv run folio-bench --provider ollama --model llama3

# Run with 20 parallel workers
uv run folio-bench --parallel 20

# Deterministic order (no shuffle)
uv run folio-bench --no-shuffle
```

Results are saved to `logs/<timestamp>_<split>_<model>/`:
```
logs/20260216_014635_validation_gpt-oss-120b/
  summary.json       # Aggregate metrics, confusion matrix, config
  problem_0001.md    # Per-problem log with premises, conclusion, plan, result
  problem_0002.md
  ...
```
Contains overall accuracy, per-label breakdown, confusion matrix, plan success rate, elapsed time, and the full run configuration.
Each file contains:
- Problem premises and conclusion
- The LLM-generated Z3 plan (Python code)
- Predicted vs ground truth label
- Any errors encountered
```
User runs folio-bench
      │
      ▼
┌──────────────┐     ┌─────────────────────┐
│  Data Loader │────▶│  FOLIO Dataset (HF) │
│  (data.py)   │     │  tasksource/folio   │
└──────┬───────┘     └─────────────────────┘
       │
       ▼
┌──────────────┐
│    Runner    │  Parallel evaluation with ProcessPoolExecutor
│  (runner.py) │
└──────┬───────┘
       │ for each problem
       ▼
┌──────────────┐     ┌─────────┐
│  FOLIOAgent  │────▶│   LLM   │  Plans FOL translation
│  (agent.py)  │     └─────────┘
│              │
│  @primitive  │────▶  Z3 Solver  Executes plan deterministically
│   methods    │
└──────┬───────┘
       │
       ▼ Pass 1 result
       │
  True / False ──────▶ Return answer
       │
  Uncertain ──┐
              │ (if --normalize)
              ▼
┌──────────────────┐     ┌─────────┐
│    Normalizer    │────▶│   LLM   │  Rewrites premises
│  (preprocess.py) │     └─────────┘
└──────┬───────────┘
       │
       ▼ Pass 2: re-solve with normalized premises
       │
  True / False / Uncertain
```
The PlanExecute pattern separates concerns:
- Planning (LLM) — translate natural language premises and conclusion into Z3 primitive calls
- Execution (Z3) — deterministically evaluate entailment using a theorem prover
- Normalization (optional, LLM) — rewrites premises to resolve vocabulary inconsistencies, coreferences, and ambiguous implication direction before retrying
The two-pass strategy ensures normalization only helps, never hurts: the first pass solves with original premises, and normalization is only attempted when the solver returns Uncertain.
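The control flow amounts to a few lines. The sketch below uses stub functions in place of the real solver and normalizer (all names here are illustrative, not the project's API):

```python
from typing import Callable, Sequence

def two_pass_solve(
    premises: Sequence[str],
    conclusion: str,
    solve: Callable[[Sequence[str], str], str],       # returns "True", "False", or "Uncertain"
    normalize: Callable[[Sequence[str]], Sequence[str]],
) -> str:
    """Pass 1: solve with the original premises.
    Pass 2: only when Pass 1 is Uncertain, retry with normalized premises.
    A definitive Pass-1 answer is never overridden, so normalization cannot
    regress a previously-correct result."""
    verdict = solve(premises, conclusion)
    if verdict != "Uncertain":
        return verdict
    return solve(normalize(premises), conclusion)

# Stub demonstration: this fake solver only succeeds on the normalized wording.
def fake_solve(premises, conclusion):
    return "True" if "normalized" in premises[0] else "Uncertain"

result = two_pass_solve(["original premise"], "c", fake_solve,
                        lambda ps: ["normalized " + p for p in ps])
print(result)  # True
```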
Available Z3 primitives include sort/constant declaration, unary/binary predicates, logical connectives (and, or, not, implies, iff), quantifiers (forall, exists), and entailment checking.
```shell
# Run unit tests
uv run pytest

# With coverage
uv run pytest --cov=folio_bench
```