FOLIO First-Order Logic Benchmark

A neurosymbolic benchmark runner for the FOLIO (First-Order Logic Inference and Reasoning) dataset. Uses a PlanExecute + Z3 architecture where an LLM plans the logical formalization and a Z3 theorem prover deterministically checks entailment.

What is FOLIO?

FOLIO is a challenging logical reasoning benchmark from Yale. Each problem gives a set of natural language premises and a conclusion, and the task is to classify the conclusion as True, False, or Uncertain based on the premises alone.

Example problem:

Premises:

  1. All customers who bought a laptop also bought a charger.
  2. Some students are customers who bought a laptop.
  3. No students bought a charger.

Conclusion: No students are customers who bought a laptop.

Label: True

The benchmark tests whether systems can perform multi-step deductive reasoning — something LLMs notoriously struggle with when relying on pattern matching alone.

Why Neurosymbolic?

LLMs are powerful pattern matchers, but logical reasoning demands soundness — every inference step must be valid, not just plausible. On FOLIO, even GPT-4 tops out at 61–69% with standard prompting techniques (zero-shot, chain-of-thought, self-consistency). Scaling the model isn't enough.

This project takes a different approach: pair a smaller open-source model with a formal theorem prover. The LLM translates natural language into first-order logic; Z3 handles the actual reasoning. The result:

| Setup | Accuracy | Split |
|---|---|---|
| GPT-4 zero-shot | 61.3% | Test |
| GPT-4 + Chain-of-Thought | 68.9% | Test |
| GPT-4 + Logic-LM (neurosymbolic) | 78.1% | Test |
| gpt-oss-120b + Z3 (this project) | 89.2% | Validation |

GPT-4 numbers are on the test split (226 problems); ours is on validation (203 problems). See detailed comparison below.

Even accounting for the split difference, the gap is large. A 120B-parameter open-source model with symbolic grounding outperforms GPT-4 across every prompting strategy — because the theorem prover guarantees that if the formalization is correct, the answer is correct. The LLM only needs to get the translation right, not the reasoning.

Results

Validation Run (Full Validation Set)

Model: gpt-oss-120b — a 120B-parameter open-source model served via Fireworks AI.

| Metric | Value |
|---|---|
| Overall Accuracy | 89.2% (181/203) |
| Plan Success Rate | 100.0% |
| Elapsed Time | 699.4 s (~11.7 min) |

Per-Label Accuracy:

| Label | Correct / Total | Accuracy |
|---|---|---|
| Uncertain | 67 / 69 | 97.10% |
| True | 63 / 72 | 87.50% |
| False | 51 / 62 | 82.26% |

Confusion Matrix:

| Ground Truth \ Predicted | True | False | Uncertain |
|---|---|---|---|
| True | 63 | 0 | 9 |
| False | 4 | 51 | 7 |
| Uncertain | 2 | 0 | 67 |

Error pattern: The model exhibits conservative bias — 16 of 22 errors involve predicting "Uncertain" when a definitive answer (True/False) was correct.
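The headline figures can be recomputed from the confusion matrix above (plain Python, just to make the arithmetic explicit):

```python
# Ground truth -> {predicted label: count}, from the confusion matrix above.
confusion = {
    "True":      {"True": 63, "False": 0,  "Uncertain": 9},
    "False":     {"True": 4,  "False": 51, "Uncertain": 7},
    "Uncertain": {"True": 2,  "False": 0,  "Uncertain": 67},
}

total = sum(sum(row.values()) for row in confusion.values())      # 203 problems
correct = sum(confusion[label][label] for label in confusion)     # 181 on the diagonal
overall = correct / total                                         # overall accuracy
errors = total - correct

# Errors where a definitive label was softened to "Uncertain"
conservative = confusion["True"]["Uncertain"] + confusion["False"]["Uncertain"]

print(f"overall={overall:.1%}, conservative errors={conservative}/{errors}")
```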

Key Techniques

  • Two-pass normalization: An LLM preprocessing step rewrites premises to resolve vocabulary inconsistencies, coreferences, and ambiguous implication direction. Uses a two-pass strategy — solve first with original premises, retry with normalized premises only when the solver returns Uncertain — to capture normalization gains without regressing previously-correct answers.
  • Decomposition examples: 14 worked examples teach the LLM common FOL translation patterns including universal/existential quantification, binary predicates, disjunctions, scope restrictions, implicit type assertions, modal hedging, distinct entities, subtype bridges, exclusive-or, neither/nor vs disjunctive negation disambiguation, and numerical comparison encoding.
  • Bridge repair guards: Prevents the speculative subtype bridge mechanism from generating reverse-direction or sibling bridges that cause false entailments.
  • Disjointness repair: Auto-generates mutual exclusivity constraints for Or-disjunctions.
  • Premise consistency check: Detects contradictory premises to prevent ex falso quodlibet.
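The two-pass strategy in the first bullet reduces to a few lines of control flow. A sketch, with `solve` and `normalize` as hypothetical stand-ins for the real solver and normalizer:

```python
def solve_two_pass(premises, conclusion, solve, normalize):
    """Two-pass sketch: keep the first-pass verdict unless it is
    Uncertain, and only then retry with normalized premises."""
    verdict = solve(premises, conclusion)            # pass 1: original premises
    if verdict != "Uncertain":
        return verdict                               # never regress a definitive answer
    return solve(normalize(premises), conclusion)    # pass 2: normalized retry


# Toy stand-ins: a "solver" that only succeeds on lowercased premises.
def toy_solve(premises, conclusion):
    return "True" if all(p == p.lower() for p in premises) else "Uncertain"

def toy_normalize(premises):
    return [p.lower() for p in premises]

print(solve_two_pass(["All A Are B"], "c is B", toy_solve, toy_normalize))
```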

Comparison with Published Results

Test-split results (226 problems) — cross-split comparison, directional only:

| Method | Model | Accuracy |
|---|---|---|
| Human Experts | — | ~96% |
| Logic-LM | GPT-4 | 78.1% |
| DetermLR | GPT-4 | 77.5% |
| LINC | GPT-4 | 73.1% |
| Tree-of-Thought | GPT-4 | 70.0% |
| CoT + Self-Consistency | GPT-4 | 69.5% |
| Chain-of-Thought | GPT-4 | 68.9% |
| Fine-tuned | Flan-T5-Large | 65.9% |
| 8-shot | GPT-4 | 64.2% |
| Zero-shot | GPT-4 | 61.3% |
| 8-shot | GPT-3.5-Turbo | 58.3% |

Validation-split results (203–204 problems) — direct comparison:

| Method | Model | Accuracy | Source |
|---|---|---|---|
| PlanExecute + Z3 | gpt-oss-120b | 89.2% | Ours |
| CoT | Gemma 3 12B | 77.45% | Robustness of Neurosymbolic Reasoners (2025) |
| CoT | Qwen 2.5 32B | 75.49% | " |
| NSCoT | Qwen 2.5 32B | 71.08% | " |
| CoT | Qwen 2.5 7B | 70.59% | " |
| CoT | Llama 3.1 8B | 70.59% | " |
| LINC (neurosymbolic) | Qwen 2.5 32B | 68.14% | " |

On the same validation split, PlanExecute + Z3 outperforms both pure CoT and other neurosymbolic methods (LINC, NSCoT) by a wide margin.

Prerequisites

  • Python 3.12+
  • uv — fast Python package manager
  • opensymbolicai-core — the core PlanExecute framework (available on PyPI)
  • An API key for at least one LLM provider

Installation

```bash
git clone https://github.com/OpenSymbolicAI/benchmark-py-folio.git
cd benchmark-py-folio
uv sync
```

Configuration

Create a .env file in the project root with your API keys:

```
# At least one of these is required
FIREWORKS_API_KEY=your-key-here
GROQ_API_KEY=your-key-here
ANTHROPIC_API_KEY=your-key-here
```

Usage

Quick Start

```bash
# Run the benchmark with defaults (Fireworks, gpt-oss-120b, validation split)
uv run folio-bench

# Or via python
uv run python main.py
```

CLI Options

```
uv run folio-bench [OPTIONS]

Options:
  --model MODEL          Model name/ID
                         (default: accounts/fireworks/models/gpt-oss-120b)
  --provider PROVIDER    LLM provider: ollama | openai | anthropic | fireworks | groq
                         (default: fireworks)
  --split SPLIT          Dataset split: train | validation | test
                         (default: validation)
  --num N                Number of problems to evaluate (default: all)
  --seed SEED            Random seed for shuffling (default: 42)
  --max-plan-retries N   Max retries for invalid plans (default: 5)
  --parallel N           Number of parallel workers (default: 10)
  --no-shuffle           Do not shuffle the dataset
  --normalize            Enable two-pass premise normalization
```

Examples

```bash
# Run on full validation set with normalization (recommended)
uv run folio-bench --normalize

# Run 50 problems from the train split
uv run folio-bench --split train --num 50

# Use Anthropic Claude
uv run folio-bench --provider anthropic --model claude-sonnet-4-20250514

# Use a local Ollama model
uv run folio-bench --provider ollama --model llama3

# Run with 20 parallel workers
uv run folio-bench --parallel 20

# Deterministic order (no shuffle)
uv run folio-bench --no-shuffle
```

Output

Results are saved to logs/<timestamp>_<split>_<model>/:

```
logs/20260216_014635_validation_gpt-oss-120b/
  summary.json          # Aggregate metrics, confusion matrix, config
  problem_0001.md       # Per-problem log with premises, conclusion, plan, result
  problem_0002.md
  ...
```

summary.json

Contains overall accuracy, per-label breakdown, confusion matrix, plan success rate, elapsed time, and the full run configuration.
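For example, the headline numbers can be pulled back out with a few lines of Python (the keys shown are illustrative; check a real summary.json for the exact schema):

```python
import json

# Illustrative payload only -- a mirror of the fields described above,
# not necessarily the exact schema the runner writes.
raw = json.dumps({
    "overall_accuracy": 0.892,
    "plan_success_rate": 1.0,
    "elapsed_seconds": 699.4,
    "per_label_accuracy": {"True": 0.8750, "False": 0.8226, "Uncertain": 0.9710},
})

summary = json.loads(raw)
weakest = min(summary["per_label_accuracy"], key=summary["per_label_accuracy"].get)
print(f"accuracy={summary['overall_accuracy']:.1%}, weakest label={weakest}")
```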

problem_NNNN.md

Each file contains:

  • Problem premises and conclusion
  • The LLM-generated Z3 plan (Python code)
  • Predicted vs ground truth label
  • Any errors encountered

Architecture

```
User runs folio-bench
        │
        ▼
  ┌───────────────┐     ┌──────────────────────┐
  │  Data Loader  │────▶│  FOLIO Dataset (HF)  │
  │  (data.py)    │     │  tasksource/folio    │
  └───────┬───────┘     └──────────────────────┘
          │
          ▼
  ┌───────────────┐
  │  Runner       │  Parallel evaluation with ProcessPoolExecutor
  │  (runner.py)  │
  └───────┬───────┘
          │  for each problem
          ▼
  ┌───────────────┐     ┌─────────┐
  │  FOLIOAgent   │────▶│   LLM   │  Plans FOL translation
  │  (agent.py)   │     └─────────┘
  │               │
  │  @primitive   │────▶ Z3 Solver  Executes plan deterministically
  │  methods      │
  └───────┬───────┘
          │
          ▼  Pass 1 result
          │
     True / False ──────▶ Return answer
          │
     Uncertain ──┐
                 │  (if --normalize)
                 ▼
  ┌───────────────────┐     ┌─────────┐
  │  Normalizer       │────▶│   LLM   │  Rewrites premises
  │  (preprocess.py)  │     └─────────┘
  └───────┬───────────┘
          │
          ▼  Pass 2: re-solve with normalized premises
          │
     True / False / Uncertain
```

The PlanExecute pattern separates concerns:

  1. Planning (LLM) — translate natural language premises and conclusion into Z3 primitive calls
  2. Execution (Z3) — deterministically evaluate entailment using a theorem prover
  3. Normalization (optional, LLM) — rewrites premises to resolve vocabulary inconsistencies, coreferences, and ambiguous implication direction before retrying

The two-pass strategy ensures normalization only helps, never hurts: the first pass solves with original premises, and normalization is only attempted when the solver returns Uncertain.

Available Z3 primitives include sort/constant declaration, unary/binary predicates, logical connectives (and, or, not, implies, iff), quantifiers (forall, exists), and entailment checking.

Tests

```bash
# Run unit tests
uv run pytest

# With coverage
uv run pytest --cov=folio_bench
```
