Logical Consistency Under Pressure: Probing and Repairing Cross-Query Contradictions in LLMs
ICLR 2026 Workshop on Logical Reasoning of Large Language Models
LLMs answer individual logic questions with reasonable accuracy, yet frequently contradict themselves across logically related queries. ConsistencyBench is a benchmark of 493 question sets (1,904 questions) spanning six categories of logical reasoning, designed to measure cross-query logical consistency.
We evaluate 18 frontier LLMs and find that even the best model (GPT-4.1) achieves only 46.7% set-level consistency despite 83.0% individual accuracy, revealing consistency gaps of 36-57 percentage points. We propose Consistency-Guided Decoding (CGD), a training-free inference-time method that improves consistency for 16/17 models (+6.6pp mean, up to +19.7pp).
| Model | Individual Accuracy | Set Consistency | Gap |
|---|---|---|---|
| GPT-4.1 | 83.0% | 46.7% | 36.4pp |
| Qwen 2.5 72B | 75.0% | 32.7% | 42.3pp |
| Claude Opus 4.6 | 74.8% | 23.3% | 51.4pp |
| DeepSeek-R1 | 70.2% | 21.7% | 48.5pp |
| GPT-4o | 70.8% | 20.0% | 50.8pp |
| GPT-5.2 | 69.3% | 12.0% | 57.3pp |
| o3 | 67.2% | 12.0% | 55.2pp |
| GPT-5 | 62.9% | 9.3% | 53.5pp |
Full results for all 18 models are in the paper.
CGD improves consistency broadly:
- 16/17 models improved (94%)
- Mean improvement: +6.6pp SCR
- Best improvement: +19.7pp (GPT-4o)
- Also improves individual accuracy by +2.8pp on average
| Category | Pattern | Sets | Fallacy Tested |
|---|---|---|---|
| Contrapositive | P->Q equiv ~Q->~P | 85 | Affirming the consequent |
| Transitivity | A->B, B->C \|- A->C | 85 | - |
| Syllogistic | forall x: A(x)->B(x) | 85 | Converse error |
| Negation | P vs ~P | 85 | Negation incoherence |
| Modus Tollens | P->Q, ~Q \|- ~P | 85 | - |
| Commonsense | Everyday entailment | 68 | Reverse entailment |
```
ConsistencyBench/
├── data/                              # Benchmark data
│   ├── consistency_bench.json        # Full benchmark (493 sets, 1,904 questions)
│   └── consistency_bench_sample.json # Evaluation sample (300 sets)
├── src/                               # Source code
│   ├── generate_benchmark.py         # Benchmark generation
│   ├── evaluate.py                   # Evaluation pipeline (vanilla, CoT, CGD)
│   ├── run_all.py                    # Run all evaluations
│   ├── analyze_results.py            # Results analysis and visualization
│   └── update_paper.py               # Generate LaTeX tables and figures from results
├── results/                           # All 37 evaluation result files (JSON)
│   ├── {model}_vanilla.json          # 18 direct prompting results
│   ├── {model}_cgd.json              # 17 CGD results
│   └── {model}_cot.json              # 2 chain-of-thought results
├── eval_logs/                         # Evaluation logs for all 37 runs
├── figures/                           # Generated figures (PDF + PNG)
│   ├── figure1_overview.pdf          # Figure 1: consistency problem illustration
│   ├── cgd_pipeline.pdf              # CGD pipeline diagram
│   ├── category_heatmap.pdf          # Per-category SCR heatmap
│   └── consistency_gap.pdf           # Consistency gap visualization
├── scripts/                           # Figure generation scripts
│   ├── generate_figure1.py           # Figure 1 (matplotlib)
│   └── generate_cgd_diagram.py       # CGD pipeline diagram (matplotlib)
├── final-paper/                       # ICLR 2026 workshop paper
│   ├── main.tex                      # Paper source
│   ├── main.pdf                      # Compiled paper
│   ├── references.bib                # Bibliography
│   └── ...                           # Style files and supporting TeX
└── LICENSE
```
- Individual Accuracy (IA): Fraction of individual questions answered correctly.
- Pairwise Consistency Rate (PCR): Fraction of within-set question pairs where both answers are correct.
- Set-Level Consistency Rate (SCR): Fraction of sets in which every question is answered correctly.
- Consistency Gap: IA - SCR. Measures how much single-query accuracy overestimates logical reasoning.
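The four metrics can be computed directly from per-question correctness. A minimal sketch, assuming each set is represented as a list of booleans marking whether each question was answered correctly (the JSON field names in `consistency_bench.json` may differ):

```python
from itertools import combinations

def consistency_metrics(sets):
    """Compute IA, PCR, SCR, and the consistency gap.

    `sets` is a list of question sets; each set is a list of booleans,
    one per question, True if that question was answered correctly.
    """
    n_questions = sum(len(s) for s in sets)
    ia = sum(sum(s) for s in sets) / n_questions           # Individual Accuracy
    pairs = [a and b for s in sets for a, b in combinations(s, 2)]
    pcr = sum(pairs) / len(pairs)                          # Pairwise Consistency Rate
    scr = sum(all(s) for s in sets) / len(sets)            # Set-Level Consistency Rate
    return {"IA": ia, "PCR": pcr, "SCR": scr, "Gap": ia - scr}

# Toy example: two 4-question sets, one fully correct, one with a single miss.
m = consistency_metrics([[True, True, True, True], [True, True, False, True]])
```

Note how a single wrong answer zeroes out an entire set for SCR while barely moving IA, which is exactly why the gap grows with set size.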
CGD is a training-free, model-agnostic inference-time method:
- Generate: The LLM produces an initial answer for each question.
- Check: An NLI checker (GPT-4o-mini) tests the answer against all prior answers for contradictions.
- Repair: If a contradiction is found, a revision prompt with the conflicting answer and shared premise elicits a corrected response.
CGD works with any LLM (including black-box APIs), requires no fine-tuning, and adds minimal overhead.
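The generate/check/repair loop above can be sketched as follows. The three callables stand in for the target LLM, the NLI checker (GPT-4o-mini in the paper), and the revision prompt; their names and signatures are illustrative, not the actual `evaluate.py` API:

```python
def consistency_guided_decoding(questions, generate, check_contradiction, revise):
    """Sketch of the CGD loop: generate -> check -> repair.

    generate(question)                 -> answer string from the target LLM
    check_contradiction(ans, prior)    -> True if the NLI checker flags a conflict
    revise(question, ans, prior)       -> corrected answer from a revision prompt
    """
    answers = []
    for q in questions:
        ans = generate(q)
        # Check the new answer against every prior answer in the set.
        conflict = next((p for p in answers if check_contradiction(ans, p)), None)
        if conflict is not None:
            # Repair: re-prompt with the conflicting answer and shared premise.
            ans = revise(q, ans, conflict)
        answers.append(ans)
    return answers
```

Because the loop only consumes answer strings, it wraps any black-box API without touching model weights, which is what makes CGD training-free and model-agnostic.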
18 frontier LLMs across 6 families:
- OpenAI: GPT-5.2, GPT-5, GPT-5 Mini, GPT-4.1, GPT-4o, o3
- Anthropic: Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5, Claude 3.5 Sonnet
- Google: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.0 Flash
- DeepSeek: DeepSeek V3.2, DeepSeek-R1
- Meta: Llama 3.3 70B, Llama 3.1 70B
- Qwen: Qwen 2.5 72B
```bibtex
@inproceedings{bansal2026consistency,
  title={Logical Consistency Under Pressure: Probing and Repairing Cross-Query Contradictions in {LLMs}},
  author={Bansal, Aayam},
  booktitle={ICLR 2026 Workshop on Logical Reasoning of Large Language Models},
  year={2026}
}
```

This project is licensed under the MIT License. See LICENSE for details.