Add Logic Stress Stress-test Suite (v2, v3) by 14H034160212 · Pull Request #1622 · openai/evals

14H034160212 · 2026-02-16T05:40:28Z

This PR adds two datasets (Variant 2 and Variant 3) from a logical reasoning robustness study. Variant 3 (Contradiction) evaluates 'Logic Inertia' in LLMs, where models often fail by blindly following rules despite explicit contradictions.

…r double-blind submission

The abstract footnote previously linked to github.com/14H034160212/lemo and the OpenAI/Evals PR github.com/openai/evals/pull/1622, both of which would reveal the submitting author's GitHub identity to reviewers. Replace with the anonymous mirror at https://anonymous.4open.science/r/lemo-D708/README.md and drop the PR link entirely. The original links can be reinstated under \if@anonymous in the camera-ready / preprint build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

14H034160212 · 2026-05-22T05:43:37Z

Hi maintainers,

A quick note on the failing CI: the check_files workflow is failing at the Install dependencies step with this error:

ERROR: Could not find a version that satisfies the requirement thinc<8.4.0,>=8.3.12
ERROR: Failed to build 'spacy' when installing build dependencies for spacy

This is a pre-existing upstream dependency issue, unrelated to the contents of this PR:

pyproject.toml declares requires-python = ">=3.9" and depends on spacy-universal-sentence-encoder, which transitively pulls in the latest spacy.
The latest spacy requires thinc>=8.3.12, but thinc>=8.3.12 requires Python>=3.10.
The CI workflow (.github/workflows/test_eval.yaml) pins Python 3.9, so pip install -e . cannot resolve a compatible spacy build environment.
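To make the conflict above concrete, here is a minimal sketch in Python. It is not how pip resolves dependencies; it only restates the three version floors claimed in this comment as constants and compares them numerically. The function and constant names are hypothetical.

```python
# Illustrative only: the floors below restate the claims in this comment,
# they are not read from pyproject.toml or from package metadata.

def parse_version(text):
    """Turn '3.10' into (3, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def satisfies_minimum(python_version, minimum):
    """True if python_version is at least the given minimum."""
    return parse_version(python_version) >= parse_version(minimum)

CI_PYTHON = "3.9"            # pinned in .github/workflows/test_eval.yaml
PROJECT_MIN_PYTHON = "3.9"   # requires-python in pyproject.toml
THINC_MIN_PYTHON = "3.10"    # floor for thinc>=8.3.12, as stated above

def can_install(python_version):
    """Report which floors the given interpreter version meets."""
    return {
        "project": satisfies_minimum(python_version, PROJECT_MIN_PYTHON),
        "thinc>=8.3.12": satisfies_minimum(python_version, THINC_MIN_PYTHON),
    }

print(can_install(CI_PYTHON))  # project floor met, thinc floor not met
print(can_install("3.12"))     # both floors met
```

The tuple comparison matters: compared as plain strings, "3.9" sorts after "3.10", which would hide the problem.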

This has been a known issue since November 2024:

#1567 ("Project installation fails: tensorflow conflicting dependencies"): open, same root cause
#1648 ("Update Python version to 3.12 and refresh PR template"): open, proposes bumping the CI to Python 3.12, which would fix this

This PR itself only adds four new files under evals/registry/data/logic_stress/ and evals/registry/evals/, matching the pattern of previously merged eval-add PRs such as #648 and #1268. It contains no code changes that could affect dependency resolution.
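For anyone who wants to check that claim mechanically, a rough sketch follows. It assumes the list of changed paths has been obtained separately (for example from the PR's file list); the file names in the usage example are hypothetical placeholders, not the actual files in this PR.

```python
from pathlib import PurePosixPath

# Directories this PR is said to touch, taken from the comment above.
ALLOWED_PREFIXES = (
    PurePosixPath("evals/registry/data/logic_stress"),
    PurePosixPath("evals/registry/evals"),
)

def outside_registry(changed_paths):
    """Return the changed paths that fall outside the allowed directories."""
    offenders = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # A path is allowed if any allowed directory is one of its ancestors.
        if not any(prefix in path.parents for prefix in ALLOWED_PREFIXES):
            offenders.append(raw)
    return offenders

# Hypothetical file names, for illustration only.
example_changes = [
    "evals/registry/data/logic_stress/example_v2.jsonl",
    "evals/registry/evals/example_logic_stress.yaml",
]
print(outside_registry(example_changes))
```

An empty result means every changed path sits under the registry directories, so files such as pyproject.toml are untouched.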

Could this PR be reviewed despite the CI failure, or alternatively could #1648 be prioritized so this PR's CI can re-run cleanly? Happy to rebase once #1648 lands.

Thanks!

Ein-Tim · 2026-05-22T08:11:46Z

@14H034160212 I would refrain from putting any more work into this; this repository seems unmaintained, see #1665.

Add Logic Stress Stress-test Suite (v2, v3)

4bd059a

14H034160212 requested review from andrew-openai, etr2460 and katyhshi as code owners February 16, 2026 05:40

14H034160212 force-pushed the logic-stress-evals branch 2 times, most recently from 543d0b5 to 3570f3b Compare February 17, 2026 08:50

14H034160212 force-pushed the logic-stress-evals branch from 3570f3b to 4bd059a Compare May 22, 2026 05:24

14H034160212 wants to merge 1 commit into openai:main from 14H034160212:logic-stress-evals
