Add Logic Stress Stress-test Suite (v2, v3)#1622
Conversation
543d0b5 to
3570f3b
Compare
…r double-blind submission The abstract footnote previously linked to github.com/14H034160212/lemo and the OpenAI/Evals PR github.com/openai/evals/pull/1622, both of which would reveal the submitting author's GitHub identity to reviewers. Replace with the anonymous mirror at https://anonymous.4open.science/r/lemo-D708/README.md and drop the PR link entirely. The original links can be reinstated under \if@anonymous in the camera-ready / preprint build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3570f3b to
4bd059a
Compare
|
Hi maintainers, A quick note on the failing CI: the This is a pre-existing upstream dependency issue, unrelated to the contents of this PR:
This has been a known issue since November 2024:
This PR itself only adds four new files under Could this PR be reviewed despite the CI failure, or alternatively could #1648 be prioritized so this PR's CI can re-run cleanly? Happy to rebase once #1648 lands. Thanks! |
|
@14H034160212 I would refrain from |
This PR adds two datasets (Variant 2 and Variant 3) from a logical reasoning robustness study. Variant 3 (Contradiction) evaluates 'Logic Inertia' in LLMs, where models often fail by blindly following rules despite explicit contradictions.