Support for reasoning models evaluation

When attempting to evaluate reasoning models with the MERA benchmark, I consistently receive a score of 0.

The root cause appears to be that `generate_until` stops generation prematurely.
If a stop sequence appears within the model's reasoning output or is part of the chat template for reasoning (e.g. "\n" for Qwen3), the generation is terminated before the model can produce its final answer. This prevents the answer from being parsed and correctly evaluated.
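
To illustrate the failure mode, here is a minimal sketch in plain Python. It is not MERA code: the helper names `truncate_at_stop` and `truncate_after_reasoning` are hypothetical, and I am assuming the reasoning is wrapped in `<think>...</think>` tags, as with Qwen3. It only shows the stopping semantics I would expect, applied to an already generated string.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def truncate_after_reasoning(
    text: str,
    stop_sequences: list[str],
    think_end: str = "</think>",  # assumed reasoning delimiter
) -> str:
    """Ignore stop sequences inside the reasoning block (hypothetical helper)."""
    _, sep, tail = text.partition(think_end)
    # If the delimiter is absent, fall back to the whole text.
    answer = tail if sep else text
    # Drop the blank lines that typically follow the closing tag.
    return truncate_at_stop(answer.lstrip(), stop_sequences)


raw = "<think>\nThe question asks for 2 + 2.\nThat is 4.\n</think>\n\n4\nExtra text"

naive = truncate_at_stop(raw, ["\n"])            # stops inside the reasoning
aware = truncate_after_reasoning(raw, ["\n"])    # stops only in the answer
```

With the stop sequence `"\n"`, the naive variant cuts the output right after the opening tag, so there is no answer left to parse, while the reasoning-aware variant returns the final answer. During real generation the same idea would presumably mean deferring the stop criteria until the closing tag has been emitted, rather than post-processing the text.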

Questions:

1. Is there a recommended workaround in the current version of MERA to prevent premature stopping during the reasoning phase of a model's generation?
2. Are there plans to add built-in support for evaluating reasoning models in MERA, so that this issue is resolved without a workaround?

Support for reasoning models evaluation #14