When attempting to evaluate reasoning models with the MERA benchmark, I consistently receive a score of 0.
The root cause appears to be that `generate_until` stops generation prematurely.
If a stop sequence appears in the model's reasoning output, or is part of the chat template for reasoning (e.g. "\n" for Qwen3), generation is terminated before the model can produce its final answer. As a result, the answer cannot be parsed or evaluated correctly.
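To make the failure mode concrete, here is a minimal sketch in plain Python. It does not use MERA or its harness. The helper names `truncate_at_stop` and `truncate_after_reasoning`, the `</think>` closing tag, and the sample output are all assumptions made for illustration. The sketch only works on text that has already been generated in full; in a real run the stop sequence halts generation itself, so the answer is never produced in the first place.

```python
def truncate_at_stop(text: str, stop: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence (naive behaviour)."""
    cut = len(text)
    for s in stop:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def truncate_after_reasoning(text: str, stop: list[str], end_tag: str = "</think>") -> str:
    """Hypothetical helper: ignore stop sequences inside the reasoning block."""
    _head, sep, tail = text.partition(end_tag)
    if not sep:
        # No closing tag found: the reasoning never finished, so there is no answer.
        return ""
    # Apply the stop sequences only to the text that follows the reasoning block.
    return truncate_at_stop(tail.lstrip(), stop)


# Made-up model output in a Qwen3-like format (an assumption, not a real log).
output = "<think>\nThe capital of France is Paris.\n</think>\n\nParis\n"

naive = truncate_at_stop(output, ["\n"])          # cut inside the reasoning block
aware = truncate_after_reasoning(output, ["\n"])  # cut only after the reasoning block

print(repr(naive))
print(repr(aware))
```

With the naive rule, the text is cut at the first newline, which falls right after the opening tag, so nothing parseable remains. Applying the stop sequences only after the reasoning block keeps the final answer.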
Questions:
- Is there a recommended workaround in the current version of MERA to prevent premature stopping during the reasoning phase of a model's generation?
- Are there plans to add built-in support for reasoning models in MERA to resolve this issue?