Support for reasoning models evaluation #14

@anna-marshalova

When attempting to evaluate reasoning models with the MERA benchmark, I consistently receive a score of 0.

The root cause appears to be that generate_until stops generation prematurely.
If a stop sequence appears within the model's reasoning output or is part of the chat template for reasoning (e.g. "\n" for Qwen3), the generation is terminated before the model can produce its final answer. This prevents the answer from being parsed and correctly evaluated.
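
To illustrate the failure mode, here is a minimal sketch in plain Python. It is not MERA's or lm-evaluation-harness's actual code: the helper names `truncate_at_stop` and `truncate_after_reasoning` are hypothetical, and the `</think>` delimiter is an assumption based on the Qwen3 output format. It only simulates stop-sequence handling on an already generated string; a real fix would also need the stop criteria at decoding time to ignore the reasoning block.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def truncate_after_reasoning(
    text: str, stop_sequences: list[str], think_end: str = "</think>"
) -> str:
    """Drop the reasoning block, then apply stop sequences to the answer only.

    `think_end` is an assumed delimiter; adjust it for the model in use.
    """
    _, sep, answer = text.partition(think_end)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        answer = text
    return truncate_at_stop(answer.lstrip(), stop_sequences)


# Hypothetical model output: reasoning block followed by the final answer.
raw = "<think>\nThe sum of 2 and 2 is 4.\n</think>\n\n4\n"

naive = truncate_at_stop(raw, ["\n"])          # stops inside the reasoning block
fixed = truncate_after_reasoning(raw, ["\n"])  # stops only within the answer
```

With "\n" as the stop sequence, `naive` ends right after the opening `<think>` tag, so there is no answer left to parse, while `fixed` contains only the final answer.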

Questions:

  1. Is there a recommended workaround in the current version of MERA to prevent premature stopping during the reasoning phase of a model's generation?
  2. Are there plans to add support for evaluating reasoning models to MERA as a solution to this issue?
