Question Regarding the Definition of "Intermediate Step" in Process-Level Rewards #8

@charlielam0615

Hello,

First, thank you for your great work on the S2R paper. The methodology for teaching models to self-verify and self-correct is very insightful.

I'm writing to ask for clarification on the process-level reward mechanism, because I can't reconcile the paper's description with how I would expect the reward to behave. My confusion centers on how the correctness of an intermediate [solve] action (s_j) is evaluated.

The paper states that the reward for s_j is based on V_golden(s_j), which appears to compare the output of this intermediate step to the overall ground-truth answer of the problem.

This leads to a confusing scenario. For example, in a multi-step math problem, an intermediate step s_j might correctly calculate a value such as "the length of side A is 5", while the final ground-truth answer for the problem is "the area is 25".

If the system compares the intermediate result ("5") to the final ground-truth answer ("25"), it would assign a negative reward to a perfectly correct and necessary intermediate step. This seems counter-intuitive for training the model to learn a valid reasoning process.
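To make the concern concrete, here is a minimal sketch of my reading of the mechanism. It is not taken from the S2R code: the function names `v_golden` and `step_reward`, the exact-match comparison, and the +1/-1 reward values are all assumptions made for illustration.

```python
def v_golden(step_output: str, ground_truth: str) -> bool:
    """Hypothetical verifier: exact string match against the final answer."""
    return step_output.strip() == ground_truth.strip()


def step_reward(step_output: str, ground_truth: str) -> int:
    """Hypothetical reward: +1 if the verifier accepts the step, -1 otherwise."""
    return 1 if v_golden(step_output, ground_truth) else -1


# Example from above: the final ground-truth answer is the area, 25.
ground_truth = "25"

# A correct intermediate step that only computes the side length.
intermediate_reward = step_reward("5", ground_truth)

# The final step that reports the area.
final_reward = step_reward("25", ground_truth)

print(intermediate_reward, final_reward)
```

Under these assumptions the correct intermediate step receives the negative reward and only the final step is rewarded positively, which is the behavior I am asking about.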

Could you please clarify how V_golden(s_j) is implemented for these intermediate steps? Is there a different mechanism for evaluating the correctness of an intermediate calculation that doesn't rely on comparing it to the final answer?

Thank you for your time and for sharing your valuable research.
