
Clarification on Reward Usage in DPO Training #33

@vincezh2000

Description


In the RLHF Workflow paper, the Reward Model is used to annotate new data generated by the LLM during the iterative DPO process, producing scalar values. According to Algorithm 1, the traditional RM+RLHF pipeline incorporates these scalar values into the loss function, so a reward r of 8 versus 80 leads to different training outcomes.
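For context, my understanding is that the traditional pipeline optimizes the KL-regularized objective below (the standard form, e.g., from the DPO paper's background section), where the scalar magnitude of the reward enters the gradient directly:

$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
$$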

However, with the DPO method, the training objective is the DPO loss, which does not explicitly use the reward scalar. It seems the only information used is the preference, i.e., that A is preferred over B. The paper does not give specific details on how this is handled.
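For reference, the DPO loss is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

It depends only on which response is chosen ($y_w$) and which is rejected ($y_l$); the reward values themselves never appear, so any annotation that preserves the ordering yields the same loss.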

My question is: if we train with the iterative DPO method using the ArmoRM model, will it still only use the information about which score is higher, rather than the actual scalar reward values? Is using it just to label preference pairs sufficient to fully utilize the multi-objective RM? A sketch of my understanding follows below.
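To make the question concrete, here is a minimal sketch of how I imagine the annotation step works; `score_fn` is a hypothetical stand-in for the reward model's scoring interface, not the actual ArmoRM API:

```python
# Minimal sketch of the annotation step in iterative DPO, under my
# understanding of the paper. `score_fn` is a hypothetical stand-in for
# the reward model's scoring interface (NOT the actual ArmoRM API).

from typing import Callable, Dict

def label_preference_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    score_fn: Callable[[str, str], float],
) -> Dict[str, str]:
    """Turn two sampled responses into a (chosen, rejected) pair.

    Only the ordering of the two scalar rewards survives; the magnitudes
    are discarded, so r_a=8 vs r_b=4 produces exactly the same training
    pair as r_a=80 vs r_b=4.
    """
    r_a = score_fn(prompt, response_a)
    r_b = score_fn(prompt, response_b)
    chosen, rejected = (
        (response_a, response_b) if r_a >= r_b else (response_b, response_a)
    )
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Toy demonstration: two different reward scales, identical resulting pair.
toy_scores = {"A": 8.0, "B": 4.0}
pair_small = label_preference_pair("q", "A", "B", lambda p, r: toy_scores[r])
pair_large = label_preference_pair("q", "A", "B", lambda p, r: 10 * toy_scores[r])
assert pair_small == pair_large  # magnitudes never reach the DPO loss
```

If this matches what the paper does, then the per-objective scores that ArmoRM aggregates are collapsed into a single binary preference before training, which is exactly what prompted my question.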
