Question regarding ARMO stage2-train code

Thank you very much for open-sourcing ARMO; it is excellent work. I am currently reproducing the stage2-train code. Starting from the data you provided, I made only two modifications: first, I replaced the preference data with [Skywork/Skywork-Reward-Preference-80K-v0.2](https://huggingface.co/datasets/Skywork/Skywork-Reward-Preference-80K-v0.2); second, I replaced the reference data with Skywork/Skywork-Reward-Preference-80K-v0.2 as well. The final training results are shown below. They remain the same even when I adjust the number of training steps or the learning rate, and there is a significant performance gap compared to the model you provided. Do you know what might be causing this?

Also, I obtained the `.pt` file by training with your code. Could you please provide the code for merging it into the full model, so that the model I train has the same structure as the RLHFlow/ArmoRM-Llama3-8B-v0.1 you released? Thank you very much!

```
Evaluating model...
Validation accuracy: 0.8965
Saved gating network to xxx/gating_network_FsfairX-LLaMA3-RM-v0.1_6k1.pt 

Evaluating on RewardBench...

  df_acc = pd.concat([df_acc, pd.DataFrame(row)], ignore_index=True)
RewardBench Scores:
        Chat  Chat Hard     Safety  Reasoning  Score
0  99.162012  64.692981  89.099712  88.235938   85.3

```
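
As a sanity check on the log above: the `Score` column is consistent with a plain unweighted mean of the four section scores. That is only an observation from these numbers, not a statement about how the evaluation script computes it. A minimal sketch (the variable names are illustrative and do not come from the repository):

```python
# Section scores copied from the RewardBench table in the log above.
section_scores = {
    "Chat": 99.162012,
    "Chat Hard": 64.692981,
    "Safety": 89.099712,
    "Reasoning": 88.235938,
}

# Assumption: the overall score is the unweighted mean of the four sections.
overall = sum(section_scores.values()) / len(section_scores)

print(f"Mean of section scores: {overall:.1f}")  # 85.3, matching the log
```

Under that assumption, Chat Hard (about 64.7) is the section pulling the overall score down the most.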

Question regarding ARMO stage2-train code #37
