Question regarding ARMO stage2-train code #37

@RayWang-iat

Thank you very much for open-sourcing ARMO. I am currently trying to reproduce stage2-train. Starting from the data you provided, I made only two changes: I replaced the preference data with Skywork/Skywork-Reward-Preference-80K-v0.2, and I replaced the reference data with Skywork/Skywork-Reward-Preference-80K-v0.2 as well. The final training results are shown below. They stay the same even when I adjust the number of training steps or the learning rate, and there is a significant performance gap compared with the model you released. Do you know what might be causing this?

Also, training with your code gives me a .pt file. Could you please provide the code for merging it, so that the model I train has the same structure as the RLHFlow/ArmoRM-Llama3-8B-v0.1 model you released? Thank you very much!

Evaluating model...
Validation accuracy: 0.8965
Saved gating network to xxx/gating_network_FsfairX-LLaMA3-RM-v0.1_6k1.pt 

Evaluating on RewardBench...

  df_acc = pd.concat([df_acc, pd.DataFrame(row)], ignore_index=True)
RewardBench Scores:
        Chat  Chat Hard     Safety  Reasoning  Score
0  99.162012  64.692981  89.099712  88.235938   85.3
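
As a sanity check on the numbers above, the Score column appears consistent with the unweighted mean of the four category scores. A minimal sketch, assuming that is how the score is aggregated here (the variable names are only illustrative and are not taken from the ARMO code):

```python
# Category scores copied from the RewardBench output above.
scores = {
    "Chat": 99.162012,
    "Chat Hard": 64.692981,
    "Safety": 89.099712,
    "Reasoning": 88.235938,
}

# Assumption: the overall score is the plain (unweighted) mean of the categories.
mean_score = sum(scores.values()) / len(scores)
print(round(mean_score, 1))  # prints 85.3
```

If this holds, most of the shortfall in the overall score comes from the Chat Hard category.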
