Hi again! I was testing the reward model for a project and got confused when I found that the scores are not actually bounded to [0, 1].
Not only are some of the scores negative, but some are also greater than 1.
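(For reference, this is the small check I used to flag the offending scores, applied to the output of the `calculate_rewards` function further down.)

```python
def out_of_range(scores, low=0.0, high=1.0):
    """Return the objectives whose score falls outside [low, high]."""
    return {name: value for name, value in scores.items() if not low <= value <= high}
```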


This is the snippet I used to obtain the scores from ArmoRM. I thought the scores would be bounded to that scale out of the box, but let me know if an additional transformation is needed!
```python
import torch

# The 19 reward objectives, in the order ArmoRM returns them
ATTRIBUTES = [
    "helpsteer-helpfulness", "helpsteer-correctness", "helpsteer-coherence",
    "helpsteer-complexity", "helpsteer-verbosity", "ultrafeedback-overall_score",
    "ultrafeedback-instruction_following", "ultrafeedback-truthfulness",
    "ultrafeedback-honesty", "ultrafeedback-helpfulness", "beavertails-is_safe",
    "prometheus-score", "argilla-overall_quality", "argilla-judge_lm",
    "code-complexity", "code-style", "code-explanation",
    "code-instruction-following", "code-readability",
]


def calculate_rewards(prompt, response, model, tokenizer):
    """Calculates the multi-objective rewards for a (prompt, response) pair."""
    # Format the conversation with the model's chat template
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, return_tensors="pt"
    ).to(model.device)

    # Multi-objective rewards for the response (shape: [1, 19])
    with torch.no_grad():
        output = model(input_ids)
    multi_obj_rewards = output.rewards.cpu().float()

    # Map each objective name to its score
    return dict(zip(ATTRIBUTES, multi_obj_rewards[0].tolist()))
```
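For completeness, this is roughly how I load the model and call the function. The model ID, dtype, and `trust_remote_code` flag are what I took from the ArmoRM model card; the prompt/response pair is just a placeholder:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

scores = calculate_rewards(
    "What is the capital of France?",
    "The capital of France is Paris.",
    model,
    tokenizer,
)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```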
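In case it helps clarify what I mean by an additional transformation: something like the naive clipping below is what I imagined might be required, but that's just a guess on my part rather than anything I found in the docs.

```python
def clip_rewards(scores, low=0.0, high=1.0):
    """My guess at a workaround: clip each objective's score into [low, high].

    This simply discards anything outside the range instead of properly
    calibrating the scores, so I'd rather know the intended fix.
    """
    return {name: min(max(value, low), high) for name, value in scores.items()}
```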